Whenever you use the selectReplicas function, be careful: it may have a bug.
The bug is that it always returns false (unable to find candidates)
when the storage team size is 1. This is wrong: when the storage team size
is 1, selectReplicas should succeed and return an empty result.
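Below is a minimal sketch of the intended behavior, assuming a simplified
signature (the real selectReplicas works on replication policies and
locality sets; ServerId and the parameter names here are hypothetical):

    #include <cassert>
    #include <vector>

    struct ServerId { int id; };

    // Try to pick (teamSize - alreadyChosen) additional servers from
    // candidates. Returns true on success, appending the picks to results.
    bool selectReplicas(int teamSize,
                        const std::vector<ServerId>& alreadyChosen,
                        const std::vector<ServerId>& candidates,
                        std::vector<ServerId>& results) {
        int needed = teamSize - (int)alreadyChosen.size();
        // The fix: a team size of 1 (or any fully satisfied request) needs
        // zero additional replicas, so succeed with an empty result instead
        // of reporting that no candidates were found.
        if (needed <= 0) return true;
        if ((int)candidates.size() < needed) return false;
        for (int i = 0; i < needed; ++i) results.push_back(candidates[i]);
        return true;
    }

    int main() {
        std::vector<ServerId> chosen = { {1} };
        std::vector<ServerId> candidates, results;
        // Storage team size 1: previously returned false; should return
        // true with an empty result.
        assert(selectReplicas(1, chosen, candidates, results) && results.empty());
        return 0;
    }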
When team collection adds new server teams, it picks the server with the
least number of teams. We should count only the healthy teams, because the
unhealthy ones will not be useful (see the sketch below).
Team collection should prioritize building machine teams for the machine
that has the least number of healthy machine teams, rather than the least
number of machine teams overall, because an unhealthy machine team cannot
produce more server teams.
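The selection rule for both the server-team and machine-team cases can be
sketched as follows, with hypothetical stand-in types (the real logic
lives in the team collection code): compare candidates by their healthy
team count only.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Machine {
        int id;
        int healthyMachineTeams; // teams whose members are all healthy
        int totalMachineTeams;   // healthy + unhealthy
    };

    // Pick the machine that most needs a new machine team: count only
    // healthy teams, since an unhealthy machine team cannot produce more
    // server teams.
    const Machine& pickMachineToExtend(const std::vector<Machine>& machines) {
        return *std::min_element(machines.begin(), machines.end(),
                                 [](const Machine& a, const Machine& b) {
                                     return a.healthyMachineTeams < b.healthyMachineTeams;
                                 });
    }

    int main() {
        std::vector<Machine> machines = { {1, 3, 5}, {2, 2, 6}, {3, 4, 4} };
        // Machine 2 has the most teams in total but the fewest healthy
        // ones; counting all teams would have picked machine 3 instead.
        std::printf("pick machine %d\n", pickMachineToExtend(machines).id);
        return 0;
    }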
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams that each server and machine belongs to
DESIRED_TEAMS_PER_SERVER server teams and machine teams, respectively.
This change calculates the number of teams (server teams and machine
teams) that must be built to reach that count for each server and machine.
For example, suppose we have 3 servers with replication factor 3. Only 1
team is possible, but the desired team number is 3 servers times 5 teams
per server, which equals 15.
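A sketch of that arithmetic, using the knob value from the example above
(the variable names and the cap at the number of buildable teams are
illustrative):

    #include <cstdio>

    // Number of distinct teams of size r that n servers can form: C(n, r).
    long long choose(int n, int r) {
        if (r < 0 || r > n) return 0;
        long long result = 1;
        for (int i = 1; i <= r; ++i) result = result * (n - r + i) / i;
        return result;
    }

    int main() {
        const int DESIRED_TEAMS_PER_SERVER = 5; // knob value in the example
        int serverCount = 3;
        int replicationFactor = 3;

        // Desired total: 3 servers * 5 teams per server = 15.
        long long desiredTeams = (long long)DESIRED_TEAMS_PER_SERVER * serverCount;

        // Only C(3, 3) = 1 team is possible, so the builder must stop at 1
        // rather than trying to reach 15.
        long long buildable = choose(serverCount, replicationFactor);
        long long teamsToBuild = desiredTeams < buildable ? desiredTeams : buildable;
        std::printf("desired=%lld buildable=%lld build=%lld\n",
                    desiredTeams, buildable, teamsToBuild);
        return 0;
    }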
Instead of sanity-checking the absolute number of teams per server, we
check the difference between minServerTeamOnServer and
maxServerTeamOnServer.
Add a simulation test that makes sure the number of server teams per
server is no less than DESIRED_TEAMS_PER_SERVER defined in knobs and no
larger than MAX_TEAMS_PER_SERVER.
Add a similar test for the number of machine teams per machine as well.
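A sketch of both checks; the knob values, the MAX_TEAM_SPREAD bound, and
the relation MAX_TEAMS_PER_SERVER = 5 * DESIRED_TEAMS_PER_SERVER are
illustrative assumptions, not necessarily the shipped defaults:

    #include <algorithm>
    #include <cassert>
    #include <vector>

    const int DESIRED_TEAMS_PER_SERVER = 5;
    const int MAX_TEAMS_PER_SERVER = 5 * DESIRED_TEAMS_PER_SERVER;
    const int MAX_TEAM_SPREAD = DESIRED_TEAMS_PER_SERVER; // hypothetical bound

    int main() {
        // Server team count per server, as the test would gather it.
        std::vector<int> teamsPerServer = {5, 6, 5, 7};

        int minServerTeamOnServer =
            *std::min_element(teamsPerServer.begin(), teamsPerServer.end());
        int maxServerTeamOnServer =
            *std::max_element(teamsPerServer.begin(), teamsPerServer.end());

        // Sanity check the spread, not the absolute count on any server.
        assert(maxServerTeamOnServer - minServerTeamOnServer <= MAX_TEAM_SPREAD);

        // Simulation test bounds: every server ends up with at least the
        // desired number of teams and no more than the maximum allowed.
        assert(minServerTeamOnServer >= DESIRED_TEAMS_PER_SERVER);
        assert(maxServerTeamOnServer <= MAX_TEAMS_PER_SERVER);
        return 0;
    }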
Since Ratekeeper and DataDistributor no longer run with the Master, they
might be placed on stateful processes before a new Master becomes alive,
which is undesirable.
This PR adds monitoring of both Ratekeeper and DataDistributor at the
Cluster Controller: if the Master runs on a stateless class and RK/DD run
at a worse class, then RK/DD will be killed. I.e., RK/DD should run on
their own classes or on the same stateless process as the Master. After a
restart, RK/DD should be running on a better process class.
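A simplified sketch of the kill condition, assuming a toy fitness
ordering (FDB's ProcessClass has a richer Fitness model; the names below
are illustrative):

    #include <cstdio>

    enum Fitness { BestFit = 0, GoodFit = 1, OkayFit = 2, WorstFit = 3 };

    // If the Master runs on a stateless-class process and RK/DD run at a
    // worse fitness, kill RK/DD so they restart on a better process class
    // (their own class, or the same stateless class as the Master).
    bool shouldKillRole(Fitness masterFitness, Fitness roleFitness,
                        bool masterOnStatelessClass) {
        return masterOnStatelessClass && roleFitness > masterFitness;
    }

    int main() {
        // DD sits on a worse-fit (stateful) process while the Master is on
        // a stateless class, so DD should be killed and restarted.
        std::printf("kill DD: %d\n", shouldKillRole(GoodFit, WorstFit, true));
        return 0;
    }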
Add a flag in HealthMetrics to indicate that batch priority is rate
limited. The data distributor pulls this flag from the proxy to know
roughly when rate limiting happens.
DD uses this information to decide when to rebalance in the background,
i.e., move data from heavily loaded servers to lightly loaded ones. If the
cluster is currently rate limited for batch commits, the rebalance uses a
longer polling interval; otherwise it uses a shorter one. See
BgDDMountainChopper() and BgDDValleyFiller() in
DataDistributionQueue.actor.cpp.
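A sketch of the interval choice; the batchLimited field name mirrors the
description above, and the interval values are illustrative rather than
actual knob values:

    #include <cstdio>

    struct HealthMetrics {
        bool batchLimited = false; // set when batch priority is rate limited
    };

    // Background rebalance (BgDDMountainChopper / BgDDValleyFiller) polls
    // more slowly while batch commits are rate limited, faster otherwise.
    double rebalancePollingInterval(const HealthMetrics& hm) {
        const double longInterval = 10.0;  // seconds, illustrative
        const double shortInterval = 1.0;  // seconds, illustrative
        return hm.batchLimited ? longInterval : shortInterval;
    }

    int main() {
        HealthMetrics hm;
        hm.batchLimited = true;
        std::printf("poll every %.1f s\n", rebalancePollingInterval(hm));
        return 0;
    }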
Add a new role for Ratekeeper.
Remove StorageServerChanges from data distribution.
Ratekeeper now monitors storage servers itself, borrowing the idea from
DataDistribution.
After we add the new data distributor role, the data related to the data
distributor and ratekeeper is published through the new role (and new
worker). So status needs to contact the data distributor, instead of the
master, to get this status information.
In addTeam(), to determine whether a team is a badTeam, we should check
redundantTeam before checking satisfiesPolicy, because if a team is a
redundantTeam, it was already removed from the system before addTeam() was
called. The only reason we call addTeam() for a removed redundantTeam is
to kick off the badTeam cleanup logic.
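A sketch of the order of checks, using stand-in types for the real
addTeam() context:

    #include <cstdio>

    struct Team {
        bool redundantTeam;    // team was already removed as redundant
        bool satisfiesPolicy;  // team meets the replication policy
    };

    bool isBadTeam(const Team& team) {
        // Check redundantTeam first: a redundant team has already been
        // removed from the system, and addTeam() is only called on it to
        // kick off the badTeam cleanup logic, regardless of whether it
        // still satisfies the policy.
        if (team.redundantTeam) return true;
        return !team.satisfiesPolicy;
    }

    int main() {
        Team removed{true, true};
        // A removed redundant team is bad even though it satisfies the
        // policy; checking satisfiesPolicy first would miss it.
        std::printf("bad: %d\n", isBadTeam(removed));
        return 0;
    }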
When we remove a machine team in the teamRemover function, we should
always be able to find that machine team in the global machineTeams.
Change the ASSERT to enforce this invariant.
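A sketch of the strengthened invariant, with illustrative types (the real
code uses FDB's ASSERT macro rather than assert):

    #include <algorithm>
    #include <cassert>
    #include <vector>

    struct MachineTeam { int id; };

    std::vector<MachineTeam> machineTeams = { {1}, {2}, {3} };

    void removeMachineTeam(int teamId) {
        auto it = std::find_if(machineTeams.begin(), machineTeams.end(),
                               [teamId](const MachineTeam& t) { return t.id == teamId; });
        // Invariant: any machine team handed to teamRemover must still be
        // present in the global machineTeams, so the lookup always succeeds.
        assert(it != machineTeams.end());
        machineTeams.erase(it);
    }

    int main() {
        removeMachineTeam(2);
        assert(machineTeams.size() == 2);
        return 0;
    }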