Before the serverTeamRemover tries to pick a team to remove,
it waits for all data movement to finish, which implies all teams are healthy.
So when the serverTeamRemover starts picking a team to remove,
we believe all servers are healthy.
A storage server is not desired to be colocated with tLogs,
so we want to mark such a server as undesired.
However, if there are not enough processes in the system,
we have no choice but to colocate them.
The old logic marked the server as undesired if optimalTeamCount > 0.
However, there is a rare case where optimalTeamCount is 1 when it should be 0.
To handle this, we add another condition, healthyTeamCount > 0,
as a guard before marking such a colocated server as undesired.
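A minimal sketch of the guarded check, where the counter names follow the text and everything else is hypothetical:

    // Sketch only: optimalTeamCount alone can transiently read 1 when it
    // should be 0, so healthyTeamCount > 0 is required as an extra guard.
    bool markColocatedServerUndesired(int optimalTeamCount, int healthyTeamCount) {
        return optimalTeamCount > 0 && healthyTeamCount > 0;
    }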
When a teamTracker is cancelled, e.g., by the redundant teamRemover or badTeamRemover,
we should decrease optimalTeamCount if the team is considered optimal,
i.e., every member's machine fitness is no worse than unset and
the team is healthy.
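A minimal sketch of the cleanup on cancellation; optimalTeamCount follows the text, the surrounding names are hypothetical:

    // Sketch only: run when the teamTracker for a team is cancelled.
    void onTeamTrackerCancelled(DDTeamCollection* self, bool wasOptimal) {
        // wasOptimal: all members' machine fitness was no worse than unset,
        // and the team was healthy.
        if (wasOptimal) {
            self->optimalTeamCount--;
            ASSERT(self->optimalTeamCount >= 0);
        }
    }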
Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
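A minimal sketch of the wait-and-recheck loop; the function name follows the text, the signature and helper are hypothetical:

    // Sketch only: re-check a few times so serverTeamRemover has time to act.
    ACTOR Future<bool> getTeamCollectionValid(Database cx) {
        state int retries = 0;
        loop {
            bool valid = wait(checkServerTeamCount(cx)); // hypothetical: teams <= desired?
            if (valid) return true;
            if (++retries > 10) return false; // still too many teams after waiting
            wait(delay(10.0));
        }
    }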
To pick a team to remove: for each team, take the minimum number of teams any of its servers belongs to, and remove the team for which this minimum is largest.
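A minimal sketch of that max-of-min selection, with hypothetical container and member names:

    // Sketch only: pick the team whose least-loaded member still belongs
    // to the most teams; removing it hurts team coverage the least.
    TCTeamInfo* pickTeamToRemove(const std::vector<TCTeamInfo*>& teams) {
        TCTeamInfo* pick = nullptr;
        int bestMinTeams = -1;
        for (auto* team : teams) {
            int minTeams = std::numeric_limits<int>::max();
            for (auto& server : team->servers)                    // hypothetical member
                minTeams = std::min(minTeams, server->teamCount); // hypothetical member
            if (minTeams > bestMinTeams) {
                bestMinTeams = minTeams;
                pick = team;
            }
        }
        return pick; // every member of this team is on at least bestMinTeams teams
    }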
AddTeamsBestOf should keep building teams until each server has at least the
target number of teams.
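A minimal sketch of the build loop around the real addTeamsBestOf(); the helper and signature are hypothetical:

    // Sketch only: keep building until every server is on at least
    // targetTeamsPerServer teams, or no more valid teams can be formed.
    while (minTeamsOnAnyServer(self) < targetTeamsPerServer) {
        int added = addTeamsBestOf(self, teamsToBuildPerPass);
        if (added == 0) break;
    }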
A redundant team removed by teamRemover no longer exists
in the global teams data structure, so we will not find it
through the shard-to-team mapping in the system keyspace.
Before this change, teamTracker marked such a team as PRIORITY_TEAM_UNHEALTHY;
with this change, it marks the team as PRIORITY_TEAM_REDUNDANT.
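A minimal sketch of the priority decision; PRIORITY_TEAM_UNHEALTHY and PRIORITY_TEAM_REDUNDANT follow the text, the rest is hypothetical:

    // Sketch only: a team missing from the global teams set was removed as
    // redundant, so report it as redundant rather than unhealthy.
    int teamPriority(bool inGlobalTeams, bool healthy) {
        if (!inGlobalTeams) return PRIORITY_TEAM_REDUNDANT;
        return healthy ? PRIORITY_TEAM_HEALTHY : PRIORITY_TEAM_UNHEALTHY;
    }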
We build more teams than we finally want so that the serverTeamRemover() actor can remove the teams
whose members belong to too many teams. This gives us a more balanced number of teams per server.
This is because the consistency check converts the value to int64_t:
if no server exists, the variable is never updated and thus overflows
when it is converted to int64_t.
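A minimal sketch of the failure mode, with hypothetical names:

    // Sketch only: if 'servers' is empty, minTeams keeps its sentinel initial
    // value, which overflows when the consistency check casts it to int64_t.
    double minTeams = std::numeric_limits<double>::max();
    for (auto& s : servers)
        minTeams = std::min(minTeams, (double)s.teamCount);
    int64_t checked = (int64_t)minTeams; // overflows when servers is empty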
If a team is removed from DD, it will be marked as failed and eventually removed from the
global teams data structure.
Team healthiness is likely to be a temporary state that can change rather quickly.
There are cases where traceTeamCollectionInfo was called twice within the same execution block, i.e.,
with no wait between the two calls.
Because simulation uses the same time for all instructions in the same execution block,
having more than one traceTeamCollectionInfo event at the same time breaks the trackLatest semantics.
When the simulator always chooses one of them, the simulation test reports a false positive.
Changing the function to an actor and adding a small delay inside it solves the problem.
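A minimal sketch of the actor version; delay() and trackLatest() are flow's real facilities, the event details and members are hypothetical:

    // Sketch only: the small delay gives each call its own simulated time,
    // so trackLatest always has a well-defined latest event.
    ACTOR Future<Void> traceTeamCollectionInfo(DDTeamCollection* self) {
        wait(delay(0.01));
        TraceEvent("TeamCollectionInfo", self->distributorId)
            .detail("Teams", self->teams.size())
            .trackLatest("TeamCollectionInfo");
        return Void();
    }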
Be careful whenever you use the selectReplicas function: it has a bug
where it always returns false (unable to find candidates)
when the storage team size is 1. This is wrong; when the storage team size
is 1, selectReplicas should succeed and return an empty result.
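A minimal sketch of the intended behavior at a call site; the call shape of selectReplicas() is approximate and the surrounding names are hypothetical:

    // Sketch only: with a storage team size of 1 there are no additional
    // replicas to select, so success with an empty result is correct.
    std::vector<LocalityEntry> results;
    bool ok = policy->selectReplicas(fromServers, alsoServers, results);
    if (storageTeamSize == 1) {
        ASSERT(ok && results.empty()); // buggy versions returned false here
    }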
When team collection adds new server teams, it starts from the server
with the least number of teams. We should count only the healthy teams,
because the unhealthy ones will not be useful.
Team collection should prioritize building machine teams for the machine
that has the least number of healthy machine teams, rather than the least
number of machine teams overall, because an unhealthy machine team cannot
produce more server teams.
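A minimal sketch of the machine selection, with hypothetical names; the same idea applies to picking servers by healthy server teams:

    // Sketch only: count only healthy machine teams when choosing which
    // machine to build the next machine team for.
    TCMachineInfo* pickLeastUsedMachine(const std::vector<TCMachineInfo*>& machines) {
        TCMachineInfo* pick = nullptr;
        int fewest = std::numeric_limits<int>::max();
        for (auto* m : machines) {
            int healthyTeams = 0;
            for (auto& mt : m->machineTeams)  // hypothetical member
                if (isMachineTeamHealthy(mt)) // hypothetical helper
                    healthyTeams++;
            if (healthyTeams < fewest) {
                fewest = healthyTeams;
                pick = m;
            }
        }
        return pick;
    }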
When team collection (TC) builds server teams and machine teams,
it needs to build enough of them that each server and each machine has
DESIRED_TEAMS_PER_SERVER server teams and machine teams, respectively.
This change calculates the number of teams (server teams and machine teams)
needed to reach that target for every server and machine.
For example, with 3 servers and replica factor 3, we can form only 1 distinct team,
but the desired team number is 3 * 5 = 15 (with DESIRED_TEAMS_PER_SERVER = 5).
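A minimal sketch of the calculation; DESIRED_TEAMS_PER_SERVER and MAX_TEAMS_PER_SERVER are the knobs named in the text, the bookkeeping variables are hypothetical:

    // Sketch only: the desired number of teams scales with the number of
    // servers; with 3 servers and DESIRED_TEAMS_PER_SERVER = 5, it is 15,
    // even though only one distinct team of size 3 exists.
    int desiredTeams = serverCount * SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER;
    int maxTeams = serverCount * SERVER_KNOBS->MAX_TEAMS_PER_SERVER;
    int teamsToBuild = std::max(0, std::min(desiredTeams, maxTeams) - currentTeamCount);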
Instead of sanity checking the absolute team number per server, we check
the difference between the minServerTeamOnServer and maxServerTeamOnServer.
Add a simulation test that makes sure the number of server teams
per server is no less than DESIRED_TEAMS_PER_SERVER defined
in the knobs and no larger than MAX_TEAMS_PER_SERVER.
Add a similar test for the number of machine teams per machine.
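A minimal sketch of the bound checks, with hypothetical counters around those knobs:

    // Sketch only: per-server bounds checked by the simulation test.
    for (auto& server : servers) {
        ASSERT(server.teamCount >= SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER);
        ASSERT(server.teamCount <= SERVER_KNOBS->MAX_TEAMS_PER_SERVER);
    }
    // The same bounds apply per machine for machine teams.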
Since Ratekeeper and DataDistributor no longer run with the Master, they
might run on stateful processes before a new Master becomes alive,
which is undesirable.
This PR adds monitoring of both Ratekeeper and DataDistributor at the Cluster
Controller -- if the Master runs on a stateless class and RK/DD run on a worse
class, then RK/DD will be killed. I.e., RK/DD should run either on their own
classes or on the same stateless process as the Master. After the restart,
RK/DD should be running on a better process class.
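A minimal sketch of the Cluster Controller check; flow's ProcessClass::Fitness orders smaller values as better fits, and the rest is hypothetical:

    // Sketch only: if the Master sits on a stateless process and RK/DD fit
    // strictly worse, kill RK/DD so they restart on a better process.
    bool shouldRestartRKDD(bool masterOnStateless,
                           ProcessClass::Fitness masterFit,
                           ProcessClass::Fitness rkddFit) {
        return masterOnStateless && rkddFit > masterFit;
    }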
Add a flag in HealthMetrics to indicate that batch priority is rate limited.
The data distributor pulls this flag from the proxy to know roughly when rate
limiting happens.
DD uses this information to decide when to do background rebalancing,
i.e., moving data from heavily loaded servers to lighter ones. If the cluster is
currently rate limited for batch commits, the rebalance uses longer time
intervals; otherwise it uses shorter ones. See BgDDMountainChopper() and
BgDDValleyFiller() in DataDistributionQueue.actor.cpp.
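A minimal sketch of the interval choice inside a background rebalancer; batchLimited is the flag described above, the knob names are hypothetical:

    // Sketch only: back off background data movement while batch-priority
    // traffic is already being rate limited.
    loop {
        double interval = self->healthMetrics.batchLimited
                              ? SERVER_KNOBS->BG_REBALANCE_SLOW_INTERVAL  // hypothetical knob
                              : SERVER_KNOBS->BG_REBALANCE_FAST_INTERVAL; // hypothetical knob
        wait(delay(interval));
        // ... pick a shard on a heavily loaded team and move it to a lighter one ...
    }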
Add a new role for Ratekeeper, and remove StorageServerChanges from data
distribution. Ratekeeper now monitors storage servers itself, borrowing the
idea from DataDistribution.
After we add the new data distributor role, the data related to the data
distributor and ratekeeper is published through the new role (and new worker).
So status needs to contact the data distributor, instead of the master,
to get this status information.