foundationdb

Commit Graph

Author	SHA1	Message	Date
Meng Xu	24047defff	StorageEngineSwitch:Remove and rename server variables Remove unused toRemove variable from server_info, Rename wrongStoreTypeRemoved to wrongStoreTypeToRemove.	2019-08-13 15:44:48 -07:00
Meng Xu	e6284684f0	StorageEngineSwitch:Always remove wrong storeType SS In the old logic of switching storage engines, it marks a storage server with wrong store type as undesired even though this can lead to no healthy team. In the first version of the new storage engine switch, we mimic the same logic of the old version.	2019-08-13 14:59:46 -07:00
Meng Xu	b216cd2516	StorageEngineSwitch:Use AsyncVar to signal server to remove Trigger does not have an effect if the receiver is not waiting on the trigger. To ensure the wrong store type server that is selected to be removed is removed, we should use an AsyncVar<bool> to trigger the storage tracker.	2019-08-12 18:14:05 -07:00
Meng Xu	a588710376	StorageEngineSwitch:Graceful switch When fdbcli changes storeType for storage engines, we switch the store type of storage servers one by one gracefully. This avoids recruiting multiple storage servers on the same process, which can cause an OOM error.	2019-08-12 17:37:52 -07:00
Evan Tschannen	ba54508c47	code cleanup	2019-08-06 16:30:30 -07:00
Evan Tschannen	5dc4c80d44	fix: the machineAttrition workload did not ensure that healthyZone was always cleared fix: an assert could trigger spuriously	2019-08-05 15:00:17 -07:00
Evan Tschannen	7d7aa27c2d	Merge pull request #1814 from dongxinEric/feature/1508/finer-grained-dd-controls Added finer grained controls to DataDistribution in fdbcli.	2019-07-31 17:36:20 -07:00
Evan Tschannen	bba01c6531	fix: add subsetOfEmergencyTeam could add an unsorted team	2019-07-31 16:02:08 -07:00
Xin Dong	b653ddb30d	Final clean ups after rebasing master	2019-07-30 22:35:34 -07:00
Xin Dong	5d20364423	Address review comments	2019-07-30 22:24:30 -07:00
Xin Dong	1922c39377	Resolve review comments. 100K run shows one suspicious ASSERT_WE_THINK failure which I think could be a race.	2019-07-30 22:24:30 -07:00
Xin Dong	c6e5472d8d	Apply suggestions from code review Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2019-07-30 22:20:45 -07:00
Xin Dong	f5d6e3a5b3	- Addressed review comments - Added test for the storage server failure disable switch	2019-07-30 22:20:45 -07:00
Xin Dong	ae11efcb0a	Made the following changes: - Make sure the disabled data distribution won't be accidentally enabled by the 'maintenance' command - Make sure the status json reflects the status of DD accordingly - Make sure the CLI can play with the new DD states correctly, i.e. print out warnings when necessary	2019-07-30 22:20:45 -07:00
Xin Dong	4ecfc9830f	Added finer grained controls to DataDistribution in fdbcli. What's happening under the hood is: - Use pre-existing 'healthyZone' key and write a special value to it in order to disable DD for all storage server failures - Use a new system key 'rebalanceDDIgnored' to disable/enable DD for all rebalance reasons (MountainChopper and ValleyFiller) Kicked off two 200K correctness runs, which showed no related errors.	2019-07-30 22:17:21 -07:00
Evan Tschannen	dd4ab63d90	fixed another bad trace event name	2019-07-30 19:36:26 -07:00
Evan Tschannen	b8cd51c4d3	fixed invalid trace event name	2019-07-30 19:23:54 -07:00
Evan Tschannen	a78a97f186	Merge pull request #1908 from etschannen/feature-better-dd A few data distribution improvements	2019-07-30 17:34:50 -07:00
sramamoorthy	a88aaa0f04	review comment	2019-07-30 17:04:51 -07:00
sramamoorthy	63941e0d96	disable DD with an in-memory flag and use in snapv2	2019-07-30 17:04:51 -07:00
Evan Tschannen	5dd9043fd3	addressed review comments	2019-07-30 17:04:41 -07:00
Evan Tschannen	481642fbd4	Merge branch 'master' into feature-better-dd	2019-07-30 16:56:27 -07:00
Evan Tschannen	a3fe3d4324	Merge pull request #1923 from xumengpanda/mengxu/evan-dd-improvement-minor-improvement DD:Change condition for lastBuildTeamsFailed	2019-07-30 16:54:42 -07:00
A.J. Beamon	14648e20f9	Merge pull request #1901 from ajbeamon/data-distribution-receives-bytes-input-rate Send bytes input rate to data distribution	2019-07-30 15:01:36 -07:00
Meng Xu	0e50656c7f	DD:Change condition for lastBuildTeamsFailed Change the threshold team number per server that should set lastBuildTeamsFailed from DESIRED_TEAMS_PER_SERVER to (SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER * (configuration.storageTeamSize + 1)) / 2;	2019-07-30 11:07:02 -07:00
Evan Tschannen	a0f26b604c	Merge pull request #1907 from etschannen/master A number of bug fixes for rare problems found by correctness testing	2019-07-29 21:04:38 -07:00
sramamoorthy	5a56f6b456	minor snap create client improvement and bug fixes	2019-07-29 20:28:22 -07:00
Evan Tschannen	cc4481b71a	team builders prefer to make teams which overlap less with existing teams	2019-07-28 23:44:23 -07:00
Evan Tschannen	7e97bd181a	fix: we need to build teams when a server becomes healthy if it is possible another server does not have enough teams	2019-07-28 19:31:21 -07:00
Evan Tschannen	04dd293af0	Merge pull request #1874 from xumengpanda/mengxu/DD-code-read DataDistribution:Add comments to help understand the code	2019-07-26 13:30:44 -07:00
Evan Tschannen	2123fa1c3a	Merge pull request #1853 from xumengpanda/mengxu/redundantTeamRemoverPriority-PR Lower the RelocateShard priority for removing redundant teams	2019-07-26 13:28:42 -07:00
A.J. Beamon	b91795d288	Send bytes input rate to DD.	2019-07-25 16:27:32 -07:00
senthil-ram	edeec8a622	Update fdbserver/DataDistribution.actor.cpp Co-Authored-By: Alex Miller <35046903+alexmiller-apple@users.noreply.github.com>	2019-07-24 15:36:28 -07:00
sramamoorthy	a65c9f92ed	get rid of all timeouts and other changes	2019-07-24 15:36:28 -07:00
sramamoorthy	a2f2ad96ff	code review comments and merge to master changes	2019-07-24 15:36:28 -07:00
sramamoorthy	4f2bb561de	snapshot only local tlogs and not the satellite	2019-07-24 15:36:28 -07:00
sramamoorthy	021c949801	increase snaptime out to 15s for simulator	2019-07-24 15:36:28 -07:00
sramamoorthy	869f77aef1	Few cosmetic edits and fixes	2019-07-24 15:36:28 -07:00
sramamoorthy	ddd4523816	bug fix in timeout & header file re-arrange in DD	2019-07-24 15:36:28 -07:00
sramamoorthy	31c010b393	few minor fixes	2019-07-24 15:36:28 -07:00
sramamoorthy	62c14dae72	disable dd during snap and enable in restore	2019-07-24 15:36:28 -07:00
sramamoorthy	ba6bccce73	snap v2: DD changes - snapshot orchestration logic	2019-07-24 15:36:28 -07:00
Meng Xu	b7478f5dd3	DD:Add comments to help understand code Add comments to explain the functionalities of some code.	2019-07-22 11:23:16 -07:00
Meng Xu	378db79441	Resolve conflict when merge with master	2019-07-22 10:56:20 -07:00
Meng Xu	dae4436a3d	TC:UnitTest:Change invariant due to alg change	2019-07-20 21:06:54 -07:00
Meng Xu	b001a9ebe8	ServerTeamRemover runs after machineTeamRemover finishes If serverTeamRemover removes a team before machineTeamRemover brings the machine team number down to the desired number, DD may create a new team (due to teams removed by serverTeamRemover), which may be removed later by machineTeamRemover. This causes unnecessary extra data movement.	2019-07-19 16:48:52 -07:00
Meng Xu	64bee63dbc	Resolve two review comments 1) No need to check server with only one team when teamRemover finds a server team or machine team to remove 2) Fix optimalTeamCount counting in teamTracker	2019-07-18 18:46:31 -07:00
Meng Xu	915732ce24	TeamRemover:Reset the removed team counter after removement	2019-07-16 11:17:51 -07:00
Meng Xu	20f067e794	Merge with master:Resolve conflict with PR#1797	2019-07-16 10:52:28 -07:00
Meng Xu	243504b125	DD:Clang format changes	2019-07-15 18:40:14 -07:00
Meng Xu	94e9b8a3b4	Do not remove a team whose min team number is less than target If the minimum number of teams of servers in a team is less than the target value (desired_team_number_per_server * (teamSize + 1) / 2), the team remover should not remove it. Otherwise, DD will oscillate in building more teams and removing redundant teams. Do not do consistency check for three_data_hall mode because when machines are not evenly distributed across data halls, we will need to build more teams than the total desired number to make sure the number of teams per server is no less than the target value.	2019-07-15 18:30:13 -07:00
Meng Xu	cafe9b9412	TC:Target team num per server is desired number Do not overbuild teams because we may oscillate between building more teams and removing the redundant teams. The oscillation happens when the machines are not evenly distributed across availability zones. For example, in three_data_hall mode, we have 1 machine in 1 data hall for 2 data halls. We have 3 machines in the 3rd data hall. To build enough (and more teams) for servers in the 3rd data hall, we will overbuild teams. However, the teamRemover will remove those newly built teams.	2019-07-15 17:32:51 -07:00
Meng Xu	415622f465	MachineTeamRemover:Change to remove MT with most teams Change to remove machine team with most machine teams, using the same logic as the serverTeamRemover. The feature is guarded by TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.	2019-07-15 14:29:49 -07:00
Meng Xu	5c5e883745	TC:Keep building until each server and machine has at least the expected number of teams	2019-07-12 19:16:18 -07:00
Meng Xu	8454d74da9	TC:Change remainingTeamBudget to ensure each server has more than desired team number	2019-07-12 18:39:01 -07:00
Meng Xu	1c0daa7f2c	Resolve review comments:Remove unneeded code	2019-07-12 18:10:04 -07:00
Meng Xu	aa19da6977	TC:TraceAllInfo:Remove unused variable Also change some code format in self review	2019-07-12 10:41:05 -07:00
Meng Xu	4da2071b49	ServerTeamRemover:Believe all servers are healthy when we start to remove Before the serverTeamRemover tries to pick a team to remove, it waits for all data movement to finish, which means all teams are healthy. When the serverTeamRemover starts to pick a team to remove, we believe all servers are healthy.	2019-07-11 23:47:31 -07:00
Meng Xu	cf935ff9e6	Remove debug message and format code	2019-07-11 22:05:20 -07:00
Meng Xu	bb758c18ee	ServerTracker:Not always mark server undesired when no healthy team exists A storage server is not desired to be colocated with tLogs. So we want to mark the server as undesired. However, if there are not enough processes in the system, we will have no choice but to do so. The old logic makes the server undesired if optimalTeamCount > 0. However, there is a rare case when optimalTeamCount is 1 when it is supposed to be 0. To overcome the situation, we add another condition healthyTeamCount > 0 as a guard to mark such a colocated server undesired.	2019-07-11 17:36:57 -07:00
Meng Xu	221e6945db	TeamTracker:Fix bug in counting optimalTeamCount When a teamTracker is cancelled, e.g, by redundant teamRemover or badTeamRemover, we should decrease the optimalTeamCount if the team is considered as an optimal team, i.e., all members' machine fitness is no worse than unset, and the team is healthy.	2019-07-11 17:22:41 -07:00
Meng Xu	c6e42d6119	ReplicationPolicy:Add trace for the name of each keyIndex	2019-07-10 19:29:29 -07:00
Meng Xu	4fae510633	AddBestMachineTeams:BugFix:Must build team when it has remainingMachineTeamBudget	2019-07-10 11:55:06 -07:00
Meng Xu	9816fb6aca	ConsistencyCheck:Check minServerTeamOnServer larger than 0	2019-07-10 11:53:47 -07:00
Meng Xu	aa459a2b03	AddTeamsBestOf:Calculate minTeamNumPerServer before use it	2019-07-09 14:28:39 -07:00
Meng Xu	522230f050	ConsistencyCheck:getTeamCollectionValid tries 10 times before return false Because serverTeamRemover takes time to remove teams, getTeamCollectionValid() needs to wait for a while before concluding that the number of server teams is larger than the desired number.	2019-07-09 11:46:57 -07:00
Meng Xu	cf03b274a2	TeamTracker:Add traceTeamCollectionInfo	2019-07-08 23:01:25 -07:00
Meng Xu	bf8af985b9	ServerTeamRemover: Change unit test to include the remover Also further speed up serverTeamRemover in simulation, and Add comments	2019-07-08 20:12:16 -07:00
Meng Xu	3b9618fe11	ServerTeamRemover:Speedup removing teams in simulation Otherwise, simulation may time out when team remover needs to remove hundreds of teams.	2019-07-08 18:17:21 -07:00
Meng Xu	08d76a7bbe	ServerTeamRemover:Bug fix and clang-format	2019-07-08 17:08:32 -07:00
Meng Xu	9cc11e88c5	TeamBuilder:Reduce unnecessary calculation of remainingTeamBudget	2019-07-08 16:56:06 -07:00
Meng Xu	874539149a	ServerTeamRemover: Resolve review comments Pick the team whose minimum team number of a server is the largest one to remove. AddTeamsBestOf should keep building teams until each server has at least the target number of teams.	2019-07-08 16:40:37 -07:00
Meng Xu	08a721b320	Merge branch 'master' into mengxu/server-team-remover-PR	2019-07-08 16:30:32 -07:00
A.J. Beamon	0a5c7608df	Remove "Number" suffix from newly added events (and variables that feed the events).	2019-07-08 15:45:28 -07:00
A.J. Beamon	f52c239ef8	Merge branch 'master' into trace-event-rename # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/QuietDatabase.actor.cpp	2019-07-08 15:37:00 -07:00
Evan Tschannen	ec11ef024b	Merge pull request #1798 from ajbeamon/merge-release-6.1-into-master Merge release 6.1 into master	2019-07-08 09:02:56 -07:00
A.J. Beamon	dd85edb08c	Merge pull request #1802 from xumengpanda/mengxu/DD-ensure-redundant-team-priority-as700-PR TeamTracker:Set redundant team priority as PRIORITY_TEAM_REDUNDANT	2019-07-08 08:47:28 -07:00
Jingyu Zhou	50e7593c5b	Merge pull request #1796 from ajbeamon/remove-trace-event-underscores Remove trace event underscores	2019-07-05 21:45:55 -07:00
Meng Xu	e8fb7564f5	Merge branch 'master' into mengxu/DD-ensure-redundant-team-priority-as700-PR	2019-07-05 17:28:12 -07:00
Meng Xu	c7a996267c	TeamRemover: Remove unused declaration Also change state variable to variable.	2019-07-05 16:54:06 -07:00
Meng Xu	46d28a3b79	TeamTracker:Set redundant team priority as redundant The redundant team removed by teamRemover will not exist in the global teams data structure. So we will not find the redundant team from shard-to-team mapping in the system key. Before this change, teamTracker marks such team as PRIORITY_TEAM_UNHEALTHY. With this change, it marks it as PRIORITY_TEAM_REDUNDANT	2019-07-05 15:24:00 -07:00
A.J. Beamon	2a56e011ea	Merge branch 'release-6.1' into merge-release-6.1-into-master # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/DataDistribution.actor.cpp	2019-07-05 13:52:29 -07:00
Meng Xu	7ba6cd2d9d	ServerTeamRemover:Reduce the overshot server team number to build Each server has the maximum of DESIRED_TEAMS_PER_SERVER and (DESIRED_TEAMS_PER_SERVER * storageTeamSize) / 2	2019-07-05 11:01:50 -07:00
A.J. Beamon	2a709ee5d0	Rename event details that use the suffix "Number" to indicate a count, as number could also imply an index. Rename a few other trace events and details that e.g. needed to be pluralized.	2019-07-05 08:54:21 -07:00
A.J. Beamon	a3ac9c7eea	Remove underscores from some trace event names	2019-07-05 08:08:29 -07:00
Meng Xu	2782d432ac	ServerTeamRemover:Update the desired number and pick unhealthy teams first	2019-07-02 22:17:53 -07:00
Meng Xu	599fcb2e6d	Add serverTeamRemover to remove redundant server teams	2019-07-02 17:40:37 -07:00
Meng Xu	7461c87ae6	AddTeamsBestOf: Build more teams than desired We build more teams than we finally want so that we can use serverTeamRemover() actor to remove the teams whose member belong to too many teams. This allows us to get a more balanced number of teams per server.	2019-07-02 17:40:37 -07:00
Evan Tschannen	86b0224347	Merge branch 'release-6.1' of github.com:apple/foundationdb into release-6.1	2019-07-02 16:27:31 -07:00
Evan Tschannen	64e33bb4f9	added logging for maintenance mode	2019-07-02 16:25:29 -07:00
Meng Xu	7afbd10a10	Change teamRemover to machineTeamRemover	2019-07-02 15:16:34 -07:00
Meng Xu	d2d6022ed4	StorageServerTracker:Do not always set doBuildTeams When interface changes, we set doBuildTeams to true only when the interface location changes.	2019-07-02 14:24:26 -07:00
Meng Xu	de5bcaf588	minTeamNumber for server and machine cannot be uint64_t Because the consistency check will try to convert the value to int64_t. If no server exists, the variable will not be updated and thus get overflowed when it is converted to int64_t	2019-07-01 21:39:18 -07:00
Meng Xu	347a7ecdff	MachineTeams:Make traceTeamCollectionInfo not an actor	2019-07-01 16:50:53 -07:00
Meng Xu	b8cb883040	AddBestMachineTeams:Fix input must be non-negative value	2019-06-28 22:46:16 -07:00
Meng Xu	63c42533eb	TraceTeamCollectionInfo:Remove delay	2019-06-28 16:19:58 -07:00
Meng Xu	875cb877ac	TeamCollection: Apply clang-format	2019-06-28 16:01:05 -07:00
Meng Xu	0baae134f6	TeamCollectionInfo: Resolve review comments	2019-06-28 15:59:47 -07:00
Meng Xu	cb681693df	TeamCollection:Do NOT consider healthiness in counting team number If a team is removed from DD, it will be marked as failed and eventually removed from the global teams data structure. Team healthiness is likely to be a temporary state which can be changed rather quickly.	2019-06-28 09:50:43 -07:00
Meng Xu	ce7eb10cac	TeamCollectionInfo: Only count team number for healthy server and machine	2019-06-27 19:04:22 -07:00
Meng Xu	f889843332	Change traceTeamCollectionInfo to actor There are cases where traceTeamCollectionInfo was called within the same execution block, i.e., no wait between the two traceTeamCollectionInfo calls. Because simulation uses the same time for all execution instructions in the same execution block, having more than one traceTeamCollectionInfo at the same time will mess up the trackLatest semantics. When one of them is always chosen by simulator, simulation test will report false positive error. Changing this function to actor and adding a small delay inside the function can solve this problem.	2019-06-27 18:24:20 -07:00
Meng Xu	4fe3c7f749	TeamCollectionInfo:Revert to original version where it is	2019-06-27 17:09:21 -07:00
Meng Xu	42620e4831	TeamCollectionTest:GetTeamCollectionValid wait until values are correct	2019-06-27 16:52:36 -07:00
Meng Xu	ee41311a54	TeamCollection:Call addTeamsBestOf when remainingTeamBudget is not 0	2019-06-27 15:29:26 -07:00
Meng Xu	2993a96de8	TeamCollectionInfo: Remove debug trace and apply clang format	2019-06-27 14:15:51 -07:00
Meng Xu	5f5c404291	BugFix:ReplicationPolicy always fails when teamSize is 1 Whenever you use the selectReplicas function, be careful that it may have bugs! This bug is that it always returns false (not able to find candidates) when the storage team size is 1. This is wrong because when storage team size is 1, the selectReplicas should return an empty result.	2019-06-27 13:47:49 -07:00
Meng Xu	90c158984c	TeamCollection:Add extra trace events	2019-06-27 11:27:29 -07:00
Meng Xu	aaf97542e9	TeamCollectionTest: Update unit test	2019-06-27 11:27:29 -07:00
Meng Xu	53324e4db7	TeamCollectionInfo: clang format	2019-06-27 11:27:29 -07:00
Meng Xu	cc6a0e9bcd	TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0 In ConfigureTest, one server may be left with 0 server teams, even if we call buildTeams in the storageServerTracker.	2019-06-27 11:27:29 -07:00
Meng Xu	c23d89c98a	TeamCollection:Only count healthy teams for a server When team collection add new server teams, it picks a team with the least number of teams. We should only consider the healthy teams because the unhealthy ones will not be useful.	2019-06-27 11:27:29 -07:00
Meng Xu	e1d459075a	TeamCollection:Count healthy machine teams only Team collection should prioritize to build machine teams for a machine that has the least number of healthy machine teams, instead of just machine teams, because unhealthy machine team will not be able to produce more server teams.	2019-06-27 11:27:29 -07:00
Meng Xu	ee916b337d	TeamCollection:Change the target team number to build When team collection (TC) builds server teams and machine teams, it needs to build enough teams such that each server and machine has the DESIRED_TEAMS_PER_SERVER server teams and machine teams. This change calculates the number of teams (server teams and machine teams) needed to reach that count for each server and machine.	2019-06-27 11:16:44 -07:00
Meng Xu	21664742a6	TeamCollection:Desired team number may be larger than the max possible team number For example, we have 3 servers for replica factor 3. We can have only 1 team but the desired team number is 3 times 5 equal to 15. Instead of sanity checking the absolute team number per server, we check the difference between the minServerTeamOnServer and maxServerTeamOnServer.	2019-06-27 11:15:06 -07:00
Meng Xu	08f28e99f9	TeamCollection:Test no server or machine has incorrect team number Add test for simulation test which make sure the server team number per server will be no less than the desired_teams_per_server defined in knobs and no larger than the max_teams_per_server. Add similar test for machine teams number per machine as well.	2019-06-27 11:15:06 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Jon Fu	b473a8a830	changed on-the-wire format to use serialized flatbuffers, added cycletest to workload, and fixed small bug in trace	2019-06-11 15:45:06 -07:00
Jon Fu	a6bee65f11	Merge branch 'master' into add-data-distribution-metrics	2019-05-30 11:11:49 -07:00
A.J. Beamon	f417e60264	Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation # Conflicts: # fdbserver/QuietDatabase.actor.cpp	2019-05-23 09:52:00 -07:00
A.J. Beamon	d29c7e4c9b	Merge branch 'release-6.1' into merge-release-6.1-into-master # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/QuietDatabase.actor.cpp # versions.target	2019-05-23 09:28:45 -07:00
Jon Fu	e339ab890b	added timeout and shard limiting behaviour	2019-05-21 14:13:09 -07:00
Jon Fu	0984d272e1	initial implementation of get-dd-metrics	2019-05-21 14:13:09 -07:00
Evan Tschannen	90fe085696	fix: the healthyZone needs to be checked again once the timeout is expected to have elapsed	2019-05-21 13:49:16 -07:00
Evan Tschannen	a8e8be5aac	added a wait failure client which always waits the full failure reaction time, even if it knows the interface is never coming back use this new wait failure client in data distribution, to give time for a storage server to rejoin the cluster after its interface fails	2019-05-21 11:54:17 -07:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Evan Tschannen	2d5043c665	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # versions.target	2019-04-30 18:27:04 -07:00
Evan Tschannen	e0f7ec96aa	Data distribution needs to build new teams as old teams are removed to ensure data remains balanced across servers	2019-04-22 17:29:46 -07:00
Evan Tschannen	6220a5ce0f	Merge pull request #1370 from jzhou77/fix-unreferenced Remove unused functions	2019-04-09 11:49:45 -07:00
mpilman	1c16f87a4e	Remove trace-calls to printable (in non-workloads)	2019-04-05 13:12:19 -07:00
Evan Tschannen	a38c396283	made all maintenance transactions lock aware	2019-04-02 14:27:48 -07:00
Evan Tschannen	628fec8c8b	updated status with information about ongoing maintenance clear the maintenance zone if a different storage server is detected failed	2019-04-02 14:15:51 -07:00
Evan Tschannen	781cf9b5a0	added the ability to make a zoneId for maintenance in fdbcli	2019-04-01 17:55:13 -07:00
Jingyu Zhou	f7f8ddd894	Fix warnings on unused variables Found by -Wunused-variable flag.	2019-04-01 14:00:20 -07:00
Jingyu Zhou	9f6fe5f649	Merge remote-tracking branch 'apple/master' into ratekeeper	2019-03-15 11:30:04 -07:00
Jingyu Zhou	99d521ef4f	Monitor Ratekeeper and DataDistributor to use stateless processes Since Ratekeeper and DataDistributor are no longer running with Master, they might be running with stateful processes before a new Master becomes alive, which is undesirable. This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster Controller -- if Master runs on a stateless class and RK/DD runs at a worse class, then RK/DD will be killed. I.e., RK/DD should be running at their own classes or on the same stateless process as Master. After restart, RK/DD should be running at a better process class.	2019-03-14 15:00:57 -07:00
Evan Tschannen	a2108047aa	removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects.	2019-03-13 13:14:39 -07:00
Jingyu Zhou	2b0139670e	Fix review comment for PR 1176	2019-03-12 12:02:30 -07:00
Jingyu Zhou	cdfe906c30	Data distributor pulls batch limited info from proxy Add a flag in HealthMetrics to indicate that batch priority is rate limited. Data distributor pulls this flag from proxy to know roughly when rate limiting happens. DD uses this information to determine when to do the rebalance in the background, i.e., moving data from heavily loaded servers to lighter ones. If the cluster is currently rate limited for batch commits, then the rebalance will use longer time intervals, otherwise use shorter intervals. See BgDDMountainChopper() and BgDDValleyFiller() in DataDistributionQueue.actor.cpp.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	835cc278c3	Fix rebase conflicts.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	d52ff738c0	Fix merge conflicts during rebase.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	b2ee41ba33	Remove lastLimited from data distribution Fix a serialization bug in ServerDBInfo, which causes test failures.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	36a51a7b57	Fix a segfault bug due to uncopied ratekeeper interface	2019-03-07 13:16:20 -08:00
Jingyu Zhou	e6ac3f7fe8	Minor fix on ratekeeper work registration.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
anoyes	981426bac9	More ide fixes	2019-03-05 18:03:57 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	d008de576e	Merge pull request #1139 from xumengpanda/mengxu/machine-team-upgrade-PR Add background actor to remove redundant teams	2019-02-22 14:22:07 -08:00
Meng Xu	9445ac0b0c	Status: Use new data distributor worker to publish status After we add a new data distributor role, we publish the data related to data distributor and rate keeper through the new role (and new worker). So the status needs to contact the data distributor, instead of master, to get the status information.	2019-02-21 18:05:50 -08:00
Meng Xu	3e703dc2d1	TeamRemover: Fix bug that may not remove all teams needed	2019-02-21 15:54:16 -08:00
Meng Xu	7cca439e00	TeamRemover: Add status to show redundant team removing Distinguish the removal of unhealthy team and redundant team. Change status report to include redundant team removal report.	2019-02-21 14:16:46 -08:00
Meng Xu	0ac7014142	TeamRemover: Resolve minor comments from code review	2019-02-21 13:18:11 -08:00
Evan Tschannen	329ab766f1	factored out a duplicate code block attempted to fix a compiler error	2019-02-20 18:20:10 -08:00
Meng Xu	d86ba0e811	TeamRemover: Change it to run periodically This simplifies the problem of when we should invoke the teamRemover	2019-02-20 16:08:34 -08:00
Evan Tschannen	27e3617548	fix: remove bad teams needed to use dd_stall_check delay, because in simulation the buggified delay time could make us remove bad teams before they submit their ranges to the queue	2019-02-20 14:18:36 -08:00
Evan Tschannen	3a572b010f	fix: a forced recovery needed to force the data distributor to restart	2019-02-19 16:04:52 -08:00
mpilman	27a3153719	Use ACTOR forward declarations in MoveKeys Also MoveKeys.h -> MoveKeys.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3a0f9839b9	Fix minor IDE build errors	2019-02-19 15:16:59 -08:00
mpilman	3cb2391b58	use proper fwd declarations in ManagementAPI Also ManagementAPI.h -> ManagementAPI.actor.h	2019-02-19 15:16:59 -08:00
Meng Xu	111ab2eccc	TeamRemover: Check redundant team flag before satisfiesPolicy In addTeam(), to determine whether the team is a badTeam or not, we should check redundantTeam before checking satisfiesPolicy. Because if a team is redundantTeam, it has been removed from the system before we call addTeam(). The only reason we call addTeam() for a removed redundantTeam is to kick off the badTeam cleanup logic.	2019-02-19 14:46:47 -08:00
Meng Xu	e256d9a9ac	TeamRemover: Change ASSERT in teamRemover function When we remove a machine team in teamRemover function, we should always find the machine team in the global machineTeams. Change the ASSERT to the above invariant.	2019-02-19 08:13:10 -08:00
Meng Xu	3c1ed2eba9	TeamRemover: Confident no duplicate machine teams In removeMachineTeam, we are confident that there is no duplicate machine team when removing a machine team from a machine's vector of machineTeams	2019-02-19 08:13:10 -08:00
Meng Xu	ed1d4635bc	TeamRemover: Format cleaning Use clang-format and remove debug messages for the code that fixes bugs in merging the PR of adding a DataDistributor role	2019-02-19 08:13:10 -08:00
Meng Xu	211036ee22	TeamRemover: Fix bugs introduced in the previous commit	2019-02-19 08:13:10 -08:00
Meng Xu	a7810d9594	TeamRemover: Fix ASSERT condition in teamRemover	2019-02-19 08:13:10 -08:00
Meng Xu	06b6a1d2ad	TeamRemover: Bug fix in teamRemover and add teamRemover invocation point	2019-02-19 08:13:10 -08:00
Meng Xu	a6d3a5a3d6	TeamRemover: Change machineTeamNumber to healthyMachineTeamNumber Always use healthy machine team number as the condition of if redundant teams exist	2019-02-19 08:13:10 -08:00
Meng Xu	b35631365f	TeamRemover: Solve conflict when merge with PR 1061 The previous commit merge with the master, which just merges the pull request #1062 from jzhou77/PR that adds a new DataDistribution role. The merge causes conflicts and errors in simulation tests. This commit resolves the code conflicts and tries to fix the new errors after incorporating the new DataDistribution role	2019-02-19 08:13:10 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	d492395f84	fix: simulation could buggify a delay such that data distribution incorrectly thinks the queue is not processing unhealthy relocations	2019-02-18 14:57:07 -08:00
Vishesh Yadav	e05b53d755	Merge remote-tracking branch 'apple/master' into task/tls-upgrade	2019-02-15 20:37:07 -08:00
Meng Xu	6d09ac483c	Merge with master	2019-02-15 17:03:40 -08:00
Meng Xu	5ca074d86f	TeamRemover: No order of removing team and machine team We do NOT enforce the removing order of removing a machine team and the server teams on the machine team. This is for the benefit of clear code logic. When a storage server locality changes, we first remove the server and its machine if needed, before we handle the server team removal and addition.	2019-02-15 10:54:29 -08:00
Meng Xu	cfd323dafe	TeamRemover: Check when a server team is removed We do not actively remove a machine team when it has no server team on it. But since adding a server team may add a machine team, we need to be careful that the machine team number is not larger than the desired number due to server team creation. So whenever a server team is removed, we should check if the teamRemover should be kicked in.	2019-02-15 09:35:31 -08:00
Meng Xu	e803eef906	TeamRemover: Must be called when machine number changes When the machine number changes due to machine remove event, the desired machine team number changes. Then we need to make sure the teamRemover actor is running to clean up the redundant teams.	2019-02-14 20:53:26 -08:00
Meng Xu	1e55e8fea6	TeamRemover: Do not call teamRemover in getTeam getTeam is called very frequently and does not create a new team. So no need to call teamRemover in getTeam. teamRemover should be called only when a new team may be added.	2019-02-14 17:37:20 -08:00
Jingyu Zhou	5e6577cc82	Final cleanup per review comments Make distributor interface optional in ServerDBInfo and many other small changes.	2019-02-14 16:37:17 -08:00
Jingyu Zhou	bf6da81bf9	Remove recovery version from data distribution queue This parameter is no longer used/needed.	2019-02-14 16:37:16 -08:00
Evan Tschannen	038144adb1	Update fdbserver/DataDistribution.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	fc3a784963	Fix another build team bug The buildTeam() can create teams with undesired storage servers, which are considered unhealthy. As a result, the data movement can become stuck. Fix this by adding an ACTOR monitorHealthyTeams that builds team every one second whenever there is no healthy teams. Clean up storageServerTracker() interface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	8afe84d31b	Fix an ordering bug for buildTeam When zeroHealthyTeams signals and the storage server becomes healthy, we could attempt buildTeam before the ServerStatusMap is updated. As a result, the healthy server is not available for use. Fix by delaying the buildTeam after the status map is updated.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	a7d1111a10	Make servers and serverIDs private for TCTeamInfo Make both accessible through public member functions instead.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	8b1235533e	Fix segfault during configuration change This bug was introduced in cee23ee3. During a configuration change, the data distributor was restarted, which destroys previous DDTeamCollection and cancels all previous teamTracker(). In this case, even though the healthy team count reaches 0, there is no need to try to rebuild teams. The bug is triggered when trying rebuilding teams, DDTeamCollection is already destroyed.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	07dab56133	Fix a data movement stuck bug When moving keys to a team, if one of the servers in the target team died, then the move can become stuck. This is because the DDTeamCollection waits for all the data movement of the failed server to be completed. However, in this case, because the movement has not finished yet, checking the database tells us there is no key associated with this server and it is safe to go ahead. In reality, only the in-memory structure knows there is pending movement, i.e., unfinished move causes some keys to be attributed to the failed server. Thus, the server can't be removed yet. Fix by adding a check with in-memory structure in waitForAllDataRemoved(). Use const& to optimize a few function parameters.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	961d71538e	A follow-on fix to ensure build team for zero teams	2019-02-14 16:37:16 -08:00
Jingyu Zhou	5deeec29e3	Fix a bug where team is not rebuild after storage failure When two failures happened to a team, one of the server recovered. The current logic skips for building a new team, which is wrong.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	21066b013a	Remove DataDistributorRejoinRequest This is no longer needed, since worker registration piggybacks distributor interface now.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	b3d1633114	Fix bugs of missing request The quiet database can fail to send out requests and report timeout. This seems to be caused by reusing a request that uses the same ReplyPromise. Another bug is Proxy can wait for unneeded time for a database change, while the distributor is already known to itself.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	8c61de318f	Fix segfault and no_more_servers errors	2019-02-14 16:37:16 -08:00
Jingyu Zhou	7897616164	Fix wait failure bug on cluster controller The setDistributor() sets an AsyncVar and then runs waitFailureClient. This ordering is wrong because the AsyncVar::set triggers the other loop to run first, which will wait on Never(). The correct code should wait on the Future returned by the waitFailureClient.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	6a655143e8	A follow-on fix for config key usage And some trace event cleanups.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	be5c962bb7	Add a new configuration version key \xff/conf/version This fixed a bug found by upgrade test, where the configuration monitor of the data distributor was monitoring excludedServersVersionKey, which doesn't change in ChangeConfig workload. As a result, data distributor was not aware of configuration changes. Adding this new key and make sure this key is updated in configuration changes so that the monitor can detect configuration changes.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	3135f1d84b	Cluster controller ignores distributor rejoin After controller starts one, it will wait for that one and ignore any rejoins received later. Add remoteRecovered() to data distribution for remote team collection.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	99e109d6c5	Fix timeout error due to lost exception Found in tests, a move key conflict exception was not handled because the Future object was not waited by someone. As a result, the data distributor did not die and database checking couldn't get the metric and keep trying until timeout.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	f5242bda7c	Update data distributor to use configuration monitor This enable removal of GetRecoveryInfoRequest from master interface. Remove recoveryTransactionVersion from dataDistribution().	2019-02-14 16:37:16 -08:00
Jingyu Zhou	7a205b1732	Move remoteRecovered to dataDistributionTeamCollection() Let the remote DC to wait until fully recovered before team collection starts.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	ef868f599c	Add DataDistributorInterface to ServerDBInfo Also change the Proxy and QuietDatabase to use the DataDistributorInterface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0490160714	Fix according to Evan's comments Use getRateInfo's endpoint as the ID for the DataDistributorInterface. For now, added a "rejoined" flag for ClusterControllerData and Proxy. TODO: move DataDistributorInterface into ServerDBInfo.	2019-02-14 16:30:13 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let cluster controller to start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Meng Xu	8ee8b98122	TeamCollection: Cosmetic change	2019-02-14 15:59:20 -08:00
Meng Xu	5481851e82	TeamCollection: Add knobs for team remover Added three knobs to control team remover bool TR_FLAG_DISABLE_TEAM_REMOVER: Disable the teamRemover actor double TR_REMOVE_MACHINE_TEAM_DELAY: Wait for the specified time before try to remove next machine team double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY: Wait before checking if all machines are healthy	2019-02-13 15:11:56 -08:00
Meng Xu	01e55e43bd	TeamCollection: Minor improve code efficiency and style Rewording the feature item in the release document as well.	2019-02-12 19:10:53 -08:00
Meng Xu	c8db205fd9	TeamCollection: Fix bug in remove a server When we remove a server due to server failure, we need to remove the related server teams AND remove the server team from the machine team. In the previous commit, we forgot to remove the server team from the machine team.	2019-02-12 16:18:19 -08:00
Meng Xu	fe4f43203d	TeamCollection: getTeam may add a new team getTeam function may add a new team for the GetTeamRequest. We need to check if the number of teams is larger than the desired team number.	2019-02-12 14:57:35 -08:00
Meng Xu	3ae8767ee8	TeamCollection: Apply clang-format	2019-02-12 13:41:18 -08:00
Meng Xu	214a72fba3	TeamCollection: Resolve review comments 1) Reduce the frequency of checking if we need to call teamRemover 2) Improve code efficiency in finding the machine team to remove 3) Remove unused code 4) Add sanity check	2019-02-12 10:59:57 -08:00
Meng Xu	3b8ae0fe95	TeamCollection: Add into 6.1 release note	2019-02-08 13:50:27 -08:00
Meng Xu	7cfe6de27e	TeamCollection: Server team number must match machine team number DESIRED_TEAMS_PER_MACHINE must equal to DESIRED_TEAMS_PER_SERVER. Otherwise, we may have too few machine teams to create enough server teams. Note that BUGGIFY macro value is based on a random number generator. When you have two BUGGIFY, one may be true and the other is false. Also fix a bug in get the number of healthy machine teams.	2019-02-07 13:53:55 -08:00
Meng Xu	76d022f71c	TeamCollection: Remove redundant teams When the total number of teams is larger than the desired number, we should gracefully remove the redundant teams so that the number of teams is kept to a low number and the possibility of losing data is guaranteed to be extremely low even when multiple racks fail at the same time.	2019-02-07 11:24:51 -08:00
Meng Xu	455024b3fe	SimulationTest: Test the number of teams Magnify the possibility that the number of created machine teams is larger than the number of desired machine teams if we do NOT try to remove the surplus machine teams. This help test the upgrade to machine team in FDB 6.1	2019-02-06 11:04:41 -08:00
Meng Xu	2b73c89e98	TeamCollection: Test the number of teams Call the traceTeamCollectionInfo function to record the team numbers when we add a team directly from the shard information, instead of using addTeamsBestOf logic.	2019-02-05 15:58:16 -08:00
Meng Xu	f5171d1b57	TeamCollection: Test the number of teams The current simulator does not validate if the number of teams in the system is larger than the maximum desired number of teams. This validation should be added because we do NOT want too many teams in the system, which may impede the systems availability when multiple fault zones (e.g., machines) crashes at the same time. This commit adds the test at the consistency check in simulation. Since the current code does not handle the upgrading situation when we enforce the machine teams, the test is expected to fail. The later commit will handle the upgrading situation which gracefully remove the surplus teams.	2019-02-04 18:14:36 -08:00
Evan Tschannen	1d7fec3074	Merge commit '048bfc5c368063d9e009513078dab88be0cbd5b0' into task/tls-upgrade-2 # Conflicts: # .gitignore	2019-01-24 17:43:06 -08:00
Evan Tschannen	684a22a52b	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbbackup/backup.actor.cpp # fdbclient/BackupContainer.actor.cpp # fdbclient/HTTP.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/BackupCorrectness.actor.cpp # versions.target	2019-01-09 16:14:46 -08:00
Evan Tschannen	57293a2db0	byte sample recovery did not use limits for its range reads, leading to slow tasks	2019-01-04 10:32:31 -08:00
Simon Zhou	7edf221986	Avoid null check	2018-12-28 13:09:04 -08:00
Meng Xu	486a7b04fa	TeamCollection: Fix build in osX In osX, we cannot adding unsigned long to a string to append to the string.	2018-12-14 13:44:11 -08:00
Vishesh Yadav	3eb9b23024	Listen to multiple addresses and start using vector<NetworkAddress> in Endpoint - This patch will make FDB listen to multiple addresses given via command line. Although, we'll still use first address in most places, this patch starts using vector<NetworkAddress> in Endpoint at some basic places. - When sending packets to an endpoint, pick a random network address in endpoints - Renames Endpoint::address to Endpoint::addresses since it now holds a vector of addresses.	2018-12-13 13:36:52 -08:00
Vishesh Yadav	43e5a46f9b	Change Endpoint::address(NetworkAddress) to vector<NetworkAddress> Extend `Endpoint` class to take multiple NetworkAddresses instead of just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll have multiple IP:PORT pairs. This patch simply adds the field and makes changes to compile the codebase. The first element of `address` field is used everywhere. Hence the way we talk to remains same with this patch. NOTE: Directly accessing the first member of Endpoint::address is unsafe as Endpoint() doesn't enforces non-empty address list. However, since the correctness test pass for now and are anyway replacing all those unsafe accesses with ones considering the whole vector, this patch ignores to access them in safe way.	2018-12-13 13:36:52 -08:00
Meng Xu	79d94f78f1	TeamCollection: Improve code efficiency Further improve code efficiency by 1) Avoid rebuild machine locality map when machine locality is changed. This may leave the global machine locality map stale. This is ok as long as we do not use the global map to validate the machine team follows the locality policy. 2) Use ASSERT_WE_THINK instead of ASSERT to avoid runtime overhead. ASSERT_WE_THINK will only validate the condition in simulation mode. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-12 22:38:38 -08:00
Meng Xu	e197926c80	TeamCollection: Remove a duplicate function Remove a duplicate function that has different signature. No functionality change. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-12 15:21:37 -08:00
Meng Xu	ad7040efcd	TeamCollection: Bug fix in handle server locality change Make sure the link between server and machine is updated in both server and machine. Rename function name to better reflect its functionality. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-12 14:03:29 -08:00
Meng Xu	e069b5c31c	TeamCollection: Use clang format No functional change. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-06 11:39:35 -08:00
Meng Xu	5d47b9c884	TeamCollection: Handle server locality change A server locality may change from one machine to another. This affects the old machine and machine team the server is on, and the new machine the server moves to. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 22:23:14 -08:00
Meng Xu	c5047bc8c3	TeamCollection: All machine teams are correct size We only create correct size machine teams. When configuration (e.g., team size) is changed, the DDTeamCollection will be destroyed and rebuilt so that the invariant will not be violated. Based on the invariant, we can count the number of machine teams more quickly. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 15:09:38 -08:00
Meng Xu	57eab1f283	DataDistribution: Remove addAllTeams function The addAllTeams function can be replaced with the new addTeamsBestOf function by passing a large enough number of teams to build. Remove addAllTeams function and update the related unit tests. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 15:03:16 -08:00
Meng Xu	38c5c2562b	DataDistribution: Update NotEnoughServers unit test The buggify option may set 1 to the knob parameters (DESIRED_TEAMS_PER_SERVER and MAX_TEAMS_PER_SERVER). When this happens, the number of machine teams to build will be less than what we want, which prevents us from building enough server teams. To avoid this problem, we build machine teams before we call addTeamsBestOf to build server teams. We also add the ASSERT to ensure we build enough machine teams and server teams in the test case. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 14:36:48 -08:00
Meng Xu	f32c04c834	DataDistribution: Update NotEnoughServers unit test Change the test condition for the NotEnoughServers unit test. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-03 23:14:01 -08:00
Meng Xu	54a4d6b308	TeamCollection: Improve code efficiency Improve code efficiency with the following changes: 1) Change always-true if-statement to ASSERT; 2) Return when we are confident we will not find more machine teams. No functionality change. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 17:10:50 -08:00
Meng Xu	8d6c6e000b	DataDistribution: Mute the NotEnoughServers test Due to the randomness in choosing a server, we cannot guarantee to find all teams. The NotEnoughServers test case may create false positive bug report in the correctness test. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:29:45 -08:00
Meng Xu	68dcec2240	DataDistribution: Change a unit test Try multiple times of addTeamsBestOf() when we cannot find an available team due to the pure randomness in choosing the server teams. The changes for the unit test reduces the false positive in the simulation test results. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:12:55 -08:00
Meng Xu	a43f579f66	TeamCollection: Change 1 unit test Relax the assert condition on the random unit test. Due to the randomness in choosing the machine team and the server team from the machine team, it is possible that we may not find the remaining several (e.g., 1 or 2) available teams. For example, there are at most 10 teams available, and we have found 9 teams, the chance of finding the last one is low when we do pure random selection. It is ok to not find every available team because 1) In reality, we only create a small fraction of available teams, and 2) In practical system, this situation only happens when most of servers are temporarily unhealthy. When this situation happens, we will abandon all existing teams and restart the build team from scratch. In simulation test, the situation happens 100 times out of 128613 test cases when we run RandomUnitTests.txt only. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:11:19 -08:00
Meng Xu	f311455c45	TeamCollection: Cleanup code and add checks Remove unnecessary sanity checks and remove the dead code. Add some necessary sanity checks. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-30 17:40:21 -08:00
Meng Xu	ea3bd1502d	TeamCollection: Calculate machine team number Calculate the number of machine teams in the same way as we calculate the number of server teams. Only count the machine teams that has the correct size and is healthy. Simplify code by removing unnecessary check. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-29 15:38:23 -08:00
Meng Xu	2b41ad5e57	TeamCollection: Pick server team randomly Pick server team purely randomly instead of picking the least used one. This is to avoid creating correlation in the server teams we pick when new machines are added. The logic is: First pick the one random least used server as chosen server; Then pick a machine team that has the server; Then pick a server on each machine in the machine team. We make sure the chosen server is picked. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-28 15:57:53 -08:00
Meng Xu	e4c9d4cbae	TeamCollection: Build all machine teams first Before we build server teams, we build the desired number of machine teams. Then we pick the least used server, from which we pick the least used machine team. Then we pick the least used server on each machine in the least used machine team to get the server team. Note: The logic of building machine teams should be independent from server teams. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-27 18:06:36 -08:00
Meng Xu	4c2c65c1b3	TeamCollection: Replace TraceEvent with ASSERT Replace one TraceEvent that never happens in correctness test with an ASSERT. Change format in one comment. Signed-off-by: Meng xu <meng_xu@apple.com>	2018-11-27 09:48:24 -08:00
Meng Xu	5cbff740ca	TeamCollection: Add ASSERT Remove sanity check code for performance benefit. Replace TraceEvent(SevError) with ASSERT. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-21 13:16:52 -08:00
Meng Xu	8de031f9a6	TeamCollection: clang-format Format the changes with git clang-format. No functional changes. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-21 11:18:26 -08:00
Meng Xu	12c3bec968	TeamCollection: Misc changes to resolve review comments No functional change. Report error in TraceEvent when invariant is violated. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-19 20:44:52 -08:00
Meng Xu	52c6a66601	TeamCollection: Fix a bug introduced in code review When we GetTeam, the data distribution actor may have zero teams in rare situation in the ConfigureTest.txt test. We should return an empty team in this situation instead of triggering error. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 16:34:38 -08:00
Meng Xu	f7a7e069f0	TeamCollection: Remove unnecessary comments Pass 41806 tests with no failure Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:56:35 -08:00
Meng Xu	73c58852f0	TeamCollection: Resolve code review comments Resolve code review comments: 1) Improve the code efficiency by avoiding unnecessary map search and avoiding unnecessary checking 2) Remove or comment out trace events when they can be spammy 3) Improve coding style Tested for 1 hour and no error was found. KillRegionCycle.txt test was excluded from the test because existing code cannot pass that test either Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:55:33 -08:00
Meng Xu	5051b35c61	TeamCollection: Use machine team to create server team Current server team collection logic does not consider the fact that multiple storage servers can run on the same machine. When multiple machines fail, all servers on the machines will fail, and the possibility of having one process team fail and lose data is very high. To reduce the possibility of losing data when multiple machine fails, we first create machine teams which span across different fault zones; we then create server teams based on machine teams by first picking 1 machine team, and then picking 1 server from each machine in the machine team. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:53:22 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	26c49f21be	fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed	2018-11-12 17:39:40 -08:00
Evan Tschannen	cd188a351e	fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy fix: badTeams were not accounted for when checking priorities	2018-11-11 12:33:31 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	7c23b68501	fix: we need to build teams if a server becomes healthy and it is not already on any teams	2018-11-09 18:06:00 -08:00
Evan Tschannen	3e2484baf7	fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region	2018-11-09 10:07:55 -08:00
Evan Tschannen	19ae063b66	fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available	2018-11-08 15:44:03 -08:00

551 Commits