- Make sure the disabled data distribution won't be accidentally enabled by the 'maintenance' command
- Make sure the status json reflects the status of DD accordingly
- Make sure the CLI handles the new DD states correctly, i.e., prints warnings when necessary
- Use the pre-existing 'healthZone' key and write a special value to it in order to disable DD for all storage server failures (a sketch follows this note)
- Use a new system key, 'rebalanceDDIgnored', to disable/enable DD for all rebalance reasons (MountainChopper and ValleyFiller)
Kicked off two 200K correctness runs; they showed no related errors.
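A minimal sketch of the disable path, assuming FDB's native C++ API; the key identifier (healthyZoneKey) and the sentinel value are assumptions for illustration, not the actual values:

    // Sketch only: disable DD for all storage server failures by writing
    // an assumed sentinel value to the pre-existing 'healthZone' system key.
    ACTOR Future<Void> disableDDForSSFailures(Database cx) {
        state Transaction tr(cx);
        loop {
            try {
                tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
                tr.set(healthyZoneKey, LiteralStringRef("IgnoreSSFailures")); // placeholder value
                wait(tr.commit());
                return Void();
            } catch (Error& e) {
                wait(tr.onError(e)); // standard FDB retry loop
            }
        }
    }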
Change the per-server team-count threshold that sets lastBuildTeamsFailed
from DESIRED_TEAMS_PER_SERVER to
(SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER * (configuration.storageTeamSize + 1)) / 2;
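As a worked example (knob values made up for illustration), the new threshold computes as follows:

    // With DESIRED_TEAMS_PER_SERVER = 5 and storageTeamSize = 3, the
    // threshold is (5 * (3 + 1)) / 2 = 10 teams per server.
    int buildTeamsFailedThreshold(int desiredTeamsPerServer, int storageTeamSize) {
        return (desiredTeamsPerServer * (storageTeamSize + 1)) / 2;
    }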
If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to the teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnecessary extra data movement.
1) No need to check a server with only one team when teamRemover finds
a server team or machine team to remove
2) Fix optimalTeamCount counting in teamTracker
If the minimum number of teams across the servers in a team is less than the
target value (desired_team_number_per_server * (teamSize + 1) / 2),
the team remover should not remove the team. Otherwise, DD will oscillate
between building more teams and removing redundant teams.
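A hedged sketch of the check (helper and parameter names are hypothetical):

    #include <algorithm>
    #include <vector>

    // Remove a team only if every member server would stay at or above
    // the target team count; otherwise DD would rebuild what we remove.
    bool safeToRemove(const std::vector<int>& teamsPerMemberServer,
                      int desiredTeamsPerServer, int teamSize) {
        int target = (desiredTeamsPerServer * (teamSize + 1)) / 2;
        int minTeams = *std::min_element(teamsPerMemberServer.begin(),
                                         teamsPerMemberServer.end());
        return minTeams >= target;
    }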
Do not run the consistency check in three_data_hall mode, because when
machines are not evenly distributed across data halls, we need to build
more teams than the total desired number to ensure the number of teams
per server is no less than the target value.
Do not overbuild teams, because we may oscillate between building more teams
and removing the redundant ones. The oscillation happens when machines are not
evenly distributed across availability zones.
For example, in three_data_hall mode, suppose we have 1 machine in each of
two data halls and 3 machines in the third. To build enough teams for the
servers in the third data hall, we would overbuild teams; however,
the teamRemover would then remove those newly built teams.
Change to remove the machine team whose machines belong to the most machine
teams, using the same logic as the serverTeamRemover.
The feature is guarded by the TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
Before the serverTeamRemover tries to pick a team to remove,
it waits for all data movement to finish, which means all teams are healthy.
When the serverTeamRemover starts to pick a team to remove,
we believe all servers are healthy.
A storage server should not be colocated with tLogs,
so we want to mark such a server as undesired.
However, if there are not enough processes in the system, we have
no choice but to colocate them.
The old logic marks the server undesired if optimalTeamCount > 0.
However, there is a rare case where optimalTeamCount is 1 when it should be 0.
To handle this situation, we add another condition, healthyTeamCount > 0,
as a guard before marking such a colocated server undesired.
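A minimal sketch of the new condition (the surrounding structure and the markUndesired() helper are assumptions):

    // Mark a server colocated with a tLog as undesired only when there is
    // also at least one healthy team; this guards the rare case where
    // optimalTeamCount is 1 when it should be 0.
    if (self->optimalTeamCount > 0 && self->healthyTeamCount > 0) {
        server->markUndesired(); // hypothetical helper
    }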
When a teamTracker is cancelled, e.g., by the redundant teamRemover or badTeamRemover,
we should decrease optimalTeamCount if the team is considered optimal,
i.e., every member's machine fitness is no worse than unset and
the team is healthy.
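A sketch of the cleanup on cancellation, with hypothetical helper names:

    // On teamTracker cancellation (e.g., by teamRemover or badTeamRemover),
    // undo this team's contribution to the optimal-team counter. A team is
    // optimal when every member's machine fitness is no worse than 'unset'
    // and the team is healthy.
    void onTeamTrackerCancelled(DDTeamCollection* self, Reference<TCTeamInfo> team) {
        if (team->isOptimal() && team->isHealthy()) {
            self->optimalTeamCount--;
        }
    }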
Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
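A hedged flow-style sketch of the waiting behavior (function name, fields, and the timeout are assumptions):

    // Poll instead of failing immediately: serverTeamRemover needs time to
    // bring the server team count down to the desired number.
    ACTOR Future<bool> checkTeamCountEventually(DDTeamCollection* self,
                                                int desiredTeams, double timeout) {
        state double deadline = now() + timeout;
        loop {
            if ((int)self->teams.size() <= desiredTeams) return true; // valid
            if (now() > deadline) return false; // still too many teams
            wait(delay(1.0));
        }
    }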
To remove a team, pick the one whose minimum per-server team count is the largest.
addTeamsBestOf should keep building teams until each server has at least the
target number of teams.
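A sketch of the selection rule (the container layout is an assumption: one vector of per-server team counts per candidate team):

    #include <algorithm>
    #include <vector>

    // Compute each candidate team's minimum per-server team count and
    // remove the team for which that minimum is largest: its servers can
    // best afford to lose a team.
    int pickTeamToRemove(const std::vector<std::vector<int>>& teamCounts) {
        int best = -1, bestMin = -1;
        for (int i = 0; i < (int)teamCounts.size(); i++) {
            int m = *std::min_element(teamCounts[i].begin(), teamCounts[i].end());
            if (m > bestMin) { bestMin = m; best = i; }
        }
        return best; // index of the team to remove (-1 if no candidates)
    }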
A redundant team removed by teamRemover will no longer exist
in the global teams data structure, so we will not find
it via the shard-to-team mapping in the system keyspace.
Before this change, teamTracker marked such a team as PRIORITY_TEAM_UNHEALTHY;
with this change, it marks it as PRIORITY_TEAM_REDUNDANT.
We build more teams than we ultimately want so that the serverTeamRemover() actor can remove the teams
whose members belong to too many teams. This gives us a more balanced number of teams per server.
The consistency check tries to convert the value to int64_t.
If no server exists, the variable is never updated and thus overflows
when it is converted to int64_t.
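An illustration of the failure mode, assuming the counter is unsigned and initialized to its maximum (names are made up):

    #include <cstdint>
    #include <limits>

    // The minimum starts at the unsigned maximum so any real server lowers
    // it. With zero servers it is never updated, and converting the
    // untouched maximum to int64_t wraps to -1.
    uint64_t minTeamsOnServer = std::numeric_limits<uint64_t>::max();
    // ... no servers, so no update happens ...
    int64_t reported = static_cast<int64_t>(minTeamsOnServer); // -1, not the real minimum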
If a team is removed from DD, it will be marked as failed and eventually removed from the
global teams data structure.
Team healthiness is likely to be a temporary state that can change rather quickly.
There are cases where traceTeamCollectionInfo was called twice within the same execution block, i.e.,
with no wait between the two calls.
Because simulation uses the same time for all instructions in the same execution block,
having more than one traceTeamCollectionInfo event at the same timestamp breaks the trackLatest semantics:
when the simulator always chooses one of them, the simulation test reports a false positive error.
Changing the function to an actor and adding a small delay inside it solves the problem.
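A sketch of the fix in flow (the signature, fields, and the delay constant are assumptions):

    // Making the function an actor with a tiny delay guarantees that two
    // calls in the same execution block get distinct simulated timestamps,
    // so trackLatest keeps the truly latest event.
    ACTOR Future<Void> traceTeamCollectionInfo(DDTeamCollection* self) {
        wait(delay(0.01)); // illustrative; forces a new simulated time
        TraceEvent("TeamCollectionInfo", self->distributorId)
            .detail("CurrentTeamCount", self->teams.size())
            .trackLatest("TeamCollectionInfo");
        return Void();
    }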
Be careful whenever using the selectReplicas function: it may have bugs!
This particular bug is that it always returns false (unable to find candidates)
when the storage team size is 1. This is wrong: when the storage team size
is 1, selectReplicas should succeed and return an empty result.
When team collection adds new server teams, it picks a team with
the least number of teams. We should consider only the healthy teams,
because the unhealthy ones will not be useful.
Team collection should prioritize building machine teams for a machine
that has the least number of healthy machine teams, instead of simply the
least number of machine teams, because an unhealthy machine team
cannot produce more server teams.
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams so that each server and machine has
DESIRED_TEAMS_PER_SERVER server teams and machine teams, respectively.
This change calculates the number of teams (server teams and machine teams)
needed to reach that target for each server and machine.
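One way to compute the target, sketched with illustrative names (the real calculation may differ, e.g., in rounding):

    #include <algorithm>

    // Each team provides teamSize memberships, so reaching
    // desiredTeamsPerServer memberships on each of serverCount servers
    // needs roughly serverCount * desiredTeamsPerServer / teamSize teams.
    int teamsToBuild(int serverCount, int desiredTeamsPerServer,
                     int teamSize, int currentTeams) {
        int desiredTotal =
            (serverCount * desiredTeamsPerServer + teamSize - 1) / teamSize; // round up
        return std::max(0, desiredTotal - currentTeams);
    }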