Commit Graph

384 Commits

Author SHA1 Message Date
Xin Dong ae11efcb0a Made following changes:
- Make sure the disabled data distribution won't be accidentally enabled by the 'maintenance' command
- Make sure the status json reflects the status of DD accordingly
- Make sure the CLI works correctly with the new DD states, i.e. prints warnings when necessary
2019-07-30 22:20:45 -07:00
Xin Dong 4ecfc9830f Added finer grained controls to DataDistribution in fdbcli. What's happening under the hood is:
- Use pre-existing 'healthZone' key and write a special value to it in order to disable DD for all storage server failures
- Use a new system key 'rebalanceDDIgnored' to disable/enable DD for all rebalance reasons (MountainChopper and ValleyFiller)

Kicked off two 200K correctness runs, which showed no related errors.
2019-07-30 22:17:21 -07:00
Evan Tschannen dd4ab63d90 fixed another bad trace event name 2019-07-30 19:36:26 -07:00
Evan Tschannen b8cd51c4d3 fixed invalid trace event name 2019-07-30 19:23:54 -07:00
Evan Tschannen a78a97f186
Merge pull request #1908 from etschannen/feature-better-dd
A few data distribution improvements
2019-07-30 17:34:50 -07:00
sramamoorthy a88aaa0f04 review comment 2019-07-30 17:04:51 -07:00
sramamoorthy 63941e0d96 disable DD with an in-memory flag and use in snapv2 2019-07-30 17:04:51 -07:00
Evan Tschannen 5dd9043fd3 addressed review comments 2019-07-30 17:04:41 -07:00
Evan Tschannen 481642fbd4 Merge branch 'master' into feature-better-dd 2019-07-30 16:56:27 -07:00
Evan Tschannen a3fe3d4324
Merge pull request #1923 from xumengpanda/mengxu/evan-dd-improvement-minor-improvement
DD:Change condition for lastBuildTeamsFailed
2019-07-30 16:54:42 -07:00
A.J. Beamon 14648e20f9
Merge pull request #1901 from ajbeamon/data-distribution-receives-bytes-input-rate
Send bytes input rate to data distribution
2019-07-30 15:01:36 -07:00
Meng Xu 0e50656c7f DD:Change condition for lastBuildTeamsFailed
Change the threshold team number per server that should set lastBuildTeamsFailed
from DESIRED_TEAMS_PER_SERVER to
(SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER * (configuration.storageTeamSize + 1)) / 2;
2019-07-30 11:07:02 -07:00
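The threshold described in commit 0e50656c7f above can be sketched as a small function. This is only an illustration of the arithmetic in the commit message: the function name and its two parameters are hypothetical stand-ins for SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER and configuration.storageTeamSize, not the actual FoundationDB code.

```cpp
#include <cstdint>

// Hypothetical helper: both parameters stand in for values that the real
// code reads from knobs and the database configuration.
constexpr int64_t teamThresholdPerServer(int64_t desiredTeamsPerServer,
                                         int64_t storageTeamSize) {
    // Integer division, as in the expression quoted in the commit message.
    return (desiredTeamsPerServer * (storageTeamSize + 1)) / 2;
}
```

With a desired count of 5 teams per server and a storage team size of 3, the threshold is 10 rather than 5, so the threshold grows with the team size.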
Evan Tschannen a0f26b604c
Merge pull request #1907 from etschannen/master
A number of bug fixes for rare problems found by correctness testing
2019-07-29 21:04:38 -07:00
sramamoorthy 5a56f6b456 minor snap create client improvement and bug fixes 2019-07-29 20:28:22 -07:00
Evan Tschannen cc4481b71a team builders prefer to make teams which overlap less with existing teams 2019-07-28 23:44:23 -07:00
Evan Tschannen 7e97bd181a fix: we need to build teams when a server becomes healthy if it is possible another server does not have enough teams 2019-07-28 19:31:21 -07:00
Evan Tschannen 04dd293af0
Merge pull request #1874 from xumengpanda/mengxu/DD-code-read
DataDistribution:Add comments to help understand the code
2019-07-26 13:30:44 -07:00
Evan Tschannen 2123fa1c3a
Merge pull request #1853 from xumengpanda/mengxu/redundantTeamRemoverPriority-PR
Lower the RelocateShard priority for removing redundant teams
2019-07-26 13:28:42 -07:00
A.J. Beamon b91795d288 Send bytes input rate to DD. 2019-07-25 16:27:32 -07:00
senthil-ram edeec8a622 Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: Alex Miller <35046903+alexmiller-apple@users.noreply.github.com>
2019-07-24 15:36:28 -07:00
sramamoorthy a65c9f92ed get rid of all timeouts and other changes 2019-07-24 15:36:28 -07:00
sramamoorthy a2f2ad96ff code review comments and merge to master changes 2019-07-24 15:36:28 -07:00
sramamoorthy 4f2bb561de snapshot only local tlogs and not the satellite 2019-07-24 15:36:28 -07:00
sramamoorthy 021c949801 increase snaptime out to 15s for simulator 2019-07-24 15:36:28 -07:00
sramamoorthy 869f77aef1 Few cosmetic edits and fixes 2019-07-24 15:36:28 -07:00
sramamoorthy ddd4523816 bug fix in timeout & header file re-arrange in DD 2019-07-24 15:36:28 -07:00
sramamoorthy 31c010b393 few minor fixes 2019-07-24 15:36:28 -07:00
sramamoorthy 62c14dae72 disable dd during snap and enable in restore 2019-07-24 15:36:28 -07:00
sramamoorthy ba6bccce73 snap v2: DD changes - snapshot orchestration logic 2019-07-24 15:36:28 -07:00
Meng Xu b7478f5dd3 DD:Add comments to help understand code
Add comments to explain the functionalities of some code.
2019-07-22 11:23:16 -07:00
Meng Xu 378db79441 Resolve conflict when merge with master 2019-07-22 10:56:20 -07:00
Meng Xu dae4436a3d TC:UnitTest:Change invariant due to alg change 2019-07-20 21:06:54 -07:00
Meng Xu b001a9ebe8 ServerTeamRemover runs after machineTeamRemover finishes
If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnecessary extra data movement.
2019-07-19 16:48:52 -07:00
Meng Xu 64bee63dbc Resolve two review comments
1) No need to check server with only one team when teamRemover finds
a server team or machine team to remove

2) Fix optimalTeamCount counting in teamTracker
2019-07-18 18:46:31 -07:00
Meng Xu 915732ce24 TeamRemover:Reset the removed team counter after removal 2019-07-16 11:17:51 -07:00
Meng Xu 20f067e794 Merge with master:Resolve conflict with PR#1797 2019-07-16 10:52:28 -07:00
Meng Xu 243504b125 DD:Clang format changes 2019-07-15 18:40:14 -07:00
Meng Xu 94e9b8a3b4 Do not remove a team whose min team number is less than target
If the minimum number of teams of servers in a team is less than the
target value (desired_team_number_per_server * (teamSize + 1) / 2),
the team remover should not remove it. Otherwise, DD will oscillate in
building more teams and removing redundant teams.

Do not do consistency check for three_data_hall mode because when
machines are not evenly distributed across data halls, we will
need to build more teams than the total desired number to make sure
the number of teams per server is no less than the target value.
2019-07-15 18:30:13 -07:00
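The removal guard from commit 94e9b8a3b4 above can be sketched as follows. The function name and the input shape (a list of per-server team counts for one candidate team) are assumptions made for illustration; only the target formula and the "do not remove below target" rule come from the commit message.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// memberTeamCounts: for one candidate team, how many teams each of its member
// servers currently belongs to (hypothetical input, not an FDB data structure).
bool canRemoveTeam(const std::vector<int64_t>& memberTeamCounts,
                   int64_t desiredTeamsPerServer,
                   int64_t teamSize) {
    if (memberTeamCounts.empty()) {
        return false; // nothing to reason about, so be conservative
    }
    // Target value as written in the commit message.
    const int64_t target = desiredTeamsPerServer * (teamSize + 1) / 2;
    const int64_t minCount =
        *std::min_element(memberTeamCounts.begin(), memberTeamCounts.end());
    // A team whose least-covered server is already below target is kept;
    // removing it would only trigger another round of team building.
    return minCount >= target;
}
```

Refusing removal when any member is below target is what breaks the build-then-remove oscillation the commit describes.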
Meng Xu cafe9b9412 TC:Target team num per server is desired number
Do not overbuild teams because we may oscillate between building more teams and
removing the redundant teams. The oscillation happens when the machines are not
evenly distributed across availability zones.
For example, in three_data_hall mode, suppose two data halls have 1 machine each and
the 3rd data hall has 3 machines. To build enough (or more) teams for the servers
in the 3rd data hall, we will overbuild teams. However,
the teamRemover will then remove those newly built teams.
2019-07-15 17:32:51 -07:00
Meng Xu 415622f465 MachineTeamRemover:Change to remove MT with most teams
Change to remove machine team with most machine teams, using the same
logic as the serverTeamRemover.

The feature is guarded by the TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
2019-07-15 14:29:49 -07:00
Meng Xu 5c5e883745 TC:Keep building until each server and machine has at least the expected number of teams 2019-07-12 19:16:18 -07:00
Meng Xu 8454d74da9 TC:Change remainingTeamBudget to ensure each server has more than desired team number 2019-07-12 18:39:01 -07:00
Meng Xu 1c0daa7f2c Resolve review comments:Remove unneeded code 2019-07-12 18:10:04 -07:00
Meng Xu aa19da6977 TC:TraceAllInfo:Remove unused variable
Also change some code format in self review
2019-07-12 10:41:05 -07:00
Meng Xu 4da2071b49 ServerTeamRemover:Believe all servers are healthy when we start to remove
Before the serverTeamRemover tries to pick a team to remove,
it waits for all data movement to finish, which means all teams are healthy.

When the serverTeamRemover starts to pick a team to remove,
we believe all servers are healthy.
2019-07-11 23:47:31 -07:00
Meng Xu cf935ff9e6 Remove debug message and format code 2019-07-11 22:05:20 -07:00
Meng Xu bb758c18ee ServerTracker:Not always mark server undesired when no healthy team exists
Ideally, a storage server should not be colocated with tLogs,
so we want to mark such a server as undesired.

However, if there are not enough processes in the system, we
have no choice but to colocate them.

The old logic makes the server undesired if optimalTeamCount > 0;
However, there is a rare case when optimalTeamCount is 1 when it is supposed to be 0.
To overcome the situation, we add another condition healthyTeamCount > 0
as a guard to mark such a colocated server undesired.
2019-07-11 17:36:57 -07:00
Meng Xu 221e6945db TeamTracker:Fix bug in counting optimalTeamCount
When a teamTracker is cancelled, e.g., by redundant teamRemover or badTeamRemover,
we should decrease the optimalTeamCount if the team is considered as an
optimal team, i.e., all members' machine fitness is no worse than unset, and
the team is healthy.
2019-07-11 17:22:41 -07:00
Meng Xu c6e42d6119 ReplicationPolicy:Add trace for the name of each keyIndex 2019-07-10 19:29:29 -07:00
Meng Xu 4fae510633 AddBestMachineTeams:BugFix:Must build team when it has remainingMachineTeamBudget 2019-07-10 11:55:06 -07:00
Meng Xu 9816fb6aca ConsistencyCheck:Check minServerTeamOnServer larger than 0 2019-07-10 11:53:47 -07:00
Meng Xu aa459a2b03 AddTeamsBestOf:Calculate minTeamNumPerServer before using it 2019-07-09 14:28:39 -07:00
Meng Xu 522230f050 ConsistencyCheck:getTeamCollectionValid tries 10 times before returning false
Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
2019-07-09 11:46:57 -07:00
Meng Xu cf03b274a2 TeamTracker:Add traceTeamCollectionInfo 2019-07-08 23:01:25 -07:00
Meng Xu bf8af985b9 ServerTeamRemover: Change unit test to include the remover
Also further speed up serverTeamRemover in simulation, and
Add comments
2019-07-08 20:12:16 -07:00
Meng Xu 3b9618fe11 ServerTeamRemover:Speedup removing teams in simulation
Otherwise, simulation may time out when team remover needs to
remove hundreds of teams.
2019-07-08 18:17:21 -07:00
Meng Xu 08d76a7bbe ServerTeamRemover:Bug fix and clang-format 2019-07-08 17:08:32 -07:00
Meng Xu 9cc11e88c5 TeamBuilder:Reduce unnecessary calculation of remainingTeamBudget 2019-07-08 16:56:06 -07:00
Meng Xu 874539149a ServerTeamRemover: Resolve review comments
For removal, pick the team whose servers' minimum team count is the largest.

AddTeamsBestOf should keep building teams until each server has at least the
target number of teams.
2019-07-08 16:40:37 -07:00
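The selection rule in commit 874539149a above (remove the team whose least-covered server still has the most teams) can be sketched like this. The function name, the nested-vector input, and the tie-breaking choice are assumptions for illustration and are not taken from the FoundationDB source.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Each inner vector holds, for one candidate team, the number of teams that
// each of its member servers currently belongs to (hypothetical input shape).
// Returns the index of the team to remove, or -1 if there is no candidate.
int pickTeamToRemove(const std::vector<std::vector<int64_t>>& teams) {
    int best = -1;
    int64_t bestMin = -1; // team counts are assumed to be non-negative
    for (int i = 0; i < static_cast<int>(teams.size()); ++i) {
        if (teams[i].empty()) {
            continue; // a team with no members is not a candidate
        }
        const int64_t teamMin =
            *std::min_element(teams[i].begin(), teams[i].end());
        if (teamMin > bestMin) { // strict '>' keeps the first team on ties
            bestMin = teamMin;
            best = i;
        }
    }
    return best;
}
```

Choosing by the largest minimum means the removed team is the one whose removal hurts its least-covered member the least.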
Meng Xu 08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
A.J. Beamon 0a5c7608df Remove "Number" suffix from newly added events (and variables that feed the events). 2019-07-08 15:45:28 -07:00
A.J. Beamon f52c239ef8 Merge branch 'master' into trace-event-rename
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
2019-07-08 15:37:00 -07:00
Evan Tschannen ec11ef024b
Merge pull request #1798 from ajbeamon/merge-release-6.1-into-master
Merge release 6.1 into master
2019-07-08 09:02:56 -07:00
A.J. Beamon dd85edb08c
Merge pull request #1802 from xumengpanda/mengxu/DD-ensure-redundant-team-priority-as700-PR
TeamTracker:Set redundant team priority as PRIORITY_TEAM_REDUNDANT
2019-07-08 08:47:28 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
Meng Xu e8fb7564f5 Merge branch 'master' into mengxu/DD-ensure-redundant-team-priority-as700-PR 2019-07-05 17:28:12 -07:00
Meng Xu c7a996267c TeamRemover: Remove unused declaration
Also change state variable to variable.
2019-07-05 16:54:06 -07:00
Meng Xu 46d28a3b79 TeamTracker:Set redundant team priority as redundant
The redundant team removed by teamRemover will not exist
in the global teams data structure. So we will not find
the redundant team from shard-to-team mapping in the system key.

Before this change, teamTracker marks such team as PRIORITY_TEAM_UNHEALTHY.
With this change, it marks it as PRIORITY_TEAM_REDUNDANT
2019-07-05 15:24:00 -07:00
A.J. Beamon 2a56e011ea Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
2019-07-05 13:52:29 -07:00
Meng Xu 7ba6cd2d9d ServerTeamRemover:Reduce the overshot server team number to build
Each server has the maximum of DESIRED_TEAMS_PER_SERVER and
(DESIRED_TEAMS_PER_SERVER * storageTeamSize) / 2
2019-07-05 11:01:50 -07:00
A.J. Beamon 2a709ee5d0 Rename event details that use the suffix "Number" to indicate a count, as number could also imply an index. Rename a few other trace events and details that e.g. needed to be pluralized. 2019-07-05 08:54:21 -07:00
A.J. Beamon a3ac9c7eea Remove underscores from some trace event names 2019-07-05 08:08:29 -07:00
Meng Xu 2782d432ac ServerTeamRemover:Update the desired number and pick unhealthy teams first 2019-07-02 22:17:53 -07:00
Meng Xu 599fcb2e6d Add serverTeamRemover to remove redundant server teams 2019-07-02 17:40:37 -07:00
Meng Xu 7461c87ae6 AddTeamsBestOf: Build more teams than desired
We build more teams than we finally want so that we can use the serverTeamRemover() actor to remove the teams
whose members belong to too many teams. This allows us to get a more balanced number of teams per server.
2019-07-02 17:40:37 -07:00
Evan Tschannen 86b0224347 Merge branch 'release-6.1' of github.com:apple/foundationdb into release-6.1 2019-07-02 16:27:31 -07:00
Evan Tschannen 64e33bb4f9 added logging for maintenance mode 2019-07-02 16:25:29 -07:00
Meng Xu 7afbd10a10 Change teamRemover to machineTeamRemover 2019-07-02 15:16:34 -07:00
Meng Xu d2d6022ed4 StorageServerTracker:Do not always set doBuildTeams
When interface changes, we set doBuildTeams to true only when
the interface location changes.
2019-07-02 14:24:26 -07:00
Meng Xu de5bcaf588 minTeamNumber for server and machine cannot be uint64_t
Because the consistency check will try to convert the value to int64_t.
If no server exists, the variable will not be updated and thus overflows
when it is converted to int64_t
2019-07-01 21:39:18 -07:00
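The overflow described in commit de5bcaf588 above can be reproduced in isolation. The sketch below assumes the minimum is tracked by starting from the largest representable value and lowering it per server; that initialization, and both function names, are assumptions for illustration rather than the actual FoundationDB code.

```cpp
#include <cstdint>
#include <limits>

// Unsigned tracking: with no servers the sentinel is never lowered, and the
// later conversion to int64_t no longer fits the signed range.
inline int64_t reportedMinFromUnsigned() {
    uint64_t minTeams = std::numeric_limits<uint64_t>::max();
    // (loop over servers would go here; it runs zero times when none exist)
    return static_cast<int64_t>(minTeams); // wraps on two's-complement targets
}

// Signed tracking: the sentinel already fits in int64_t, so nothing wraps.
inline int64_t reportedMinFromSigned() {
    int64_t minTeams = std::numeric_limits<int64_t>::max();
    // (same empty loop over servers)
    return minTeams;
}
```

On two's-complement platforms the unsigned variant reports -1, which a consistency check could misread as a real count, while the signed variant keeps a value that is plainly a sentinel.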
Meng Xu 347a7ecdff MachineTeams:Make traceTeamCollectionInfo not an actor 2019-07-01 16:50:53 -07:00
Meng Xu b8cb883040 AddBestMachineTeams:Fix input must be non-negative value 2019-06-28 22:46:16 -07:00
Meng Xu 63c42533eb TraceTeamCollectionInfo:Remove delay 2019-06-28 16:19:58 -07:00
Meng Xu 875cb877ac TeamCollection: Apply clang-format 2019-06-28 16:01:05 -07:00
Meng Xu 0baae134f6 TeamCollectionInfo: Resolve review comments 2019-06-28 15:59:47 -07:00
Meng Xu cb681693df TeamCollection:Do NOT consider healthiness in counting team number
If a team is removed from DD, it will be marked as failed and eventually removed from the
global teams data structure.
Team healthiness is likely to be a temporary state which can change rather quickly.
2019-06-28 09:50:43 -07:00
Meng Xu ce7eb10cac TeamCollectionInfo: Only count team number for healthy server and machine 2019-06-27 19:04:22 -07:00
Meng Xu f889843332 Change traceTeamCollectionInfo to actor
There are cases where traceTeamCollectionInfo was called within the same execution block, i.e.,
no wait between the two traceTeamCollectionInfo calls.
Because simulation uses the same time for all execution instructions in the same execution block,
having more than one traceTeamCollectionInfo at the same time will mess up the trackLatest semantics.
When one of them is always chosen by the simulator, the simulation test will report a false positive error.

Changing this function to actor and adding a small delay inside the function can solve this problem.
2019-06-27 18:24:20 -07:00
Meng Xu 4fe3c7f749 TeamCollectionInfo:Revert to original version where it is 2019-06-27 17:09:21 -07:00
Meng Xu 42620e4831 TeamCollectionTest:GetTeamCollectionValid wait until values are correct 2019-06-27 16:52:36 -07:00
Meng Xu ee41311a54 TeamCollection:Call addTeamsBestOf when remainingTeamBudget is not 0 2019-06-27 15:29:26 -07:00
Meng Xu 2993a96de8 TeamCollectionInfo: Remove debug trace and apply clang format 2019-06-27 14:15:51 -07:00
Meng Xu 5f5c404291 BugFix:ReplicationPolicy always fails when teamSize is 1
Be careful whenever you use the selectReplicas function, because it may have bugs!
The bug here is that it always returns false (not able to find candidates)
when the storage team size is 1. This is wrong because when the storage team size
is 1, selectReplicas should return an empty result.
2019-06-27 13:47:49 -07:00
Meng Xu 90c158984c TeamCollection:Add extra trace events 2019-06-27 11:27:29 -07:00
Meng Xu aaf97542e9 TeamCollectionTest: Update unit test 2019-06-27 11:27:29 -07:00
Meng Xu 53324e4db7 TeamCollectionInfo: clang format 2019-06-27 11:27:29 -07:00
Meng Xu cc6a0e9bcd TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0
In ConfigureTest, one server may be left with 0 server teams, even if
we call buildTeams in the storageServerTracker.
2019-06-27 11:27:29 -07:00
Meng Xu c23d89c98a TeamCollection:Only count healthy teams for a server
When team collection adds new server teams, it picks a team with
the least number of teams. We should only consider the healthy teams
because the unhealthy ones will not be useful.
2019-06-27 11:27:29 -07:00
Meng Xu e1d459075a TeamCollection:Count healthy machine teams only
Team collection should prioritize building machine teams for a machine
that has the least number of healthy machine teams, instead of just
machine teams, because an unhealthy machine team will not be able to
produce more server teams.
2019-06-27 11:27:29 -07:00
Meng Xu ee916b337d TeamCollection:Change the target team number to build
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams such that each server and machine has
DESIRED_TEAMS_PER_SERVER server teams and machine teams.

This change calculates the number of teams (server teams and machine teams)
needed for each server and machine to reach that count.
2019-06-27 11:16:44 -07:00