344 Commits

Author SHA1 Message Date
Jon Fu c908c6c1db added command to fdbcli and changes to SystemData and ManagementAPI 2019-08-27 14:39:43 -07:00
Evan Tschannen ba54508c47 code cleanup 2019-08-06 16:30:30 -07:00
Evan Tschannen 5dc4c80d44 fix: the machineAttrition workload did not ensure that healthyZone was always cleared
fix: an assert could trigger spuriously
2019-08-05 15:00:17 -07:00
Evan Tschannen 7d7aa27c2d
Merge pull request #1814 from dongxinEric/feature/1508/finer-grained-dd-controls
Added finer grained controls to DataDistribution in fdbcli.
2019-07-31 17:36:20 -07:00
Evan Tschannen bba01c6531 fix: add subsetOfEmergencyTeam could add an unsorted team 2019-07-31 16:02:08 -07:00
Xin Dong b653ddb30d Final clean ups after rebasing master 2019-07-30 22:35:34 -07:00
Xin Dong 5d20364423 Address review comments 2019-07-30 22:24:30 -07:00
Xin Dong 1922c39377 Resolve review comments. 100K run shows one suspicious ASSERT_WE_THINK failure which I think could be a race. 2019-07-30 22:24:30 -07:00
Xin Dong c6e5472d8d Apply suggestions from code review
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-07-30 22:20:45 -07:00
Xin Dong f5d6e3a5b3 - Addressed review comments
- Added test for the storage server failure disable switch
2019-07-30 22:20:45 -07:00
Xin Dong ae11efcb0a Made following changes:
- Make sure the disabled data distribution won't be accidentally enabled by the 'maintenance' command
- Make sure the status json reflects the status of DD accordingly
- Make sure the CLI can play with the new DD states correctly, i.e. print out warnings when necessary
2019-07-30 22:20:45 -07:00
Xin Dong 4ecfc9830f Added finer grained controls to DataDistribution in fdbcli. What's happening under the hood is:
- Use the pre-existing 'healthyZone' key and write a special value to it in order to disable DD for all storage server failures
- Use a new system key 'rebalanceDDIgnored' to disable/enable DD for all rebalance reasons (MountainChopper and ValleyFiller)

Kicked off two 200K correctness and showed no related errors.
2019-07-30 22:17:21 -07:00
Evan Tschannen dd4ab63d90 fixed another bad trace event name 2019-07-30 19:36:26 -07:00
Evan Tschannen b8cd51c4d3 fixed invalid trace event name 2019-07-30 19:23:54 -07:00
Evan Tschannen a78a97f186
Merge pull request #1908 from etschannen/feature-better-dd
A few data distribution improvements
2019-07-30 17:34:50 -07:00
sramamoorthy a88aaa0f04 review comment 2019-07-30 17:04:51 -07:00
sramamoorthy 63941e0d96 disable DD with an in-memory flag and use in snapv2 2019-07-30 17:04:51 -07:00
Evan Tschannen 5dd9043fd3 addressed review comments 2019-07-30 17:04:41 -07:00
Evan Tschannen 481642fbd4 Merge branch 'master' into feature-better-dd 2019-07-30 16:56:27 -07:00
Evan Tschannen a3fe3d4324
Merge pull request #1923 from xumengpanda/mengxu/evan-dd-improvement-minor-improvement
DD:Change condition for lastBuildTeamsFailed
2019-07-30 16:54:42 -07:00
A.J. Beamon 14648e20f9
Merge pull request #1901 from ajbeamon/data-distribution-receives-bytes-input-rate
Send bytes input rate to data distribution
2019-07-30 15:01:36 -07:00
Meng Xu 0e50656c7f DD:Change condition for lastBuildTeamsFailed
Change the threshold team number per server that should set lastBuildTeamsFailed
from DESIRED_TEAMS_PER_SERVER to
(SERVER_KNOBS->DESIRED_TEAMS_PER_SERVER * (configuration.storageTeamSize + 1)) / 2;
2019-07-30 11:07:02 -07:00
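The threshold change described in commit 0e50656c7f can be sketched as below. This is an illustrative Python rendering of the quoted C++ expression, not the actual FoundationDB code: the function name and the sample values (5 desired teams per server, team sizes 2 and 3) are assumptions chosen for the example. Floor division stands in for C++ integer division, which is equivalent here because the inputs are non-negative.

```python
def team_threshold_per_server(desired_teams_per_server: int, storage_team_size: int) -> int:
    """Hypothetical helper mirroring the expression quoted in the commit message:
    (DESIRED_TEAMS_PER_SERVER * (storageTeamSize + 1)) / 2
    """
    # Floor division matches C++ integer division for non-negative operands.
    return (desired_teams_per_server * (storage_team_size + 1)) // 2


# Sample values are assumptions for illustration only.
old_threshold = 5                                  # previously DESIRED_TEAMS_PER_SERVER itself
new_triple = team_threshold_per_server(5, 3)       # (5 * 4) // 2
new_double = team_threshold_per_server(5, 2)       # (5 * 3) // 2
print(old_threshold, new_triple, new_double)
```

With these sample inputs the new threshold is higher than the old one and grows with the storage team size.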
Evan Tschannen a0f26b604c
Merge pull request #1907 from etschannen/master
A number of bug fixes for rare problems found by correctness testing
2019-07-29 21:04:38 -07:00
sramamoorthy 5a56f6b456 minor snap create client improvement and bug fixes 2019-07-29 20:28:22 -07:00
Evan Tschannen cc4481b71a team builders prefer to make teams which overlap less with existing teams 2019-07-28 23:44:23 -07:00
Evan Tschannen 7e97bd181a fix: we need to build teams when a server becomes healthy if it is possible another server does not have enough teams 2019-07-28 19:31:21 -07:00
Evan Tschannen 04dd293af0
Merge pull request #1874 from xumengpanda/mengxu/DD-code-read
DataDistribution:Add comments to help understand the code
2019-07-26 13:30:44 -07:00
Evan Tschannen 2123fa1c3a
Merge pull request #1853 from xumengpanda/mengxu/redundantTeamRemoverPriority-PR
Lower the RelocateShard priority for removing redundant teams
2019-07-26 13:28:42 -07:00
A.J. Beamon b91795d288 Send bytes input rate to DD. 2019-07-25 16:27:32 -07:00
senthil-ram edeec8a622 Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: Alex Miller <35046903+alexmiller-apple@users.noreply.github.com>
2019-07-24 15:36:28 -07:00
sramamoorthy a65c9f92ed get rid of all timeouts and other changes 2019-07-24 15:36:28 -07:00
sramamoorthy a2f2ad96ff code review comments and merge to master changes 2019-07-24 15:36:28 -07:00
sramamoorthy 4f2bb561de snapshot only local tlogs and not the satellite 2019-07-24 15:36:28 -07:00
sramamoorthy 021c949801 increase snaptime out to 15s for simulator 2019-07-24 15:36:28 -07:00
sramamoorthy 869f77aef1 Few cosmetic edits and fixes 2019-07-24 15:36:28 -07:00
sramamoorthy ddd4523816 bug fix in timeout & header file re-arrange in DD 2019-07-24 15:36:28 -07:00
sramamoorthy 31c010b393 few minor fixes 2019-07-24 15:36:28 -07:00
sramamoorthy 62c14dae72 disable dd during snap and enable in restore 2019-07-24 15:36:28 -07:00
sramamoorthy ba6bccce73 snap v2: DD changes - snapshot orchestration logic 2019-07-24 15:36:28 -07:00
Meng Xu b7478f5dd3 DD:Add comments to help understand code
Add comments to explain the functionalities of some code.
2019-07-22 11:23:16 -07:00
Meng Xu 378db79441 Resolve conflict when merge with master 2019-07-22 10:56:20 -07:00
Meng Xu dae4436a3d TC:UnitTest:Change invariant due to alg change 2019-07-20 21:06:54 -07:00
Meng Xu b001a9ebe8 ServerTeamRemover runs after machineTeamRemover finishes
If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnecessary extra data movement.
2019-07-19 16:48:52 -07:00
Meng Xu 64bee63dbc Resolve two review comments
1) No need to check server with only one team when teamRemover finds
a server team or machine team to remove

2) Fix optimalTeamCount counting in teamTracker
2019-07-18 18:46:31 -07:00
Meng Xu 915732ce24 TeamRemover:Reset the removed team counter after removal 2019-07-16 11:17:51 -07:00
Meng Xu 20f067e794 Merge with master:Resolve conflict with PR#1797 2019-07-16 10:52:28 -07:00
Meng Xu 243504b125 DD:Clang format changes 2019-07-15 18:40:14 -07:00
Meng Xu 94e9b8a3b4 Do not remove a team whose min team number is less than target
If the minimum number of teams of servers in a team is less than the
target value (desired_team_number_per_server * (teamSize + 1) / 2),
the team remover should not remove it. Otherwise, DD will oscillate in
building more teams and removing redundant teams.

Do not do consistency check for three_data_hall mode because when
machines are not evenly distributed across data halls, we will
need to build more teams than the total desired number to make sure
the number of teams per server is no less than the target value.
2019-07-15 18:30:13 -07:00
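The removal guard described in commit 94e9b8a3b4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the team remover's real implementation: the function name, the dictionary of per-server team counts, and the sample numbers are all hypothetical. Only the target formula, desired_team_number_per_server * (teamSize + 1) / 2, comes from the commit message.

```python
from typing import Dict, Sequence


def may_remove_team(team: Sequence[str],
                    teams_per_server: Dict[str, int],
                    desired_teams_per_server: int,
                    team_size: int) -> bool:
    """Hypothetical guard: a team is removable only if every member server
    would still be at or above the target number of teams per server."""
    # Target value from the commit message, using floor division like C++ ints.
    target = desired_teams_per_server * (team_size + 1) // 2
    # The minimum team count among the servers in this team.
    min_teams = min(teams_per_server[server] for server in team)
    # Skip removal when the least-covered server is already below target.
    return min_teams >= target


# Sample data is assumed for illustration: target is 5 * (3 + 1) // 2 = 10.
counts = {"ss1": 12, "ss2": 10, "ss3": 9}
print(may_remove_team(["ss1", "ss2", "ss3"], counts, 5, 3))  # ss3 is below target
print(may_remove_team(["ss1", "ss2"], counts, 5, 3))         # both at or above target
```

Refusing to remove a team that contains an under-target server is what prevents the build/remove oscillation the commit message describes.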
Meng Xu cafe9b9412 TC:Target team num per server is desired number
Do not overbuild teams because we may oscillate between building more teams and
removing the redundant teams. The oscillation happens when the machines are not
evenly distributed across availability zones.
For example, in three_data_hall mode, suppose two data halls have 1 machine each and
the 3rd data hall has 3 machines. To build enough teams for the servers
in the 3rd data hall, we will overbuild teams. However,
the teamRemover will then remove those newly built teams.
2019-07-15 17:32:51 -07:00
Meng Xu 415622f465 MachineTeamRemover:Change to remove MT with most teams
Change to remove machine team with most machine teams, using the same
logic as the serverTeamRemover.

The feature is guarded by the TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
2019-07-15 14:29:49 -07:00