Commit Graph

1185 Commits

Author SHA1 Message Date
Evan Tschannen d9626895b1
Merge pull request #964 from xumengpanda/mengxu/teamcollection-release
TeamCollection: Use machine teams to create server teams to increase availability at scale when a machine has multiple servers
2018-12-13 13:18:54 -08:00
Meng Xu 79d94f78f1 TeamCollection: Improve code efficiency
Further improve code efficiency by

1) Avoid rebuilding the machine locality map when a machine's locality changes.
This may leave the global machine locality map stale.
This is ok as long as we do not use the global map to validate
the machine team follows the locality policy.

2) Use ASSERT_WE_THINK instead of ASSERT to avoid runtime overhead.
ASSERT_WE_THINK will only validate the condition in simulation mode.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-12 22:38:38 -08:00
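The simulation-only check described in commit 79d94f78f1 can be illustrated with a minimal sketch. This is not FoundationDB's definition of ASSERT_WE_THINK. The names `g_isSimulated`, `SIM_ONLY_ASSERT`, `expensiveInvariantHolds`, and `recordTeamCount` are hypothetical and exist only to show the idea: the condition is evaluated when running under simulation and skipped entirely otherwise.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical flag; a real system would ask its runtime whether it is simulated.
bool g_isSimulated = false;

// Hypothetical macro, not FoundationDB's definition. The && short-circuits,
// so `cond` is not even evaluated outside simulation.
#define SIM_ONLY_ASSERT(cond)                                             \
    do {                                                                  \
        if (g_isSimulated && !(cond)) {                                   \
            std::fprintf(stderr, "Simulation check failed: %s\n", #cond); \
            std::abort();                                                 \
        }                                                                 \
    } while (0)

int g_expensiveChecksRun = 0;
int g_lastTeamCount = 0;

// Stand-in for a costly validation, e.g. walking every team.
bool expensiveInvariantHolds() {
    ++g_expensiveChecksRun;
    return true;
}

void recordTeamCount(int count) {
    assert(count >= 0);                         // cheap check, compiled in unless NDEBUG is defined
    SIM_ONLY_ASSERT(expensiveInvariantHolds()); // costly check, evaluated only in simulation
    g_lastTeamCount = count;
}
```

With `g_isSimulated` set to false, calling `recordTeamCount` never runs the expensive check, which is the runtime saving the commit message refers to.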
Meng Xu e197926c80 TeamCollection: Remove a duplicate function
Remove a duplicate function that has a different signature.
No functionality change.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-12 15:21:37 -08:00
Meng Xu ad7040efcd TeamCollection: Bug fix in handle server locality change
Make sure the link between server and machine is updated
in both server and machine.
Rename the function to better reflect its functionality.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-12 14:03:29 -08:00
Meng Xu e069b5c31c TeamCollection: Use clang format
No functional change.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-06 11:39:35 -08:00
Meng Xu 5d47b9c884 TeamCollection: Handle server locality change
A server locality may change from one machine to another.
This affects the old machine and machine team the server is on, and
the new machine the server moves to.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-05 22:23:14 -08:00
Meng Xu c5047bc8c3 TeamCollection: All machine teams are correct size
We only create correct size machine teams.
When configuration (e.g., team size) is changed,
the DDTeamCollection will be destroyed and rebuilt
so that the invariant will not be violated.

Based on the invariant, we can count the number of
machine teams more quickly.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-05 15:09:38 -08:00
Meng Xu 57eab1f283 DataDistribution: Remove addAllTeams function
The addAllTeams function can be replaced with the new addTeamsBestOf
function by passing a large enough number of teams to build.
Remove addAllTeams function and update the related unit tests.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-05 15:03:16 -08:00
Meng Xu 38c5c2562b DataDistribution: Update NotEnoughServers unit test
The buggify option may set the knob parameters
(DESIRED_TEAMS_PER_SERVER and MAX_TEAMS_PER_SERVER) to 1.
When this happens, the number of machine teams to build will be
less than what we want, which prevents us from building enough
server teams.

To avoid this problem, we build machine teams before
we call addTeamsBestOf to build server teams.

We also add the ASSERT to ensure we build enough machine teams and
server teams in the test case.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-05 14:36:48 -08:00
Meng Xu f32c04c834 DataDistribution: Update NotEnoughServers unit test
Change the test condition for the NotEnoughServers unit test.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-03 23:14:01 -08:00
Meng Xu 54a4d6b308 TeamCollection: Improve code efficiency
Improve code efficiency with the following changes:
1) Change always-true if-statement to ASSERT;
2) Return when we are confident we will not find more machine teams.

No functionality change.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 17:10:50 -08:00
Meng Xu 8d6c6e000b DataDistribution: Mute the NotEnoughServers test
Due to the randomness in choosing a server, we cannot guarantee to
find all teams. The NotEnoughServers test case may create false-positive
bug reports in the correctness test.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:29:45 -08:00
Meng Xu 68dcec2240 DataDistribution: Change a unit test
Retry addTeamsBestOf() multiple times when we cannot find an available team
due to the pure randomness in choosing the server teams.

The changes to the unit test reduce the false positives in the simulation test results.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:12:55 -08:00
Meng Xu a43f579f66 TeamCollection: Change 1 unit test
Relax the assert condition on the random unit test.
Due to the randomness in choosing the machine team and
the server team from the machine team, it is possible that
we may not find the remaining several (e.g., 1 or 2) available teams.
For example, if at most 10 teams are available and we have found
9 of them, the chance of finding the last one is low
when we do pure random selection.

It is ok to not find every available team because
1) In reality, we only create a small fraction of available teams, and
2) In a practical system, this situation only happens when most of the servers
   are *temporarily* unhealthy. When this situation happens, we will
   abandon all existing teams and restart the build team from scratch.

In simulation testing, the situation happens 100 times out of 128613 test cases
when we run RandomUnitTests.txt only.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:11:19 -08:00
Meng Xu f311455c45 TeamCollection: Cleanup code and add checks
Remove unnecessary sanity checks and remove the dead code.
Add some necessary sanity checks.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-30 17:40:21 -08:00
Meng Xu ea3bd1502d TeamCollection: Calculate machine team number
Calculate the number of machine teams in the same way
as we calculate the number of server teams.

Only count the machine teams that have the correct size and are healthy.

Simplify code by removing unnecessary check.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-29 15:38:23 -08:00
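As a rough illustration of the counting rule in commit ea3bd1502d, the sketch below counts only the machine teams that have the configured size and are healthy. `MachineTeamInfo` and `countUsableMachineTeams` are hypothetical names introduced here, not types or functions from the FoundationDB source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of a machine team.
struct MachineTeamInfo {
    std::size_t size; // number of machines in the team
    bool healthy;     // true if every machine in the team is healthy
};

// Counts the teams that are both correctly sized and healthy.
std::size_t countUsableMachineTeams(const std::vector<MachineTeamInfo>& teams,
                                    std::size_t teamSize) {
    assert(teamSize > 0); // a configured team size of zero makes no sense here
    return static_cast<std::size_t>(
        std::count_if(teams.begin(), teams.end(), [teamSize](const MachineTeamInfo& t) {
            return t.size == teamSize && t.healthy;
        }));
}
```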
Meng Xu 2b41ad5e57 TeamCollection: Pick server team randomly
Pick server team purely randomly instead of picking the least used one.
This is to avoid creating correlation in the server teams we pick when
new machines are added.

The logic is:
First pick one of the least used servers at random as the chosen server;
Then pick a machine team that has the server;
Then pick a server on each machine in the machine team.
We make sure the chosen server is picked.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-28 15:57:53 -08:00
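The three-step selection logic described in commit 2b41ad5e57 can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the commit: `Server`, `MachineTeam`, `pickRandom`, and `buildServerTeam` are hypothetical names, servers and machines are identified by plain strings, and "least used" is assumed to mean the lowest count of server teams a server already belongs to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical, simplified types; not FoundationDB's actual classes.
struct Server {
    std::string id;
    std::string machine; // id of the machine hosting this server
    int teamCount;       // number of server teams this server already belongs to
};
using MachineTeam = std::vector<std::string>; // list of machine ids

// Picks uniformly at random from a non-empty vector.
template <class T>
const T& pickRandom(const std::vector<T>& items, std::mt19937& rng) {
    assert(!items.empty());
    std::uniform_int_distribution<std::size_t> dist(0, items.size() - 1);
    return items[dist(rng)];
}

// Returns the ids of the servers in the new team, or an empty vector on failure.
std::vector<std::string> buildServerTeam(const std::vector<Server>& servers,
                                         const std::vector<MachineTeam>& machineTeams,
                                         std::mt19937& rng) {
    if (servers.empty()) return {};

    // Step 1: choose one server at random among the least used ones.
    int minCount = servers.front().teamCount;
    for (const Server& s : servers) minCount = std::min(minCount, s.teamCount);
    std::vector<Server> leastUsed;
    for (const Server& s : servers)
        if (s.teamCount == minCount) leastUsed.push_back(s);
    const Server chosen = pickRandom(leastUsed, rng);

    // Step 2: pick a random machine team that contains the chosen server's machine.
    std::vector<MachineTeam> candidates;
    for (const MachineTeam& mt : machineTeams)
        if (std::find(mt.begin(), mt.end(), chosen.machine) != mt.end())
            candidates.push_back(mt);
    if (candidates.empty()) return {};
    const MachineTeam team = pickRandom(candidates, rng);

    // Step 3: pick one server per machine, forcing the chosen server on its own machine.
    std::vector<std::string> result;
    for (const std::string& machine : team) {
        if (machine == chosen.machine) {
            result.push_back(chosen.id);
            continue;
        }
        std::vector<std::string> onMachine;
        for (const Server& s : servers)
            if (s.machine == machine) onMachine.push_back(s.id);
        if (onMachine.empty()) return {};
        result.push_back(pickRandom(onMachine, rng));
    }
    return result;
}
```

Passing the random engine in as a parameter keeps the sketch deterministic under a fixed seed, which is convenient for tests.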
Meng Xu e4c9d4cbae TeamCollection: Build all machine teams first
Before we build server teams, we build the desired number of machine teams.
Then we pick the least used server, from which we pick the least used machine team.
Then we pick the least used server on each machine in the least used machine team to get the server team.

Note: The logic of building machine teams should be independent from server teams.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-27 18:06:36 -08:00
A.J. Beamon 975711c389 Merge branch 'release-6.0' of github.com:apple/foundationdb
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2018-11-27 09:50:39 -08:00
Meng Xu 4c2c65c1b3 TeamCollection: Replace TraceEvent with ASSERT
Replace one TraceEvent that never happens in correctness test with an ASSERT.
Change format in one comment.

Signed-off-by: Meng xu <meng_xu@apple.com>
2018-11-27 09:48:24 -08:00
Meng Xu 5cbff740ca TeamCollection: Add ASSERT
Remove sanity check code for performance benefit.
Replace TraceEvent(SevError) with ASSERT.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-21 13:16:52 -08:00
Meng Xu 8de031f9a6 TeamCollection: clang-format
Format the changes with git clang-format.
No functional changes.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-21 11:18:26 -08:00
Meng Xu 12c3bec968 TeamCollection: Misc changes to resolve review comments
No functional change.
Report error in TraceEvent when invariant is violated.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-19 20:44:52 -08:00
Meng Xu 52c6a66601 TeamCollection: Fix a bug introduced in code review
When we GetTeam, the data distribution actor may have zero teams in
rare situations in the ConfigureTest.txt test.
We should return an empty team in this situation instead of triggering an error.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 16:34:38 -08:00
Meng Xu f7a7e069f0 TeamCollection: Remove unnecessary comments
Pass 41806 tests with no failure

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:56:35 -08:00
Meng Xu 73c58852f0 TeamCollection: Resolve code review comments
Resolve code review comments:
1) Improve the code efficiency by avoiding unnecessary map search
   and avoiding unnecessary checking
2) Remove or comment out trace events when they can be spammy
3) Improve coding style

Tested for 1 hour and no error was found.
KillRegionCycle.txt test was excluded from the test because
existing code cannot pass that test either

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:55:33 -08:00
Meng Xu 5051b35c61 TeamCollection: Use machine team to create server team
Current server team collection logic does not consider
the fact that multiple storage servers can run on the same machine.
When multiple machines fail, all servers on the machines will fail, and
the possibility of having one process team fail and lose data is very high.

To reduce the possibility of losing data when multiple machines fail,
we first create machine teams which span across different fault zones;
we then create server teams based on machine teams by
first picking 1 machine team, and then
picking 1 server from each machine in the machine team.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:53:22 -08:00
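The first half of the approach in commit 5051b35c61, building a machine team that spans different fault zones, might look roughly like the sketch below. It assumes the simplest possible policy, at most one machine per zone, whereas the real code validates teams against a configurable locality policy. `Machine` and `buildMachineTeam` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified machine record.
struct Machine {
    std::string id;
    std::string zone; // fault zone the machine lives in
};

// Builds one machine team of `teamSize` machines, each in a distinct zone.
// Returns an empty vector when there are not enough distinct zones.
std::vector<std::string> buildMachineTeam(std::vector<Machine> machines,
                                          std::size_t teamSize,
                                          std::mt19937& rng) {
    // Shuffle a local copy so different calls tend to yield different teams.
    std::shuffle(machines.begin(), machines.end(), rng);

    std::set<std::string> usedZones;
    std::vector<std::string> team;
    for (const Machine& m : machines) {
        if (team.size() == teamSize) break;
        // insert().second is true only if this zone was not used yet.
        if (usedZones.insert(m.zone).second) team.push_back(m.id);
    }
    assert(team.size() <= teamSize);
    if (team.size() < teamSize) team.clear(); // not enough distinct zones
    return team;
}
```

A server team is then formed by picking one server from each machine in such a machine team, so no two servers in a team share a machine or a zone.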
Evan Tschannen e45952bc53 Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/BackupContainer.actor.cpp
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/HTTP.actor.cpp
#	tests/BlobStore.txt
#	versions.target
2018-11-13 16:06:39 -08:00
Evan Tschannen 1bd615f954 fix: remoteDcIds will not actually have transaction logs unless usable regions is > 1 2018-11-13 12:36:04 -08:00
Evan Tschannen 4e54690005 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen 3f3a562f75 updated resolution balancing knobs to be a little more aggressive 2018-11-12 19:11:28 -08:00
Evan Tschannen 239bf882d8 Merge branch 'release-6.0' into feature-resolution-balancing-fix 2018-11-12 18:43:20 -08:00
Evan Tschannen 3f461f3706 updated comments 2018-11-12 18:42:29 -08:00
Evan Tschannen 6353a6724b strengthened the protections related to changing regions 2018-11-12 17:40:40 -08:00
Evan Tschannen 26c49f21be fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed 2018-11-12 17:39:40 -08:00
Evan Tschannen 3f39024640 buggify resolution balancing so that it still happens in simulation 2018-11-12 00:03:07 -08:00
Evan Tschannen 536ee826da tuned resolver balancing to keep the resolvers within 5MB per second of each other 2018-11-11 23:42:45 -08:00
Evan Tschannen 50f481b149 fix: peek local should not call peek all, because it is possible to still peek from remote log sets after a special tag 2018-11-11 19:16:25 -08:00
Evan Tschannen 7892da032f fix: Do not remove the locality entry for the current transaction logs when removing storage servers
fix: dcId_locality map could be incorrect after restarting recruitEverything
2018-11-11 12:37:53 -08:00
Evan Tschannen cd188a351e fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen 4b5d0b4e2c Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/AsyncFileBlobStore.actor.cpp
#	fdbclient/AsyncFileBlobStore.actor.h
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/BlobStore.h
#	fdbclient/HTTP.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/batcher.actor.h
#	fdbrpc/fdbrpc.vcxproj
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen a654183f63
Merge pull request #791 from ajbeamon/remove-cluster-from-iclientapi
Remove cluster from IClientApi (phase 2 of removing DB names)
2018-11-10 10:16:18 -08:00
Evan Tschannen 6a406bae72
Merge pull request #896 from ajbeamon/downgrade-incorrect-cluster-file-event
Downgrade the severity of IncorrectClusterFileContents the first time…
2018-11-10 10:06:36 -08:00
Evan Tschannen 6f4ad84777
Merge pull request #903 from ajbeamon/move-batcher-into-proxy
Move the sort of generic batcher from fdbrpc and make it specific to …
2018-11-10 09:56:03 -08:00
Evan Tschannen 7c23b68501 fix: we need to build teams if a server becomes healthy and it is not already on any teams 2018-11-09 18:06:00 -08:00
A.J. Beamon c3a06aa6f1 Fix indentation 2018-11-09 14:25:40 -08:00
A.J. Beamon 67a152ae9f Move the sort of generic batcher from fdbrpc and make it specific to batching commits in master proxy. Also a couple minor formatting changes. 2018-11-09 14:19:18 -08:00
Evan Tschannen 3e2484baf7 fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region 2018-11-09 10:07:55 -08:00
Evan Tschannen 6874e379fc fix: set the simulator’s view of usable regions to one during configure tests which can disable usable regions 2018-11-09 10:06:03 -08:00
Evan Tschannen 19ae063b66 fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available 2018-11-08 15:44:03 -08:00