foundationdb

Commit Graph

Author	SHA1	Message	Date
Meng Xu	57eab1f283	DataDistribution: Remove addAllTeams function The addAllTeams function can be replaced with the new addTeamsBestOf function by passing a large enough number of teams to build. Remove addAllTeams function and update the related unit tests. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 15:03:16 -08:00
Meng Xu	38c5c2562b	DataDistribution: Update NotEnoughServers unit test The buggify option may set 1 to the knob parameters (DESIRED_TEAMS_PER_SERVER and MAX_TEAMS_PER_SERVER). When this happens, the number of machine teams to build will be less than what we want, which prevents us from building enough server teams. To avoid this problem, we build machine teams before we call addTeamsBestOf to build server teams. We also add the ASSERT to ensure we build enough machine teams and server teams in the test case. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-05 14:36:48 -08:00
Meng Xu	f32c04c834	DataDistribution: Update NotEnoughServers unit test Change the test condition for the NotEnoughServers unit test. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-03 23:14:01 -08:00
Meng Xu	54a4d6b308	TeamCollection: Improve code efficiency Improve code efficiency with the following changes: 1) Change always-true if-statement to ASSERT; 2) Return when we are confident we will not find more machine teams. No functionality change. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 17:10:50 -08:00
Meng Xu	8d6c6e000b	DataDistribution: Mute the NotEnoughServers test Due to the randomness in choosing a server, we cannot guarantee to find all teams. The NotEnoughServers test case may create false positive bug reports in the correctness test. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:29:45 -08:00
Meng Xu	68dcec2240	DataDistribution: Change a unit test Try multiple times of addTeamsBestOf() when we cannot find an available team due to the pure randomness in choosing the server teams. The changes for the unit test reduces the false positive in the simulation test results. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:12:55 -08:00
Meng Xu	a43f579f66	TeamCollection: Change 1 unit test Relax the assert condition on the random unit test. Due to the randomness in choosing the machine team and the server team from the machine team, it is possible that we may not find the remaining several (e.g., 1 or 2) available teams. For example, there are at most 10 teams available, and we have found 9 teams, the chance of finding the last one is low when we do pure random selection. It is ok to not find every available team because 1) In reality, we only create a small fraction of available teams, and 2) In practical system, this situation only happens when most of servers are temporarily unhealthy. When this situation happens, we will abandon all existing teams and restart the build team from scratch. In simulation test, the situation happens 100 times out of 128613 test cases when we run RandomUnitTests.txt only. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-12-01 13:11:19 -08:00
Meng Xu	f311455c45	TeamCollection: Cleanup code and add checks Remove unnecessary sanity checks and remove the dead code. Add some necessary sanity checks. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-30 17:40:21 -08:00
Meng Xu	ea3bd1502d	TeamCollection: Calculate machine team number Calculate the number of machine teams in the same way as we calculate the number of server teams. Only count the machine teams that have the correct size and are healthy. Simplify code by removing unnecessary check. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-29 15:38:23 -08:00
Meng Xu	2b41ad5e57	TeamCollection: Pick server team randomly Pick server team purely randomly instead of picking the least used one. This is to avoid creating correlation in the server teams we pick when new machines are added. The logic is: First pick the one random least used server as chosen server; Then pick a machine team that has the server; Then pick a server on each machine in the machine team. We make sure the chosen server is picked. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-28 15:57:53 -08:00
Meng Xu	e4c9d4cbae	TeamCollection: Build all machine teams first Before we build server teams, we build the desired number of machine teams. Then we pick the least used server, from which we pick the least used machine team. Then we pick the least used server on each machine in the least used machine team to get the server team. Note: The logic of building machine teams should be independent from server teams. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-27 18:06:36 -08:00
Meng Xu	4c2c65c1b3	TeamCollection: Replace TraceEvent with ASSERT Replace one TraceEvent that never happens in correctness test with an ASSERT. Change format in one comment. Signed-off-by: Meng xu <meng_xu@apple.com>	2018-11-27 09:48:24 -08:00
Meng Xu	5cbff740ca	TeamCollection: Add ASSERT Remove sanity check code for performance benefit. Replace TraceEvent(SevError) with ASSERT. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-21 13:16:52 -08:00
Meng Xu	8de031f9a6	TeamCollection: clang-format Format the changes with git clang-format. No functional changes. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-21 11:18:26 -08:00
Meng Xu	12c3bec968	TeamCollection: Misc changes to resolve review comments No functional change. Report error in TraceEvent when invariant is violated. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-19 20:44:52 -08:00
Meng Xu	52c6a66601	TeamCollection: Fix a bug introduced in code review When we GetTeam, the data distribution actor may have zero teams in rare situation in the ConfigureTest.txt test. We should return an empty team in this situation instead of triggering error. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 16:34:38 -08:00
Meng Xu	f7a7e069f0	TeamCollection: Remove unnecessary comments Pass 41806 tests with no failure Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:56:35 -08:00
Meng Xu	73c58852f0	TeamCollection: Resolve code review comments Resolve code review comments: 1) Improve the code efficiency by avoiding unnecessary map search and avoiding unnecessary checking 2) Remove or comment out trace events when they can be spammy 3) Improve coding style Tested for 1 hour and no error was found. KillRegionCycle.txt test was excluded from the test because existing code cannot pass that test either Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:55:33 -08:00
Meng Xu	5051b35c61	TeamCollection: Use machine team to create server team Current server team collection logic does not consider the fact that multiple storage servers can run on the same machine. When multiple machines fail, all servers on the machines will fail, and the possibility of having one process team fail and lose data is very high. To reduce the possibility of losing data when multiple machines fail, we first create machine teams which span across different fault zones; we then create server teams based on machine teams by first picking 1 machine team, and then picking 1 server from each machine in the machine team. Signed-off-by: Meng Xu <meng_xu@apple.com>	2018-11-16 15:53:22 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	26c49f21be	fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed	2018-11-12 17:39:40 -08:00
Evan Tschannen	cd188a351e	fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy fix: badTeams were not accounted for when checking priorities	2018-11-11 12:33:31 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	7c23b68501	fix: we need to build teams if a server becomes healthy and it is not already on any teams	2018-11-09 18:06:00 -08:00
Evan Tschannen	3e2484baf7	fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region	2018-11-09 10:07:55 -08:00
Evan Tschannen	19ae063b66	fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available	2018-11-08 15:44:03 -08:00
Evan Tschannen	599cc6260e	fix: data distribution would not always add all subsets of emergency teams fix: data distribution would not stop tracking bad teams after all their data was moved to other teams fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy	2018-11-07 21:05:31 -08:00
Evan Tschannen	87d0b4c294	fix: the remote region does not have a full replica if usable_regions==1	2018-11-04 22:05:37 -08:00
Evan Tschannen	ad98acf795	fix: if the team started unhealthy and initialFailureReactionDelay was ready, we would not send relocations to the queue print wrong shard size team messages in simulation	2018-11-02 13:00:15 -07:00
Evan Tschannen	1d591acd0a	removed the countHealthyTeams check, because it was incorrect if it triggered during the wait(yield()) at the top of team tracker	2018-11-02 12:58:16 -07:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	db71b60d72	Merge pull request #819 from satherton/feature-redwood Redwood storage engine, initial/experimental version	2018-10-18 18:38:11 -07:00
Evan Tschannen	ed7036139a	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp	2018-10-18 17:00:52 -07:00
Evan Tschannen	e36b7cd417	Only log teamTracker trace events if sizes are not wrong, to avoid spammy messages when dropping a fearless configuration wrongSize previous was unneeded	2018-10-17 11:45:47 -07:00
Stephen Atherton	22f8a4efa9	Normalized all unit test names to begin with "/" if they should be included in random unit testing.	2018-10-05 22:09:58 -07:00
Evan Tschannen	3922e477a5	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/LogSystemDiskQueueAdapter.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp	2018-10-03 16:57:18 -07:00
Evan Tschannen	a92fc911ac	do not spin on a failed storage server recruitment	2018-10-02 17:31:07 -07:00
Evan Tschannen	e64c55dce0	fix: data distribution would use the wrong priority sometimes when fixing an incomplete movement, this led to the cluster thinking the data was replicated in all regions before it actually was	2018-09-28 12:15:23 -07:00
A.J. Beamon	92990d6aef	Merge release-6.0 into master	2018-09-21 16:14:39 -07:00
Evan Tschannen	861c8aa675	consider server health when building subsets of emergency teams	2018-09-19 17:57:01 -07:00
Evan Tschannen	702d018882	fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set	2018-09-19 16:11:57 -07:00
Evan Tschannen	6d18193b3a	fix: team->setHealthy was not being called correctly on initially unhealthy teams	2018-09-19 14:48:07 -07:00
A.J. Beamon	4d39509034	Merge pull request #774 from apple/release-6.0 Merge release-6.0 into master	2018-09-18 15:01:26 -07:00
Balachandar Namasivayam	d622cb1f6e	When the cluster is configured from fearless setup to usable_regions=1, master goes into a loop changing team priority. Fix this issue.	2018-09-12 18:29:49 -07:00
Evan Tschannen	90301f497f	Merge branch 'release-6.0' # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbrpc/TLSConnection.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/Status.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # versions.target	2018-09-05 16:06:33 -07:00
Evan Tschannen	d9906d7d6a	code cleanup	2018-09-05 13:42:10 -07:00
Evan Tschannen	4eaff42e4f	Merge pull request #712 from ajbeamon/remove-database-name-internal Eliminate use of database names (phase 1)	2018-09-05 10:35:00 -07:00
Evan Tschannen	65eabedb6c	fix: addSubsetOfEmergencyTeams could add unhealthy teams optimized teamTracker to check if it satisfies the policy more efficiently added yields to initialization to avoid slow tasks when adding lots of teams	2018-08-31 17:54:55 -07:00
Evan Tschannen	72c86e909e	fix: tracking of the number of unhealthy servers was incorrect fix: locality equality was only checking zoneId	2018-08-31 17:40:27 -07:00
Evan Tschannen	717c43a69f	merge 6.0 into master	2018-08-22 00:28:04 -07:00
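
Note on commit 5051b35c61: the two-step approach its message describes (first build machine teams that span distinct fault zones, then form a server team by taking one server from each machine in a machine team) can be illustrated with a minimal Python sketch. This is not the FoundationDB implementation, which lives in fdbserver/DataDistribution.actor.cpp. The function names, the dictionary layout, and the machine and zone labels below are all hypothetical, and the sketch enumerates every valid machine team rather than building a desired number of them.

```python
import random
from itertools import combinations


def build_machine_teams(machine_zone, team_size):
    """Return every machine team whose machines sit in distinct fault zones.

    machine_zone is a hypothetical mapping of machine id -> fault zone id.
    """
    teams = []
    for combo in combinations(sorted(machine_zone), team_size):
        zones = {machine_zone[m] for m in combo}
        if len(zones) == team_size:  # no two machines share a zone
            teams.append(combo)
    return teams


def build_server_team(machine_team, servers_on_machine, rng):
    """Pick one server at random from each machine in the machine team."""
    return tuple(rng.choice(servers_on_machine[m]) for m in machine_team)


# Hypothetical cluster: four machines in three zones, six storage servers.
machine_zone = {"m1": "z1", "m2": "z2", "m3": "z3", "m4": "z1"}
servers_on_machine = {
    "m1": ["s1", "s2"],
    "m2": ["s3"],
    "m3": ["s4", "s5"],
    "m4": ["s6"],
}

rng = random.Random(0)  # seeded only to make the sketch repeatable
machine_teams = build_machine_teams(machine_zone, team_size=3)
server_team = build_server_team(rng.choice(machine_teams), servers_on_machine, rng)
print(machine_teams)
print(server_team)
```

Because m1 and m4 share zone z1, they never appear in the same machine team, so no server team built this way can place two replicas on machines in one fault zone.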

127 Commits