Commit Graph

88 Commits

Author SHA1 Message Date
Evan Tschannen 26c49f21be fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed 2018-11-12 17:39:40 -08:00
Evan Tschannen cd188a351e fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen 7c23b68501 fix: we need to build teams if a server becomes healthy and it is not already on any teams 2018-11-09 18:06:00 -08:00
Evan Tschannen 3e2484baf7 fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region 2018-11-09 10:07:55 -08:00
Evan Tschannen 19ae063b66 fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available 2018-11-08 15:44:03 -08:00
Evan Tschannen 599cc6260e fix: data distribution would not always add all subsets of emergency teams
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy
2018-11-07 21:05:31 -08:00
Evan Tschannen 87d0b4c294 fix: the remote region does not have a full replica if usable_regions==1 2018-11-04 22:05:37 -08:00
Evan Tschannen ad98acf795 fix: if the team started unhealthy and initialFailureReactionDelay was ready, we would not send relocations to the queue
print messages in simulation for teams with the wrong shard size
2018-11-02 13:00:15 -07:00
Evan Tschannen 1d591acd0a removed the countHealthyTeams check, because it was incorrect if it triggered during the wait(yield()) at the top of team tracker 2018-11-02 12:58:16 -07:00
Evan Tschannen e36b7cd417 Only log teamTracker trace events if sizes are not wrong, to avoid spammy messages when dropping a fearless configuration
the previous wrongSize was unneeded
2018-10-17 11:45:47 -07:00
Evan Tschannen a92fc911ac do not spin on a failed storage server recruitment 2018-10-02 17:31:07 -07:00
Evan Tschannen e64c55dce0 fix: data distribution would sometimes use the wrong priority when fixing an incomplete movement; this led to the cluster thinking the data was replicated in all regions before it actually was 2018-09-28 12:15:23 -07:00
Evan Tschannen 861c8aa675 consider server health when building subsets of emergency teams 2018-09-19 17:57:01 -07:00
Evan Tschannen 702d018882 fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set 2018-09-19 16:11:57 -07:00
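For context on the async-map pitfall described in the commit above, here is a minimal, self-contained C++ sketch (not FoundationDB's actual AsyncMap; all names are illustrative) showing why a count-based membership test is misleading once a waiter registers via onChange, and what checking for an actually-set value looks like instead:

```cpp
// Hypothetical, simplified illustration (not FoundationDB's AsyncMap): onChange()
// must create a map entry to park the waiter, so a count()-style check can return
// true for keys that were never set. Tracking "set" explicitly avoids the bug.
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <vector>

template <class K, class V>
class AsyncMapSketch {
    struct Entry {
        std::optional<V> value;                       // present only after set()
        std::vector<std::function<void(V)>> waiters;  // callbacks parked by onChange()
    };
    std::map<K, Entry> entries;

public:
    // Registering interest creates the entry even though no value exists yet.
    void onChange(const K& k, std::function<void(V)> cb) {
        entries[k].waiters.push_back(std::move(cb));
    }

    void set(const K& k, const V& v) {
        Entry& e = entries[k];
        e.value = v;
        for (auto& cb : e.waiters) cb(v);
        e.waiters.clear();
    }

    // Misleading membership test: true as soon as anyone waits on the key.
    bool countBased(const K& k) const { return entries.count(k) > 0; }

    // Correct membership test: only true once the key has actually been set.
    bool isSet(const K& k) const {
        auto it = entries.find(k);
        return it != entries.end() && it->second.value.has_value();
    }
};

int main() {
    AsyncMapSketch<int, int> serverHealth;
    serverHealth.onChange(7, [](int) { /* react to server 7 changing */ });

    assert(serverHealth.countBased(7)); // misleading: nothing was ever set
    assert(!serverHealth.isSet(7));     // what the fix needs to check instead

    serverHealth.set(7, 1);
    assert(serverHealth.isSet(7));
    return 0;
}
```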
Evan Tschannen 6d18193b3a fix: team->setHealthy was not being called correctly on initially unhealthy teams 2018-09-19 14:48:07 -07:00
Balachandar Namasivayam d622cb1f6e When the cluster is configured from a fearless setup to usable_regions=1, the master goes into a loop changing team priority. Fix this issue. 2018-09-12 18:29:49 -07:00
Evan Tschannen d9906d7d6a code cleanup 2018-09-05 13:42:10 -07:00
Evan Tschannen 65eabedb6c fix: addSubsetOfEmergencyTeams could add unhealthy teams
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen 72c86e909e fix: tracking of the number of unhealthy servers was incorrect
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen a694364a39 fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues 2018-08-21 21:08:15 -07:00
Evan Tschannen 883050d12f moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution 2018-08-13 22:29:55 -07:00
Evan Tschannen 2341e5d8ad fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task 2018-08-13 19:46:47 -07:00
Evan Tschannen 9c918a28f6 fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2 2018-08-09 13:16:09 -07:00
Evan Tschannen 6f02ea843a prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment 2018-08-09 12:37:46 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
Evan Tschannen d12dac60ec fix: the same team was being added multiple times to primaryTeams 2018-07-12 12:10:18 -07:00
Evan Tschannen 9edbb8d6dd fix: do not consider a storage server failed until the full failure reaction time has elapsed. This was being short-circuited when the endpoint was permanently failed (the storage server has been rebooted) 2018-07-11 15:45:32 -07:00
Evan Tschannen 380b2895f7 fix: we need to wait for the yield in the team tracker not just after the initial failure reaction delay, but also after zeroHealthyTeams changes 2018-07-08 17:44:19 -07:00
Evan Tschannen d6c6e7d306 fix: do not attempt data movement to an unhealthy destination team
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Stephen Atherton 5a84b5e1ef Renamed ShardInfo to avoid a name conflict which sometimes causes the wrong destructor to be used at link time. 2018-06-30 18:44:46 -07:00
Evan Tschannen 0bdd25df23 ratekeeper does not control on remote storage servers 2018-06-18 17:23:55 -07:00
Evan Tschannen 1ccfb3a0f4 fix: log_anti_quorum was always 0 in simulation
removed durableStorageQuorum, because it is no longer a useful configuration parameter
2018-06-18 10:24:57 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen e28769b98e fixed trace event name 2018-06-11 12:43:08 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen 134b5d6f65 fix: only consider data distribution started when remote has recovered so that quiet database works correctly 2018-06-10 20:25:15 -07:00
Evan Tschannen 4903df5ce9 fix: give time to detect failed servers before building teams 2018-06-10 20:21:39 -07:00
Evan Tschannen 6e48d93d39 backed out the healthy team check because it was unnecessary 2018-06-10 12:43:32 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
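The naming rules described in the commit above could be expressed roughly as the validators below. This is a hedged sketch, not FoundationDB's actual simulation-time check; the function names and sample values are assumptions.

```cpp
// Sketch of the two naming rules from the commit message, plus the stderr
// warning it describes. Not the real checker; names are illustrative only.
#include <cctype>
#include <cstdio>
#include <string>

// Detail names: start with an uppercase letter and contain no underscores.
bool isValidDetailName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    return name.find('_') == std::string::npos;
}

// Type names: same rule, except one underscore is allowed (Context_Type),
// and the character after the underscore must be uppercase.
bool isValidTypeName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    size_t first = name.find('_');
    if (first == std::string::npos) return true;
    if (name.find('_', first + 1) != std::string::npos) return false;
    return first + 1 < name.size() && std::isupper(static_cast<unsigned char>(name[first + 1]));
}

int main() {
    const char* details[] = {"LatencySeconds", "latency_ms", "WrongSize"};
    for (const char* d : details)
        if (!isValidDetailName(d)) std::fprintf(stderr, "Bad detail name: %s\n", d);

    const char* types[] = {"MasterRecovery_Begin", "master_recovery_begin"};
    for (const char* t : types)
        if (!isValidTypeName(t)) std::fprintf(stderr, "Bad type name: %s\n", t);
    return 0;
}
```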
Evan Tschannen e4d5817679 fix: we must serve getTeam requests before readyToStart is set because we cannot complete relocateShard requests without getTeam responses from both team collections 2018-06-07 16:14:40 -07:00
Evan Tschannen 9f0c16f062 do not build teams which contain failed servers 2018-06-07 14:05:53 -07:00
Evan Tschannen b423d73b42 fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration 2018-06-07 12:29:25 -07:00
Evan Tschannen be06938d9d fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation. 2018-06-04 18:46:09 -07:00
Evan Tschannen 6cf9508aae finished a comment 2018-06-03 19:38:51 -07:00
Evan Tschannen b1935f1738 fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed between the known committed version and the recovery version, the storage server will need to see either the add or the remove when it peeks 2018-05-05 18:16:28 -07:00
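A rough illustration of the guard described in the commit above; the constant and function names are hypothetical, not FoundationDB's identifiers:

```cpp
// Hedged sketch under assumed names (MIN_VERSION_GAP, canRemoveStorageServer are
// illustrative): refuse to drop a storage server until enough versions have
// passed since it was added, per the "within 5 million versions" rule above.
#include <cassert>
#include <cstdint>

constexpr int64_t MIN_VERSION_GAP = 5'000'000;

// True only when the server has existed long enough (in versions) that a recovery
// spanning the known committed version through the recovery version cannot miss
// both the add and the remove.
bool canRemoveStorageServer(int64_t currentVersion, int64_t addedVersion) {
    return currentVersion - addedVersion >= MIN_VERSION_GAP;
}

int main() {
    assert(!canRemoveStorageServer(/*currentVersion=*/1'200'000, /*addedVersion=*/1'000'000));
    assert(canRemoveStorageServer(/*currentVersion=*/7'000'000, /*addedVersion=*/1'000'000));
    return 0;
}
```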
Evan Tschannen 440e2ae609 fix: data distribution logic was incorrect for finding a complete source team in a failed DC 2018-05-01 23:08:31 -07:00
Evan Tschannen 10d25927cd Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen 9cdabfed0e added useful trace events 2018-04-29 18:54:47 -07:00
Evan Tschannen 73597f190e fix: new tlogs are initialized with exactly the tags which existed at the recovery version 2018-04-22 20:28:01 -07:00
Bruce Mitchener 9cdf25eda3 Fix some typos. 2018-04-20 00:49:22 +07:00