Evan Tschannen
26c49f21be
fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed
2018-11-12 17:39:40 -08:00
Evan Tschannen
cd188a351e
fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
...
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen
7c23b68501
fix: we need to build teams if a server becomes healthy and it is not already on any teams
2018-11-09 18:06:00 -08:00
Evan Tschannen
3e2484baf7
fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region
2018-11-09 10:07:55 -08:00
Evan Tschannen
19ae063b66
fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available
2018-11-08 15:44:03 -08:00
Evan Tschannen
599cc6260e
fix: data distribution would not always add all subsets of emergency teams
...
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfied the policy
2018-11-07 21:05:31 -08:00
Evan Tschannen
87d0b4c294
fix: the remote region does not have a full replica if usable_regions==1
2018-11-04 22:05:37 -08:00
Evan Tschannen
ad98acf795
fix: if the team started unhealthy and initialFailureReactionDelay was ready, we would not send relocations to the queue
...
print messages in simulation for teams with the wrong shard size
2018-11-02 13:00:15 -07:00
Evan Tschannen
1d591acd0a
removed the countHealthyTeams check, because it was incorrect if it triggered during the wait(yield()) at the top of team tracker
2018-11-02 12:58:16 -07:00
Evan Tschannen
e36b7cd417
Only log teamTracker trace events if sizes are not wrong, to avoid spammy messages when dropping a fearless configuration
...
the previous wrongSize was unneeded
2018-10-17 11:45:47 -07:00
Evan Tschannen
a92fc911ac
do not spin on a failed storage server recruitment
2018-10-02 17:31:07 -07:00
Evan Tschannen
e64c55dce0
fix: data distribution would sometimes use the wrong priority when fixing an incomplete movement, which led the cluster to think the data was replicated in all regions before it actually was
2018-09-28 12:15:23 -07:00
Evan Tschannen
861c8aa675
consider server health when building subsets of emergency teams
2018-09-19 17:57:01 -07:00
Evan Tschannen
702d018882
fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set
2018-09-19 16:11:57 -07:00
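The commit above points at a subtle pitfall: on a map that lets callers wait for changes, merely subscribing to a key creates an entry for it, so count() reports the key as present before any value has been set. The sketch below is a minimal plain-C++ illustration of that trap (it is not FoundationDB's AsyncMap; the class and member names are hypothetical).

```cpp
// Minimal sketch showing why count() is unreliable on a map that creates
// entries for waiters: subscribing to a key inserts a default-constructed
// slot, so count() sees the key before any value is actually set.
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

template <class K, class V>
class WaitableMap {
    struct Entry {
        V value{};                                   // default-constructed until set() runs
        bool isSet = false;
        std::vector<std::function<void(const V&)>> waiters;
    };
    std::map<K, Entry> entries;

public:
    // Subscribing creates the entry if it does not exist yet -- this is the trap.
    void onChange(const K& key, std::function<void(const V&)> cb) {
        entries[key].waiters.push_back(std::move(cb));
    }
    void set(const K& key, const V& value) {
        Entry& e = entries[key];
        e.value = value;
        e.isSet = true;
        for (auto& w : e.waiters) w(value);
        e.waiters.clear();
    }
    // Unreliable: true as soon as anyone has subscribed to the key.
    bool countSaysPresent(const K& key) const { return entries.count(key) > 0; }
    // Reliable: only true after set() has actually run.
    bool isReallySet(const K& key) const {
        auto it = entries.find(key);
        return it != entries.end() && it->second.isSet;
    }
};

int main() {
    WaitableMap<std::string, int> m;
    m.onChange("serverA", [](const int&) {});   // a waiter registers interest
    assert(m.countSaysPresent("serverA"));      // count() already claims the key exists...
    assert(!m.isReallySet("serverA"));          // ...but no value was ever set
    m.set("serverA", 42);
    assert(m.isReallySet("serverA"));
    return 0;
}
```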
Evan Tschannen
6d18193b3a
fix: team->setHealthy was not being called correctly on initially unhealthy teams
2018-09-19 14:48:07 -07:00
Balachandar Namasivayam
d622cb1f6e
When the cluster is configured from a fearless setup to usable_regions=1, the master goes into a loop changing team priorities. Fix this issue.
2018-09-12 18:29:49 -07:00
Evan Tschannen
d9906d7d6a
code cleanup
2018-09-05 13:42:10 -07:00
Evan Tschannen
65eabedb6c
fix: addSubsetOfEmergencyTeams could add unhealthy teams
...
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen
72c86e909e
fix: tracking of the number of unhealthy servers was incorrect
...
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen
a694364a39
fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues
2018-08-21 21:08:15 -07:00
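A minimal sketch of the filtering idea described in the commit above (the names are hypothetical, not FoundationDB's internals): a team with more members than the configured storage team size can never be counted as healthy, so skipping it at tracking time avoids carrying many useless entries after a usable_regions=2 to usable_regions=1 downgrade.

```cpp
// Oversized teams are permanently unhealthy; do not add them to tracking structures.
#include <cstdint>
#include <vector>

struct Team {
    std::vector<uint64_t> serverIds;   // hypothetical: IDs of the storage servers on the team
};

bool shouldTrackTeam(const Team& team, int storageTeamSize) {
    return static_cast<int>(team.serverIds.size()) <= storageTeamSize;
}
```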
Evan Tschannen
883050d12f
moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution
2018-08-13 22:29:55 -07:00
Evan Tschannen
2341e5d8ad
fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task
2018-08-13 19:46:47 -07:00
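The commit above is about breaking a very large initialization loop into short tasks. The sketch below illustrates the general pattern under assumed names (it is plain C++, not flow/actor code; `yieldToScheduler` stands in for handing control back to the run loop, which in FoundationDB would be a wait(yield()) inside an actor).

```cpp
// Seeding a large data structure -- e.g. a million initial shards -- in one
// uninterrupted loop blocks the event loop and shows up as a multi-second
// "slow task". Chunking the loop and yielding every N items keeps tasks short.
#include <cstddef>
#include <functional>
#include <vector>

struct Shard { int begin; int end; };   // hypothetical placeholder type

void insertInitialShards(const std::vector<Shard>& shards,
                         const std::function<void(const Shard&)>& insert,
                         const std::function<void()>& yieldToScheduler,
                         std::size_t yieldEvery = 1000) {
    std::size_t sinceLastYield = 0;
    for (const Shard& s : shards) {
        insert(s);
        if (++sinceLastYield >= yieldEvery) {
            yieldToScheduler();   // break the work into many short tasks
            sinceLastYield = 0;
        }
    }
}
```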
Evan Tschannen
9c918a28f6
fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2
2018-08-09 13:16:09 -07:00
Evan Tschannen
6f02ea843a
prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment
2018-08-09 12:37:46 -07:00
Evan Tschannen
392c73affb
fixed a few slow tasks
2018-07-12 14:06:59 -07:00
Evan Tschannen
d12dac60ec
fix: the same team was being added multiple times to primaryTeams
2018-07-12 12:10:18 -07:00
Evan Tschannen
9edbb8d6dd
fix: do not consider a storage server failed until the full failure reaction time has elapsed. This was being short-circuited when the endpoint was permanently failed (the storage server has been rebooted)
2018-07-11 15:45:32 -07:00
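A sketch of the rule stated in the commit above, with hypothetical names: even when an endpoint reports a permanent failure (for example, the storage server process was rebooted), the full failure reaction time should still elapse before the server is treated as failed, rather than short-circuiting the delay.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct ServerStatus {
    Clock::time_point failureFirstObserved;
    bool endpointPermanentlyFailed = false;   // e.g. the process was rebooted
};

bool considerFailed(const ServerStatus& s, Clock::time_point now,
                    std::chrono::seconds failureReactionTime) {
    // The permanent-failure flag must NOT bypass the reaction delay;
    // only the elapsed time decides.
    return now - s.failureFirstObserved >= failureReactionTime;
}
```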
Evan Tschannen
380b2895f7
fix: we need to wait for the yield in the team tracker not just after the initial failure reaction delay, but also after zeroHealthyTeams changes
2018-07-08 17:44:19 -07:00
Evan Tschannen
d6c6e7d306
fix: do not attempt data movement to an unhealthy destination team
...
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Stephen Atherton
5a84b5e1ef
Renamed ShardInfo to avoid a name conflict which sometimes causes the wrong destructor to be used at link time.
2018-06-30 18:44:46 -07:00
Evan Tschannen
0bdd25df23
ratekeeper does not control on remote storage servers
2018-06-18 17:23:55 -07:00
Evan Tschannen
1ccfb3a0f4
fix: log_anti_quorum was always 0 in simulation
...
removed durableStorageQuorum, because it is no longer a useful configuration parameter
2018-06-18 10:24:57 -07:00
Evan Tschannen
0913368651
added usable_regions to specify if we will replicate into a remote region
...
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
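A sketch of the configuration semantics described in the commit above, using hypothetical types rather than FoundationDB's actual DatabaseConfiguration: usable_regions controls whether a remote region holds a replica, remote replication falls back to the primary replication factor unless a per-region override is given, and the old top-level remote_logs knob survives only as an override inside the regions object.

```cpp
#include <optional>

struct RegionOverrides {
    std::optional<int> remoteReplication;   // per-region override, if any
    std::optional<int> remoteLogs;          // replaces the old top-level remote_logs knob
};

struct EffectiveConfig {
    bool replicateToRemote;
    int remoteReplication;
};

EffectiveConfig resolve(int usableRegions, int primaryReplication, const RegionOverrides& region) {
    EffectiveConfig out;
    out.replicateToRemote = usableRegions > 1;   // usable_regions=1 means no remote replica
    out.remoteReplication = region.remoteReplication.value_or(primaryReplication);   // defaults to primary
    return out;
}
```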
Evan Tschannen
e28769b98e
fixed trace event name
2018-06-11 12:43:08 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
134b5d6f65
fix: only consider data distribution started when the remote region has recovered, so that quietDatabase works correctly
2018-06-10 20:25:15 -07:00
Evan Tschannen
4903df5ce9
fix: give time to detect failed servers before building teams
2018-06-10 20:21:39 -07:00
Evan Tschannen
6e48d93d39
backed out the healthy team check because it was unnecessary
2018-06-10 12:43:32 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
...
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
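A minimal checker for the naming rules listed in the commit above (a sketch, not the actual simulation check): detail names must start with an uppercase letter and contain no underscores; event type names follow the same rule but may contain a single underscore, and the character after the underscore must also be uppercase (the Context_Type pattern).

```cpp
#include <cctype>
#include <string>

bool isValidDetailName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    return name.find('_') == std::string::npos;
}

bool isValidEventTypeName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    const auto underscore = name.find('_');
    if (underscore == std::string::npos) return true;                       // no underscore at all
    if (name.find('_', underscore + 1) != std::string::npos) return false;  // more than one underscore
    return underscore + 1 < name.size() &&
           std::isupper(static_cast<unsigned char>(name[underscore + 1]));  // Context_Type pattern
}
```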
Evan Tschannen
e4d5817679
fix: we must serve getTeam requests before readyToStart is set because we cannot complete relocateShard requests without getTeam responses from both team collections
2018-06-07 16:14:40 -07:00
Evan Tschannen
9f0c16f062
do not build teams which contain failed servers
2018-06-07 14:05:53 -07:00
Evan Tschannen
b423d73b42
fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration
2018-06-07 12:29:25 -07:00
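A sketch of the completion check described in the commit above, with hypothetical names: before marking a shard relocation as finished, confirm that every destination storage server has made the current recovery version durable, so a needed source server is not dropped while the new copies are still behind.

```cpp
#include <cstdint>
#include <vector>

struct StorageServerState {
    int64_t durableVersion = 0;   // highest version this server has made durable
};

bool canFinishRelocation(const std::vector<StorageServerState>& destinationServers,
                         int64_t recoveryVersion) {
    for (const auto& ss : destinationServers) {
        if (ss.durableVersion < recoveryVersion) return false;   // someone is still behind
    }
    return true;
}
```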
Evan Tschannen
be06938d9d
fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation.
2018-06-04 18:46:09 -07:00
Evan Tschannen
6cf9508aae
finished a comment
2018-06-03 19:38:51 -07:00
Evan Tschannen
b1935f1738
fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed between the known committed version and the recovery version, the storage server will need to see either the add or the remove when it peeks
2018-05-05 18:16:28 -07:00
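A sketch of the guard described in the commit above; the 5 million figure comes from the commit message, and the function and parameter names are hypothetical.

```cpp
#include <cstdint>

// Block removal of a storage server until at least 5 million versions have
// passed since it was added, so a peek spanning the known committed version
// and the recovery version sees either the add or the remove.
constexpr int64_t MIN_VERSIONS_BEFORE_REMOVAL = 5'000'000;

bool canRemoveStorageServer(int64_t addedVersion, int64_t currentVersion) {
    return currentVersion - addedVersion >= MIN_VERSIONS_BEFORE_REMOVAL;
}
```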
Evan Tschannen
440e2ae609
fix: data distribution logic was incorrect for finding a complete source team in a failed DC
2018-05-01 23:08:31 -07:00
Evan Tschannen
10d25927cd
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen
9cdabfed0e
added useful trace events
2018-04-29 18:54:47 -07:00
Evan Tschannen
73597f190e
fix: new tlogs are initialized with exactly the tags which existed at the recovery version
2018-04-22 20:28:01 -07:00
Bruce Mitchener
9cdf25eda3
Fix some typos.
2018-04-20 00:49:22 +07:00