Commit Graph

88 Commits

Author SHA1 Message Date
Evan Tschannen 26c49f21be fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed 2018-11-12 17:39:40 -08:00
Evan Tschannen cd188a351e fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen 7c23b68501 fix: we need to build teams if a server becomes healthy and it is not already on any teams 2018-11-09 18:06:00 -08:00
Evan Tschannen 3e2484baf7 fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region 2018-11-09 10:07:55 -08:00
Evan Tschannen 19ae063b66 fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available 2018-11-08 15:44:03 -08:00
Evan Tschannen 599cc6260e fix: data distribution would not always add all subsets of emergency teams
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy
2018-11-07 21:05:31 -08:00
Evan Tschannen 87d0b4c294 fix: the remote region does not have a full replica if usable_regions==1 2018-11-04 22:05:37 -08:00
Evan Tschannen ad98acf795 fix: if the team started unhealthy and initialFailureReactionDelay was ready, we would not send relocations to the queue
print messages in simulation for teams with the wrong shard size
2018-11-02 13:00:15 -07:00
Evan Tschannen 1d591acd0a removed the countHealthyTeams check, because it was incorrect if it triggered during the wait(yield()) at the top of team tracker 2018-11-02 12:58:16 -07:00
Evan Tschannen e36b7cd417 Only log teamTracker trace events if sizes are not wrong, to avoid spammy messages when dropping a fearless configuration
the previous wrongSize was unneeded
2018-10-17 11:45:47 -07:00
Evan Tschannen a92fc911ac do not spin on a failed storage server recruitment 2018-10-02 17:31:07 -07:00
Evan Tschannen e64c55dce0 fix: data distribution would sometimes use the wrong priority when fixing an incomplete movement; this led to the cluster thinking the data was replicated in all regions before it actually was 2018-09-28 12:15:23 -07:00
Evan Tschannen 861c8aa675 consider server health when building subsets of emergency teams 2018-09-19 17:57:01 -07:00
Evan Tschannen 702d018882 fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set 2018-09-19 16:11:57 -07:00
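For context on the async-map pitfall described in the commit above, here is a minimal, self-contained C++ sketch (not FoundationDB's actual AsyncMap; all names are illustrative) showing why a count-based membership test is misleading once a waiter registers via onChange, and what checking for an actually-set value looks like instead:

```cpp
// Hypothetical, simplified illustration (not FoundationDB's AsyncMap): onChange()
// must create a map entry to park the waiter, so a count()-style check can return
// true for keys that were never set. Tracking "set" explicitly avoids the bug.
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <vector>

template <class K, class V>
class AsyncMapSketch {
    struct Entry {
        std::optional<V> value;                       // present only after set()
        std::vector<std::function<void(V)>> waiters;  // callbacks parked by onChange()
    };
    std::map<K, Entry> entries;

public:
    // Registering interest creates the entry even though no value exists yet.
    void onChange(const K& k, std::function<void(V)> cb) {
        entries[k].waiters.push_back(std::move(cb));
    }

    void set(const K& k, const V& v) {
        Entry& e = entries[k];
        e.value = v;
        for (auto& cb : e.waiters) cb(v);
        e.waiters.clear();
    }

    // Misleading membership test: true as soon as anyone waits on the key.
    bool countBased(const K& k) const { return entries.count(k) > 0; }

    // Correct membership test: only true once the key has actually been set.
    bool isSet(const K& k) const {
        auto it = entries.find(k);
        return it != entries.end() && it->second.value.has_value();
    }
};

int main() {
    AsyncMapSketch<int, int> serverHealth;
    serverHealth.onChange(7, [](int) { /* react to server 7 changing */ });

    assert(serverHealth.countBased(7)); // misleading: nothing was ever set
    assert(!serverHealth.isSet(7));     // what the fix needs to check instead

    serverHealth.set(7, 1);
    assert(serverHealth.isSet(7));
    return 0;
}
```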
Evan Tschannen 6d18193b3a fix: team->setHealthy was not being called correctly on initially unhealthy teams 2018-09-19 14:48:07 -07:00
Balachandar Namasivayam d622cb1f6e When the cluster is configured from a fearless setup to usable_regions=1, the master goes into a loop changing team priority. Fix this issue. 2018-09-12 18:29:49 -07:00
Evan Tschannen d9906d7d6a code cleanup 2018-09-05 13:42:10 -07:00
Evan Tschannen 65eabedb6c fix: addSubsetOfEmergencyTeams could add unhealthy teams
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen 72c86e909e fix: tracking of the number of unhealthy servers was incorrect
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen a694364a39 fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues 2018-08-21 21:08:15 -07:00
Evan Tschannen 883050d12f moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution 2018-08-13 22:29:55 -07:00
Evan Tschannen 2341e5d8ad fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task 2018-08-13 19:46:47 -07:00
Evan Tschannen 9c918a28f6 fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2 2018-08-09 13:16:09 -07:00
Evan Tschannen 6f02ea843a prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment 2018-08-09 12:37:46 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
Evan Tschannen d12dac60ec fix: the same team was being added multiple times to primaryTeams 2018-07-12 12:10:18 -07:00
Evan Tschannen 9edbb8d6dd fix: do not consider a storage server failed until the full failure reaction time has elapsed. This was being short-circuited when the endpoint was permanently failed (the storage server has been rebooted) 2018-07-11 15:45:32 -07:00
Evan Tschannen 380b2895f7 fix: we need to wait for the yield in the team tracker not just after the initial failure reaction delay, but also after zeroHealthyTeams changes 2018-07-08 17:44:19 -07:00
Evan Tschannen d6c6e7d306 fix: do not attempt data movement to an unhealthy destination team
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Stephen Atherton 5a84b5e1ef Renamed ShardInfo to avoid a name conflict which sometimes causes the wrong destructor to be used at link time. 2018-06-30 18:44:46 -07:00
Evan Tschannen 0bdd25df23 ratekeeper does not control on remote storage servers 2018-06-18 17:23:55 -07:00
Evan Tschannen 1ccfb3a0f4 fix: log_anti_quorum was always 0 in simulation
removed durableStorageQuorum, because it is no longer a useful configuration parameter
2018-06-18 10:24:57 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen e28769b98e fixed trace event name 2018-06-11 12:43:08 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen 134b5d6f65 fix: only consider data distribution started when remote has recovered so that quiet database works correctly 2018-06-10 20:25:15 -07:00
Evan Tschannen 4903df5ce9 fix: give time to detect failed servers before building teams 2018-06-10 20:21:39 -07:00
Evan Tschannen 6e48d93d39 backed out the healthy team check because it was unnecessary 2018-06-10 12:43:32 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
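The naming rules described in the commit above could be expressed roughly as the validators below. This is a hedged sketch, not FoundationDB's actual simulation-time check; the function names and sample values are assumptions.

```cpp
// Sketch of the two naming rules from the commit message, plus the stderr
// warning it describes. Not the real checker; names are illustrative only.
#include <cctype>
#include <cstdio>
#include <string>

// Detail names: start with an uppercase letter and contain no underscores.
bool isValidDetailName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    return name.find('_') == std::string::npos;
}

// Type names: same rule, except one underscore is allowed (Context_Type),
// and the character after the underscore must be uppercase.
bool isValidTypeName(const std::string& name) {
    if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
    size_t first = name.find('_');
    if (first == std::string::npos) return true;
    if (name.find('_', first + 1) != std::string::npos) return false;
    return first + 1 < name.size() && std::isupper(static_cast<unsigned char>(name[first + 1]));
}

int main() {
    const char* details[] = {"LatencySeconds", "latency_ms", "WrongSize"};
    for (const char* d : details)
        if (!isValidDetailName(d)) std::fprintf(stderr, "Bad detail name: %s\n", d);

    const char* types[] = {"MasterRecovery_Begin", "master_recovery_begin"};
    for (const char* t : types)
        if (!isValidTypeName(t)) std::fprintf(stderr, "Bad type name: %s\n", t);
    return 0;
}
```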
Evan Tschannen e4d5817679 fix: we must serve getTeam requests before readyToStart is set because we cannot complete relocateShard requests without getTeam responses from both team collections 2018-06-07 16:14:40 -07:00
Evan Tschannen 9f0c16f062 do not build teams which contain failed servers 2018-06-07 14:05:53 -07:00
Evan Tschannen b423d73b42 fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration 2018-06-07 12:29:25 -07:00
Evan Tschannen be06938d9d fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation. 2018-06-04 18:46:09 -07:00
Evan Tschannen 6cf9508aae finished a comment 2018-06-03 19:38:51 -07:00
Evan Tschannen b1935f1738 fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed between the known committed version and the recovery version, the storage server will need to see either the add or the remove when it peeks 2018-05-05 18:16:28 -07:00
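A rough illustration of the guard described in the commit above; the constant and function names are hypothetical, not FoundationDB's identifiers:

```cpp
// Hedged sketch under assumed names (MIN_VERSION_GAP, canRemoveStorageServer are
// illustrative): refuse to drop a storage server until enough versions have
// passed since it was added, per the "within 5 million versions" rule above.
#include <cassert>
#include <cstdint>

constexpr int64_t MIN_VERSION_GAP = 5'000'000;

// True only when the server has existed long enough (in versions) that a recovery
// spanning the known committed version through the recovery version cannot miss
// both the add and the remove.
bool canRemoveStorageServer(int64_t currentVersion, int64_t addedVersion) {
    return currentVersion - addedVersion >= MIN_VERSION_GAP;
}

int main() {
    assert(!canRemoveStorageServer(/*currentVersion=*/1'200'000, /*addedVersion=*/1'000'000));
    assert(canRemoveStorageServer(/*currentVersion=*/7'000'000, /*addedVersion=*/1'000'000));
    return 0;
}
```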
Evan Tschannen 440e2ae609 fix: data distribution logic was incorrect for finding a complete source team in a failed DC 2018-05-01 23:08:31 -07:00
Evan Tschannen 10d25927cd Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen 9cdabfed0e added useful trace events 2018-04-29 18:54:47 -07:00
Evan Tschannen 73597f190e fix: new tlogs are initialized with exactly the tags which existed at the recovery version 2018-04-22 20:28:01 -07:00
Bruce Mitchener 9cdf25eda3 Fix some typos. 2018-04-20 00:49:22 +07:00