foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	3fdf72c626	fix: we need to force recovery if the master is still attempting to read the txs tag	2018-09-28 13:33:33 -07:00
Evan Tschannen	22e6afbb18	fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing	2018-09-28 12:12:06 -07:00
Evan Tschannen	6b6d7a087d	The cluster controller should never consider itself as failed (that will be handled by the coordinators). Simplified the check that the cluster controller is excluded	2018-09-20 17:01:11 -07:00
Evan Tschannen	03728db99b	do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved	2018-09-19 18:28:24 -07:00
Evan Tschannen	df406a340e	Merge pull request #742 from ajbeamon/roles-in-trace-events Add the roles running on a process as a field on trace events in the …	2018-09-05 16:08:12 -07:00
A.J. Beamon	2de0b5d6d7	Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations.	2018-09-05 15:06:14 -07:00
Evan Tschannen	e60c668853	The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries	2018-08-31 10:51:55 -07:00
Evan Tschannen	e770629229	fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task.	2018-08-15 19:39:06 -07:00
Evan Tschannen	be1a4d74c7	tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time; increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget	2018-08-04 10:31:30 -07:00
Evan Tschannen	1c29275672	call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.	2018-08-01 14:30:57 -07:00
Evan Tschannen	30b2f85020	fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs	2018-07-14 16:26:45 -07:00
Evan Tschannen	28c0d96c90	fix: treat the local region as best when version difference is too large; re-check requests when the version difference becomes small	2018-07-06 14:44:11 -07:00
Evan Tschannen	21347df254	fix: getting metrics did not handle broken_promise errors	2018-07-05 12:30:11 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	e17dfea3b6	fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1. canKillProcess logic was wrong. We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.	2018-07-04 16:22:32 -04:00
Evan Tschannen	f2ec80f10d	added trace events for cluster controller changing datacenters	2018-07-02 13:06:54 -04:00
Evan Tschannen	334a433238	spend less time before using satellite fallback, because the database will be unavailable during this waiting time	2018-07-02 12:50:52 -04:00
Evan Tschannen	7a12d3e130	added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.	2018-07-01 09:39:04 -04:00
Evan Tschannen	7e68bee692	update better machine classes first to give them a higher chance of becoming the next cluster controller	2018-06-29 01:11:59 -07:00
Evan Tschannen	e9ac8a1039	when the cluster controller is changing itself to a better dc fitness, it should notify itself first so another process does not take over	2018-06-29 00:10:29 -07:00
Evan Tschannen	a288d5b9a9	added a fallback satellite configuration, so that we can use two satellites if available, but do not have to failover to the remote datacenter if one satellite is down	2018-06-28 23:15:32 -07:00
Evan Tschannen	58c2f67ff6	checking outstanding requests can be CPU intensive, so rate limit checking requests	2018-06-27 23:02:08 -07:00
Evan Tschannen	a5b4698bc8	do not wait for good recruitment delay if the cluster controller is in the second best region	2018-06-27 21:05:55 -07:00
Evan Tschannen	c6313a79e3	fix: the cluster controller needs to continue to retry recruitment until after wait_for_good_remote_recruitment_delay	2018-06-25 18:20:16 -07:00
Evan Tschannen	398497f5c3	fix: wrong desired count used when checking good remote fitness	2018-06-22 12:24:01 -07:00
Evan Tschannen	96b0a91ab2	simplified betterCount logic	2018-06-22 10:38:36 -07:00
Evan Tschannen	5fc8199abc	Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic. wait_for_good_recruitment now requires that you have the desired count of each role. Remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered	2018-06-22 10:15:24 -07:00
Evan Tschannen	8a8914f046	re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN	2018-06-22 00:04:00 -07:00
Evan Tschannen	9a91dad5bd	fixed compile issue	2018-06-21 16:34:36 -07:00
Evan Tschannen	678b4494f4	added logging for the datacenter version difference	2018-06-21 16:31:52 -07:00
Evan Tschannen	0913368651	added usable_regions to specify if we will replicate into a remote region; remote replication defaults to the primary replication; removed remote_logs, because they should be specified as an override in the regions object	2018-06-17 19:31:15 -07:00
Evan Tschannen	f694f7c9ca	removed hasBestPolicy	2018-06-15 12:36:19 -07:00
Evan Tschannen	0103b6f5ed	added datacenter_version_difference to status	2018-06-14 19:09:25 -07:00
Evan Tschannen	0c6825eb43	allow multiple regions with the same priority; configurations must have at least one region with non-negative priority	2018-06-14 12:59:55 -07:00
Evan Tschannen	26b7dd32da	fix: cluster controller did not respect usable dcs	2018-06-14 12:56:48 -07:00
Evan Tschannen	889889323e	The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there. The cluster controller will not switch to the other data center if remote logs are too far behind. We will not recruit in DCs with negative priority.	2018-06-13 18:14:14 -07:00
Alex Miller	fcfa00928b	Make RecoveryState an enum class. This means that all the == 7 or != 0 checks go away, and explicit names must be used.	2018-06-12 16:50:25 -07:00
A.J. Beamon	e5488419cc	Attempt to normalize trace events: * Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check. * Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase. * Use seconds instead of milliseconds in details. Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed. This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.	2018-06-08 11:11:08 -07:00
Evan Tschannen	c3f2e2bb38	fix: do not attempt to become the cluster controller before recovering files from disk	2018-05-01 12:05:43 -07:00
Evan Tschannen	d72087bfd3	fix: we may not be able to recruit enough log routers, in this case put multiple log routers on the same worker, but also properly rank this configuration lower in better master exists	2018-04-26 22:18:07 -07:00
Evan Tschannen	7af892f50b	first working version of non-copying recovery working with fearless configurations	2018-04-08 21:24:05 -07:00
Evan Tschannen	b36e08f08f	first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet	2018-03-29 15:12:38 -07:00
Evan Tschannen	65b532658f	added support for single region configurations	2018-03-15 10:59:30 -07:00
Evan Tschannen	3abf4d7fdf	Merge branch 'master' into feature-remote-logs	2018-03-09 14:50:04 -08:00
Evan Tschannen	91bb8faa45	Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa' # Conflicts: # tests/fast/SidebandWithStatus.txt # tests/rare/LargeApiCorrectnessStatus.txt # tests/slow/DDBalanceAndRemoveStatus.txt	2018-03-09 14:47:03 -08:00
Evan Tschannen	f9625f5b2f	fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates; fix: storage and log class machines wait 100MS before attempting to become the cluster controller	2018-03-08 18:08:41 -08:00
Evan Tschannen	8c88041608	fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations	2018-03-06 16:31:21 -08:00
Evan Tschannen	1194e3a361	added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed.	2018-03-05 19:27:46 -08:00
Evan Tschannen	470f5c01f3	changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases	2018-02-26 17:09:09 -08:00
Evan Tschannen	37a6a81634	Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs # Conflicts: # fdbserver/workloads/RestartRecovery.actor.cpp	2018-02-23 12:33:28 -08:00

110 Commits