Evan Tschannen
3fdf72c626
fix: we need to force recovery if the master is still attempting to read the txs tag
2018-09-28 13:33:33 -07:00
Evan Tschannen
22e6afbb18
fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing
2018-09-28 12:12:06 -07:00
Evan Tschannen
6b6d7a087d
The cluster controller should never consider itself as failed (that will be handled by the coordinators)
...
Simplified the check that the cluster controller is excluded
2018-09-20 17:01:11 -07:00
Evan Tschannen
03728db99b
do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved
2018-09-19 18:28:24 -07:00
Evan Tschannen
df406a340e
Merge pull request #742 from ajbeamon/roles-in-trace-events
...
Add the roles running on a process as a field on trace events in the …
2018-09-05 16:08:12 -07:00
A.J. Beamon
2de0b5d6d7
Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations.
2018-09-05 15:06:14 -07:00
Evan Tschannen
e60c668853
The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries
2018-08-31 10:51:55 -07:00
Evan Tschannen
e770629229
fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task.
2018-08-15 19:39:06 -07:00
Evan Tschannen
be1a4d74c7
tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
...
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Evan Tschannen
1c29275672
call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.
2018-08-01 14:30:57 -07:00
Evan Tschannen
30b2f85020
fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you will no remaining logs
2018-07-14 16:26:45 -07:00
Evan Tschannen
28c0d96c90
fix: treat the local region as best when version difference is too large
...
re-check requests when the version difference becomes small
2018-07-06 14:44:11 -07:00
Evan Tschannen
21347df254
fix: getting metrics did not handle broken_promise errors
2018-07-05 12:30:11 -07:00
Evan Tschannen
507b3bacb0
fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1.
...
added more recovery states.
2018-07-05 00:08:51 -07:00
Evan Tschannen
e17dfea3b6
fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1.
...
canKillProcess logic was wrong.
We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.
2018-07-04 16:22:32 -04:00
Evan Tschannen
f2ec80f10d
added trace events for cluster controller changing datacenters
2018-07-02 13:06:54 -04:00
Evan Tschannen
334a433238
spend less time before using satellite fallback, because the database will be unavailable during this waiting time
2018-07-02 12:50:52 -04:00
Evan Tschannen
7a12d3e130
added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.
2018-07-01 09:39:04 -04:00
Evan Tschannen
7e68bee692
update better machine classes first to give them a higher chance of becoming the next cluster controller
2018-06-29 01:11:59 -07:00
Evan Tschannen
e9ac8a1039
when the cluster controller is changing itself to a better dc fitness, it should notify itself first so another process does not take over
2018-06-29 00:10:29 -07:00
Evan Tschannen
a288d5b9a9
added a fallback satellite configuration, so that we can use two satellites if available, but do not have to failover to the remote datacenter if one satellite is down
2018-06-28 23:15:32 -07:00
Evan Tschannen
58c2f67ff6
checking outstanding requests can be CPU intensive, so rate limit checking requests
2018-06-27 23:02:08 -07:00
Evan Tschannen
a5b4698bc8
do not wait for good recruitment delay if the cluster controller is in the second best region
2018-06-27 21:05:55 -07:00
Evan Tschannen
c6313a79e3
fix: the cluster controller needs to continue to retry recruitment until after wait_for_good_remote_recruitment_delay
2018-06-25 18:20:16 -07:00
Evan Tschannen
398497f5c3
fix: wrong desired count used when checking good remote fitness
2018-06-22 12:24:01 -07:00
Evan Tschannen
96b0a91ab2
simplified betterCount logic
2018-06-22 10:38:36 -07:00
Evan Tschannen
5fc8199abc
Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic
...
wait_for_good_recruitment now requires that you have the desired count of each roll
remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered
2018-06-22 10:15:24 -07:00
Evan Tschannen
8a8914f046
re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN
2018-06-22 00:04:00 -07:00
Evan Tschannen
9a91dad5bd
fixed compile issue
2018-06-21 16:34:36 -07:00
Evan Tschannen
678b4494f4
added logging for the datacenter version difference
2018-06-21 16:31:52 -07:00
Evan Tschannen
0913368651
added usable_regions to specify if we will replicate into a remote region
...
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen
f694f7c9ca
removed hasBestPolicy
2018-06-15 12:36:19 -07:00
Evan Tschannen
0103b6f5ed
added datacenter_version_difference to status
2018-06-14 19:09:25 -07:00
Evan Tschannen
0c6825eb43
allow multiple regions with the same priority
...
configurations must have at least one region with non-negative priority
2018-06-14 12:59:55 -07:00
Evan Tschannen
26b7dd32da
fix: cluster controller did not respect usable dcs
2018-06-14 12:56:48 -07:00
Evan Tschannen
889889323e
The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
...
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Alex Miller
fcfa00928b
Make RecoveryState an enum class.
...
This means that all the == 7 or != 0 checks go away, and explicit names must be used.
2018-06-12 16:50:25 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
...
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen
c3f2e2bb38
fix: do not attempt to become the cluster controller before recovering files from disk
2018-05-01 12:05:43 -07:00
Evan Tschannen
d72087bfd3
fix: we may not be able to recruit enough log routers, in this case put multiple log routers on the same worker, but also properly rank this configuration lower in better master exists
2018-04-26 22:18:07 -07:00
Evan Tschannen
7af892f50b
first working version of non-copying recovery working with fearless configurations
2018-04-08 21:24:05 -07:00
Evan Tschannen
b36e08f08f
first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet
2018-03-29 15:12:38 -07:00
Evan Tschannen
65b532658f
added support for single region configurations
2018-03-15 10:59:30 -07:00
Evan Tschannen
3abf4d7fdf
Merge branch 'master' into feature-remote-logs
2018-03-09 14:50:04 -08:00
Evan Tschannen
91bb8faa45
Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen
f9625f5b2f
fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
...
fix: storage and log class machines wait 100MS before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Evan Tschannen
8c88041608
fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations
2018-03-06 16:31:21 -08:00
Evan Tschannen
1194e3a361
added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed.
2018-03-05 19:27:46 -08:00
Evan Tschannen
470f5c01f3
changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases
2018-02-26 17:09:09 -08:00
Evan Tschannen
37a6a81634
Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
...
# Conflicts:
# fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00