137 Commits

Author SHA1 Message Date
Evan Tschannen 1314bcec9e Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2018-10-05 12:54:00 -07:00
Evan Tschannen 06be70bace fix: if localEnd is smaller than begin, we cannot peek from the local dc 2018-10-05 12:36:34 -07:00
Evan Tschannen 3922e477a5 Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/LogSystemDiskQueueAdapter.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen aa51d69b2d fix: set peekLocality for upgraded tags 2018-10-03 13:54:59 -07:00
Evan Tschannen 69711a107b fix: because of forced recovery, 0 log router tags does not mean we are a special tlog set 2018-10-02 17:45:11 -07:00
Evan Tschannen e7e1c634e0 fix: we need to restart the peek cursor when the known committed version becomes available 2018-10-02 17:44:14 -07:00
Evan Tschannen 59335aa757 fix: the latest generation of remote transaction logs might have less data than a previous generation, because they take over at the known committed version. Detect this case and end at the version that has the most data 2018-09-28 12:25:27 -07:00
Evan Tschannen c577840020 fix: forced recovery should remove all references to the old primary tlogs in all generations of logs to help the peek logic avoid attempting to read from them 2018-09-28 12:23:09 -07:00
Evan Tschannen 05e7f08b26 added a peek method which will attempt to read the txsTag from the local region as much as possible 2018-09-28 12:21:08 -07:00
Evan Tschannen 200e65fe61 added a workload which tests killing an entire region, and recovering from the failure with data loss.
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Alex Miller fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller 535b5701e5 Rewrite all `Void _ = wait(...)` -> `wait(...)`.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen 1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen 30b2f85020 fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs 2018-07-14 16:26:45 -07:00
Evan Tschannen cd63c7a7cc added a buffered cursor, which efficiently merges lots of peek cursors 2018-07-12 12:09:48 -07:00
Evan Tschannen c148c865e3 optimized log peek cursors to use much less CPU when using the policy engine 2018-07-11 15:43:55 -07:00
Evan Tschannen f0494f18b1 added a trace event for forced recovery 2018-07-06 17:09:29 -07:00
Evan Tschannen 43b5cb28ba fix: properly handle zero logRouterTags, this is important for forced recovery 2018-07-06 16:52:25 -07:00
Evan Tschannen 866ccfe344 added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc. 2018-07-04 01:59:04 -04:00
Evan Tschannen c69d6166e3 another attempt at forced recovery 2018-07-03 13:42:58 -04:00
Evan Tschannen 57a8c6862e fix: force recovery did not work if the latest log set did not recover th 2018-07-02 23:48:22 -04:00
Evan Tschannen 9eb8dc3a59 fix: previous attempt at force recovery did not work because we need to treat the remote logs as local for peeking 2018-07-02 22:35:18 -04:00
Evan Tschannen 7a12d3e130 added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive. 2018-07-01 09:39:04 -04:00
Evan Tschannen a288d5b9a9 added a fallback satellite configuration, so that we can use two satellites if available, but do not have to failover to the remote datacenter if one satellite is down 2018-06-28 23:15:32 -07:00
Evan Tschannen 00167b0157 renamed some uses of knownCommittedVersion to durableKnownCommittedVersion
epochEnd exclusively refers to the last version a set of logs is responsible for serving peek requests for
recoverAt and recoveredAt refer to the last committed version of the previous generation
2018-06-26 18:20:28 -07:00
Evan Tschannen 8a8914f046 re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN 2018-06-22 00:04:00 -07:00
Evan Tschannen 68ac3bdc4c log routers now calculate a precise version to pop for their log router tag 2018-06-21 15:29:46 -07:00
Evan Tschannen e7999e7a3e log routers need to use parallelGetMore when peeking because the latency to the primary datacenter makes the bandwidth of normal peeking too low. 2018-06-19 22:16:45 -07:00
Evan Tschannen 50e1e03130 fix: for configurations with anti-quorums to work, the push actors need to be put in the proxy’s actor collection 2018-06-18 15:25:54 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen f637c680f1 fix: populateSatelliteTagLocations was broken
fix: satellites do not index the upgraded locality
2018-06-17 13:29:17 -07:00
Evan Tschannen 6931a00993 satellite log push locations are static per tag, which will reduce the number of tags each satellite log has to index, and reduce the proxy cpu when calculating push locations 2018-06-16 17:39:02 -07:00
Evan Tschannen f694f7c9ca removed hasBestPolicy 2018-06-15 12:36:19 -07:00
Evan Tschannen 0d87186821 use a specific locality for satellites 2018-06-15 11:06:38 -07:00
Evan Tschannen 1796e00149 do not pop tags from logs that are not indexing that tag 2018-06-14 12:55:33 -07:00
Evan Tschannen 889889323e The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Evan Tschannen 8dfda1e57b fixed another trace event 2018-06-11 12:53:07 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen b60264024a fix: we need to copy the txsTag on satellite logs 2018-06-10 20:30:44 -07:00
Evan Tschannen 8a24bf6124 describe did not list all the log sets 2018-06-10 12:38:50 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen c519339adb avoid peeking from logs that do not match the tag’s locality 2018-06-01 18:42:48 -07:00
Evan Tschannen 81c7bddaf8 fix: must check for log router errors while waiting on satellite replies because the recruitmentID will not be updated if it threw an error 2018-05-06 18:15:12 -07:00
Evan Tschannen 8371afb565 fix: log routers need to know if the log system is stopped to determine how they should peek the last log generation 2018-05-05 17:56:00 -07:00
Evan Tschannen e8ea02e054 fix: storage servers need to fail if they can no longer peek data 2018-05-05 17:19:59 -07:00
Evan Tschannen e1e43cff28 endEpoch implemented using getDurableVersion 2018-04-30 18:32:04 -07:00
Evan Tschannen 5143871fed passed debug ids into all versions of peek() to assist debugging 2018-04-30 13:36:35 -07:00
Evan Tschannen 9cdabfed0e added useful trace events 2018-04-29 18:54:47 -07:00
Evan Tschannen 2e286b768d fix: locality is needed for a logSet to call getPushLocations
fix: accidentally deleted allowPops assignment on the log router
2018-04-29 13:47:32 -07:00
Evan Tschannen dbdeeaa5cf fix: log routers are given all the information they need to add remote tags in their initialization request 2018-04-28 18:04:57 -07:00