Evan Tschannen
1314bcec9e
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2018-10-05 12:54:00 -07:00
Evan Tschannen
06be70bace
fix: if localEnd is smaller than begin, we cannot peek from the local dc
2018-10-05 12:36:34 -07:00
Evan Tschannen
3922e477a5
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/LogSystemDiskQueueAdapter.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
aa51d69b2d
fix: set peekLocality for upgraded tags
2018-10-03 13:54:59 -07:00
Evan Tschannen
69711a107b
fix: because of forced recovery, having 0 log router tags does not mean we are a special tlog set
2018-10-02 17:45:11 -07:00
Evan Tschannen
e7e1c634e0
fix: we need to restart the peek cursor when the known committed version becomes available
2018-10-02 17:44:14 -07:00
Evan Tschannen
59335aa757
fix: the latest generation of remote transaction logs might have less data than a previous generation, because they take over at known committed version. Detect this case and end at the version that has the most data
2018-09-28 12:25:27 -07:00
Evan Tschannen
c577840020
fix: forced recovery should remove all references to the old primary tlogs in all generations of logs to help the peek logic avoid attempting to read from them
2018-09-28 12:23:09 -07:00
Evan Tschannen
05e7f08b26
added a peek method which will attempt to read the txsTag from the local region as much as possible
2018-09-28 12:21:08 -07:00
Evan Tschannen
200e65fe61
added a workload which tests killing an entire region, and recovering from the failure with data loss.
...
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Alex Miller
fb31a6999f
Rewrite all files to have #include actorcompiler.h as the last include.
2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5
Rewrite all `Void _ = wait(...)` -> `wait(...)`.
...
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen
1c29275672
call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.
2018-08-01 14:30:57 -07:00
Evan Tschannen
30b2f85020
fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs
2018-07-14 16:26:45 -07:00
Evan Tschannen
cd63c7a7cc
added a buffered cursor, which efficiently merges lots of peek cursors
2018-07-12 12:09:48 -07:00
Evan Tschannen
c148c865e3
optimized log peek cursors to use much less CPU when using the policy engine
2018-07-11 15:43:55 -07:00
Evan Tschannen
f0494f18b1
added a trace event for forced recovery
2018-07-06 17:09:29 -07:00
Evan Tschannen
43b5cb28ba
fix: properly handle zero logRouterTags, this is important for forced recovery
2018-07-06 16:52:25 -07:00
Evan Tschannen
866ccfe344
added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc.
2018-07-04 01:59:04 -04:00
Evan Tschannen
c69d6166e3
another attempt at forced recovery
2018-07-03 13:42:58 -04:00
Evan Tschannen
57a8c6862e
fix: force recovery did not work if the latest log set did not recover th
2018-07-02 23:48:22 -04:00
Evan Tschannen
9eb8dc3a59
fix: previous attempt at force recovery did not work because we need to treat the remote logs as local for peeking
2018-07-02 22:35:18 -04:00
Evan Tschannen
7a12d3e130
added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.
2018-07-01 09:39:04 -04:00
Evan Tschannen
a288d5b9a9
added a fallback satellite configuration, so that we can use two satellites if available, but do not have to failover to the remote datacenter if one satellite is down
2018-06-28 23:15:32 -07:00
Evan Tschannen
00167b0157
renamed some uses of knownCommittedVersion to durableKnownCommittedVersion
...
epochEnd exclusively refers to the last version a set of logs is responsible for serving peek requests for
recoverAt and recoveredAt refer to the last committed version of the previous generation
2018-06-26 18:20:28 -07:00
Evan Tschannen
8a8914f046
re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN
2018-06-22 00:04:00 -07:00
Evan Tschannen
68ac3bdc4c
log routers now calculate a precise version to pop for their log router tag
2018-06-21 15:29:46 -07:00
Evan Tschannen
e7999e7a3e
log routers need to use parallelGetMore when peeking because the latency to the primary datacenter makes the bandwidth of normal peeking too low.
2018-06-19 22:16:45 -07:00
Evan Tschannen
50e1e03130
fix: for configurations with anti-quorums to work, the push actors need to be put in the proxy’s actor collection
2018-06-18 15:25:54 -07:00
Evan Tschannen
0913368651
added usable_regions to specify if we will replicate into a remote region
...
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen
f637c680f1
fix: populateSatelliteTagLocations was broken
...
fix: satellites do not index the upgraded locality
2018-06-17 13:29:17 -07:00
Evan Tschannen
6931a00993
satellite log push locations are static per tag, which will reduce the number of tags each satellite log has to index, and reduce the proxy cpu when calculating push locations
2018-06-16 17:39:02 -07:00
Evan Tschannen
f694f7c9ca
removed hasBestPolicy
2018-06-15 12:36:19 -07:00
Evan Tschannen
0d87186821
use a specific locality for satellites
2018-06-15 11:06:38 -07:00
Evan Tschannen
1796e00149
do not pop tags from logs that are not indexing that tag
2018-06-14 12:55:33 -07:00
Evan Tschannen
889889323e
The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
...
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Evan Tschannen
8dfda1e57b
fixed another trace event
2018-06-11 12:53:07 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
b60264024a
fix: we need to copy the txsTag on satellite logs
2018-06-10 20:30:44 -07:00
Evan Tschannen
8a24bf6124
describe did not list all the log sets
2018-06-10 12:38:50 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
...
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen
c519339adb
avoid peeking from logs that do not match the tag’s locality
2018-06-01 18:42:48 -07:00
Evan Tschannen
81c7bddaf8
fix: must check for log router errors while waiting on satellite replies because the recruitmentID will not be updated if it threw an error
2018-05-06 18:15:12 -07:00
Evan Tschannen
8371afb565
fix: log routers need to know if the log system is stopped to determine how they should peek the last log generation
2018-05-05 17:56:00 -07:00
Evan Tschannen
e8ea02e054
fix: storage servers need to fail if they can no longer peek data
2018-05-05 17:19:59 -07:00
Evan Tschannen
e1e43cff28
endEpoch implemented using getDurableVersion
2018-04-30 18:32:04 -07:00
Evan Tschannen
5143871fed
passed debug ids into all versions of peek() to assist debugging
2018-04-30 13:36:35 -07:00
Evan Tschannen
9cdabfed0e
added useful trace events
2018-04-29 18:54:47 -07:00
Evan Tschannen
2e286b768d
fix: locality is needed for a logSet to call getPushLocations
...
fix: accidentally deleted allowPops assignment on the log router
2018-04-29 13:47:32 -07:00
Evan Tschannen
dbdeeaa5cf
fix: log routers are given all the information they need to add remote tags in their initialization request
2018-04-28 18:04:57 -07:00