Evan Tschannen
6254a1a8e4
fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop
2019-03-22 18:37:39 -07:00
Evan Tschannen
2605257737
Merge branch 'master' of github.com:apple/foundationdb
2019-03-19 18:47:29 -07:00
Evan Tschannen
5b9c45ea0b
clients do not attempt to connect to provisional proxies
2019-03-19 13:37:50 -07:00
Balachandar Namasivayam
5471725db5
Support config where the primary and remote DCs can be used as satellites.
2019-03-18 12:17:59 -07:00
Evan Tschannen
a7e45cff91
Merge pull request #1176 from jzhou77/ratekeeper
...
Make Ratekeeper a separate role
2019-03-12 15:58:59 -07:00
Evan Tschannen
2627bcd35e
Merge branch 'master' into feature-metadata-version
2019-03-10 21:13:28 -07:00
Jingyu Zhou
3c86643822
Separate Ratekeeper from data distribution.
...
Add a new role for ratekeeper.
Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Alex Miller
c6a65389ae
Remove noexcept macro and replace with BOOST_NOEXCEPT.
...
BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a
way that is correctly maintained over time.
2019-03-05 22:06:12 -08:00
anoyes
981426bac9
More ide fixes
2019-03-05 18:03:57 -08:00
Evan Tschannen
3da85f3acd
implemented the \xff/metadataVersion key, which can be used by layers to help them cheaply cache metadata and know when their cache is invalid
2019-02-28 17:45:00 -08:00
Evan Tschannen
b8910ba7cd
Merge branch 'master' into feature-fix-force-recovery
...
# Conflicts:
# fdbclient/ManagementAPI.actor.h
# fdbserver/DataDistribution.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Evan Tschannen
0e19b5a935
fix: allow the txnStateStore to be recovered from a process in a down datacenter, so that the cluster controller can know to switch to the other region
2019-02-21 16:52:27 -08:00
Evan Tschannen
3a572b010f
fix: a forced recovery needed to force the data distributor to restart
2019-02-19 16:04:52 -08:00
mpilman
3f0fd2a20c
Use fwd decls in WorkerInterface
...
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
0bb60e5a3b
Use proper fwd decl in NativeAPI
...
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
Evan Tschannen
8ed89fd711
fixed review comments
2019-02-19 11:26:53 -08:00
Evan Tschannen
065a45e05f
Merge branch 'master' into feature-fix-force-recovery
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen
ccaa860ffc
fix: all storage servers must reboot during a forced recovery, because their rejoin commit might have been lost
2019-02-18 15:27:18 -08:00
Evan Tschannen
9cfadad41b
fix: if the tagPartitionedLogSystem cannot do a forced recovery, the master should not execute its forced-recovery-based modifications either
2019-02-18 15:13:18 -08:00
Evan Tschannen
8f2af8bed1
fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe for the client to re-send forced recovery commands to the cluster controller.
...
fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date.
fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited
2019-02-18 14:54:28 -08:00
Evan Tschannen
4c35ebdcc6
fix: because of forced recoveries, storage servers in remote regions cannot update their durable version to (lastLogVersion - 5e6), because the lastLogVersion might have jumped due to an epoch end and the recovery version after the forced recovery could be before the epoch end, causing the storage server to want to rollback to a version it does not have on disk
2019-02-18 14:40:30 -08:00
Evan Tschannen
05ca0a10d8
fix: kill all storage servers which are not in the safe locality after a forced recovery
2019-02-18 14:30:51 -08:00
Jingyu Zhou
6a655143e8
A follow-on fix for config key usage
...
And some trace event cleanups.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
aea602d9c7
Remove getRecoveryInfo from master interface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
886e7ab2ba
Add a new DataDistributor role.
...
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it failed.
The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for
the new DD.
Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
2019-02-14 16:30:13 -08:00
Evan Tschannen
e45952bc53
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/BackupContainer.actor.cpp
# fdbclient/BlobStore.actor.cpp
# fdbclient/HTTP.actor.cpp
# tests/BlobStore.txt
# versions.target
2018-11-13 16:06:39 -08:00
Evan Tschannen
1bd615f954
fix: remoteDcIds will not actually have transaction logs unless usable regions is > 1
2018-11-13 12:36:04 -08:00
Evan Tschannen
4e54690005
Merge branch 'release-6.0'
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen
7892da032f
fix: Do not remove the locality entry for the current transaction logs when removing storage servers
...
fix: dcId_locality map could be incorrect after restarting recruitEverything
2018-11-11 12:37:53 -08:00
Evan Tschannen
4b5d0b4e2c
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/AsyncFileBlobStore.actor.cpp
# fdbclient/AsyncFileBlobStore.actor.h
# fdbclient/BlobStore.actor.cpp
# fdbclient/BlobStore.h
# fdbclient/HTTP.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/LoadBalance.actor.h
# fdbrpc/batcher.actor.h
# fdbrpc/fdbrpc.vcxproj
# fdbrpc/sim2.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
6bb283aebc
fix: dcId to Locality changes could be lost if an emergency transaction happened that did not change the configuration
...
fix: master proxy was starting dcId’s at 1 number too large
2018-11-05 11:12:43 -08:00
Evan Tschannen
87295cc263
suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created
2018-11-04 23:07:56 -08:00
Robert Escriva
268093a96d
Adjust all includes to be relative to the root.
...
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
3922e477a5
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/LogSystemDiskQueueAdapter.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
cdaf5e1192
fix: forced recovery does not recover tags from any DC besides the surviving one
2018-10-02 17:46:22 -07:00
Evan Tschannen
e7e1c634e0
fix: we need to restart the peek cursor when the known committed version becomes available
2018-10-02 17:44:14 -07:00
Evan Tschannen
05e7f08b26
added a peek method which will attempt to read the txsTag from the local region as much as possible
2018-09-28 12:21:08 -07:00
Evan Tschannen
200e65fe61
added a workload which tests killing an entire region, and recovering from the failure with data loss.
...
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Evan Tschannen
90301f497f
Merge branch 'release-6.0'
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbrpc/TLSConnection.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/Status.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/StatusWorkload.actor.cpp
# versions.target
2018-09-05 16:06:33 -07:00
Evan Tschannen
90bf277206
require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed
2018-08-31 13:07:48 -07:00
A.J. Beamon
2a97139d5d
This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.
2018-08-16 10:24:12 -07:00
Alex Miller
fb31a6999f
Rewrite all files to have #include actorcompiler.h as the last include.
2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5
Rewrite all `Void _ = wait(...)` -> `wait(...)`.
...
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen
9d0a07a400
fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states
2018-08-04 12:50:56 -07:00
Evan Tschannen
30b2f85020
fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs
2018-07-14 16:26:45 -07:00
Evan Tschannen
b9f2b80129
deleted spammy trace event
2018-07-09 22:02:15 -07:00
Evan Tschannen
6b40f2764d
fix: off by one error on popping missing tags
2018-07-09 15:43:22 -07:00
Evan Tschannen
da5a232d7e
fix: If we have not recruited the remote logs yet and detect a configuration change, we must fail the master to update the remote recruitment request
2018-07-05 12:17:41 -07:00
Evan Tschannen
507b3bacb0
fix: killing all tlogs in one region prevents the remote logs from recovering in that region; do not allow that to prevent us from configuring usable_regions=1.
...
added more recovery states.
2018-07-05 00:08:51 -07:00
Evan Tschannen
866ccfe344
added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc.
2018-07-04 01:59:04 -04:00