foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let the cluster controller start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId. Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Evan Tschannen	e45952bc53	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/BackupContainer.actor.cpp # fdbclient/BlobStore.actor.cpp # fdbclient/HTTP.actor.cpp # tests/BlobStore.txt # versions.target	2018-11-13 16:06:39 -08:00
Evan Tschannen	1bd615f954	fix: remoteDcIds will not actually have transaction logs unless usable regions is > 1	2018-11-13 12:36:04 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	7892da032f	fix: Do not remove the locality entry for the current transaction logs when removing storage servers fix: dcId_locality map could be incorrect after restarting recruitEverything	2018-11-11 12:37:53 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	6bb283aebc	fix: dcId to Locality changes could be lost if an emergency transaction happened that did not change the configuration fix: master proxy was starting dcId’s at 1 number too large	2018-11-05 11:12:43 -08:00
Evan Tschannen	87295cc263	suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created	2018-11-04 23:07:56 -08:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	3922e477a5	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/LogSystemDiskQueueAdapter.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp	2018-10-03 16:57:18 -07:00
Evan Tschannen	cdaf5e1192	fix: forced recovery does not recover tags from any DC besides the surviving one	2018-10-02 17:46:22 -07:00
Evan Tschannen	e7e1c634e0	fix: we need to restart the peek cursor when the known committed version becomes available	2018-10-02 17:44:14 -07:00
Evan Tschannen	05e7f08b26	added a peek method which will attempt to read the txsTag from the local region as much as possible	2018-09-28 12:21:08 -07:00
Evan Tschannen	200e65fe61	added a workload which tests killing an entire region, and recovering from the failure with data loss. fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore fix: we have to modify all of history, we cannot stop after finding a local remote	2018-09-17 18:32:39 -07:00
Evan Tschannen	90301f497f	Merge branch 'release-6.0' # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbrpc/TLSConnection.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/Status.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # versions.target	2018-09-05 16:06:33 -07:00
Evan Tschannen	90bf277206	require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed	2018-08-31 13:07:48 -07:00
A.J. Beamon	2a97139d5d	This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.	2018-08-16 10:24:12 -07:00
Alex Miller	fb31a6999f	Rewrite all files to have #include actorcompiler.h as the last include.	2018-08-14 15:50:26 -07:00
Alex Miller	535b5701e5	Rewrite all `Void _ = wait(...)` -> `wait(...)`. This takes advantage of the new actorcompiler functionality to avoid having duplicate definitions of `Void _` when trying to feed the un-actorcompiled source through clang.	2018-08-14 15:50:26 -07:00
Evan Tschannen	9d0a07a400	fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states	2018-08-04 12:50:56 -07:00
Evan Tschannen	30b2f85020	fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs	2018-07-14 16:26:45 -07:00
Evan Tschannen	b9f2b80129	deleted spammy trace event	2018-07-09 22:02:15 -07:00
Evan Tschannen	6b40f2764d	fix: off by one error on popping missing tags	2018-07-09 15:43:22 -07:00
Evan Tschannen	da5a232d7e	fix: If we have not recruited the remote logs yet and detect a configuration change, we must fail the master to update the remote recruitment request	2018-07-05 12:17:41 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	866ccfe344	added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc.	2018-07-04 01:59:04 -04:00
Evan Tschannen	3c9f3da980	fix: usable regions cannot be changed during an emergency transaction, because it could lead to all storage servers dying if the previous primary is dead	2018-07-01 23:59:06 -04:00
Evan Tschannen	7a12d3e130	added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.	2018-07-01 09:39:04 -04:00
Evan Tschannen	8a8914f046	re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN	2018-06-22 00:04:00 -07:00
Evan Tschannen	0913368651	added usable_regions to specify if we will replicate into a remote region remote replication defaults to the primary replication removed remote_logs, because they should be specified as an override in the regions object	2018-06-17 19:31:15 -07:00
Evan Tschannen	284233baa1	added a key in the database with the locality of the current master	2018-06-14 19:36:02 -07:00
Evan Tschannen	fbb3f85c74	fix: logsKey was not being updated properly	2018-06-14 12:54:39 -07:00
Evan Tschannen	889889323e	The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there. The cluster controller will not switch to the other data center if remote logs are too far behind. We will not recruit in DCs with negative priority.	2018-06-13 18:14:14 -07:00
Alex Miller	fcfa00928b	Make RecoveryState an enum class. This means that all the == 7 or != 0 checks go away, and explicit names must be used.	2018-06-12 16:50:25 -07:00
A.J. Beamon	e5488419cc	Attempt to normalize trace events: * Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check. * Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase. * Use seconds instead of milliseconds in details. Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed. This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.	2018-06-08 11:11:08 -07:00
Evan Tschannen	b1935f1738	fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed within the known committed version and recovery version, the storage server will need to see either the add or remove when it peeks	2018-05-05 18:16:28 -07:00
Evan Tschannen	35b2ca820a	fix: certain tlog errors during remote recovery could fail to kill the master, the master could have a reference counting cycle with its actor collection	2018-04-24 16:10:14 -07:00
Evan Tschannen	73597f190e	fix: new tlogs are initialized with exactly the tags which existed at the recovery version	2018-04-22 20:28:01 -07:00
Evan Tschannen	3018a7b1b3	fix: the known committed version of a newly initialized log is 1, since by definition the first commit must have succeeded	2018-04-16 10:42:48 -07:00
Evan Tschannen	a8662f8737	fix: remote recovered does not need to wait for old logs to be removed	2018-04-16 10:14:39 -07:00
Evan Tschannen	3453a51d0f	remoteRecovery was still swallowing errors	2018-04-10 13:31:24 -07:00
Evan Tschannen	5fcedd2e98	fix: coordinated state errors were being eaten	2018-04-10 11:14:57 -07:00
Evan Tschannen	7af892f50b	first working version of non-copying recovery working with fearless configurations	2018-04-08 21:24:05 -07:00
Evan Tschannen	b36e08f08f	first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet	2018-03-29 15:12:38 -07:00
Evan Tschannen	65b532658f	added support for single region configurations	2018-03-15 10:59:30 -07:00
Evan Tschannen	8c88041608	fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations	2018-03-06 16:31:21 -08:00
Evan Tschannen	1194e3a361	added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed.	2018-03-05 19:27:46 -08:00
Evan Tschannen	470f5c01f3	changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases	2018-02-26 17:09:09 -08:00
Evan Tschannen	37a6a81634	Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs # Conflicts: # fdbserver/workloads/RestartRecovery.actor.cpp	2018-02-23 12:33:28 -08:00
Alec Grieser	0bae9880f1	remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py	2018-02-21 10:25:11 -08:00

87 Commits