foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	6254a1a8e4	fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop	2019-03-22 18:37:39 -07:00
Evan Tschannen	2605257737	Merge branch 'master' of github.com:apple/foundationdb	2019-03-19 18:47:29 -07:00
Evan Tschannen	5b9c45ea0b	clients do not attempt to connect to provisional proxies	2019-03-19 13:37:50 -07:00
Balachandar Namasivayam	5471725db5	Support config where the primary and remote DC's can be used as satellites.	2019-03-18 12:17:59 -07:00
Evan Tschannen	a7e45cff91	Merge pull request #1176 from jzhou77/ratekeeper Make Ratekeeper a separate role	2019-03-12 15:58:59 -07:00
Evan Tschannen	2627bcd35e	Merge branch 'master' into feature-metadata-version	2019-03-10 21:13:28 -07:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
Alex Miller	c6a65389ae	Remove noexcept macro and replace with BOOST_NOEXCEPT. BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a way that is correctly maintained over time.	2019-03-05 22:06:12 -08:00
anoyes	981426bac9	More ide fixes	2019-03-05 18:03:57 -08:00
Evan Tschannen	3da85f3acd	implemented the \xff/metadataVersion key, which can be used by layers to help them cheaply cache metadata and know when their cache is invalid	2019-02-28 17:45:00 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	0e19b5a935	fix: allow the txnStateStore to be recovered from a process in a down datacenter, so that the cluster controller can know to switch to the other region	2019-02-21 16:52:27 -08:00
Evan Tschannen	3a572b010f	fix: a forced recovery needed to force the data distributor to restart	2019-02-19 16:04:52 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
Evan Tschannen	8ed89fd711	fixed review comments	2019-02-19 11:26:53 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	ccaa860ffc	fix: all storage servers must reboot during a forced recovery, because their rejoin commit might have been lost	2019-02-18 15:27:18 -08:00
Evan Tschannen	9cfadad41b	fix: if the tagPartitionedLogSystem cannot do a forced recovery, the master should not execute its forced recovery based modifications either	2019-02-18 15:13:18 -08:00
Evan Tschannen	8f2af8bed1	fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe for the client to re-send forced recovery commands to the cluster controller. fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date. fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited	2019-02-18 14:54:28 -08:00
Evan Tschannen	4c35ebdcc6	fix: because of forced recoveries, storage servers in remote regions cannot update their durable version to (lastLogVersion - 5e6), because the lastLogVersion might have jumped due to an epoch end and the recovery version after the forced recovery could be before the epoch end, causing the storage server to want to rollback to a version it does not have on disk	2019-02-18 14:40:30 -08:00
Evan Tschannen	05ca0a10d8	fix: kill all storage servers which are not in the safe locality after a forced recovery	2019-02-18 14:30:51 -08:00
Jingyu Zhou	6a655143e8	A follow-on fix for config key usage And some trace event cleanups.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	aea602d9c7	Remove getRecoveryInfo from master interface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let cluster controller to start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Evan Tschannen	e45952bc53	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/BackupContainer.actor.cpp # fdbclient/BlobStore.actor.cpp # fdbclient/HTTP.actor.cpp # tests/BlobStore.txt # versions.target	2018-11-13 16:06:39 -08:00
Evan Tschannen	1bd615f954	fix: remoteDcIds will not actually have transaction logs unless usable regions is > 1	2018-11-13 12:36:04 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	7892da032f	fix: Do not remove the locality entry for the current transaction logs when removing storage servers fix: dcId_locality map could be incorrect after restarting recruitEverything	2018-11-11 12:37:53 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	6bb283aebc	fix: dcId to Locality changes could be lost if an emergency transaction happened that did not change the configuration fix: master proxy was starting dcId’s at 1 number too large	2018-11-05 11:12:43 -08:00
Evan Tschannen	87295cc263	suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created	2018-11-04 23:07:56 -08:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	3922e477a5	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/LogSystemDiskQueueAdapter.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp	2018-10-03 16:57:18 -07:00
Evan Tschannen	cdaf5e1192	fix: forced recovery does not recover tags from any DC besides the surviving one	2018-10-02 17:46:22 -07:00
Evan Tschannen	e7e1c634e0	fix: we need to restart the peek cursor when the known committed version becomes available	2018-10-02 17:44:14 -07:00
Evan Tschannen	05e7f08b26	added a peek method which will attempt to read the txsTag from the local region as much as possible	2018-09-28 12:21:08 -07:00
Evan Tschannen	200e65fe61	added a workload which tests killing an entire region, and recovering from the failure with data loss. fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore fix: we have to modify all of history, we cannot stop after finding a local remote	2018-09-17 18:32:39 -07:00
Evan Tschannen	90301f497f	Merge branch 'release-6.0' # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbrpc/TLSConnection.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/Status.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # versions.target	2018-09-05 16:06:33 -07:00
Evan Tschannen	90bf277206	require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed	2018-08-31 13:07:48 -07:00
A.J. Beamon	2a97139d5d	This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.	2018-08-16 10:24:12 -07:00
Alex Miller	fb31a6999f	Rewrite all files to have #include actorcompiler.h as the last include.	2018-08-14 15:50:26 -07:00
Alex Miller	535b5701e5	Rewrite all `Void _ = wait(...)` -> `wait(...)`. This takes advantage of the new actorcompiler functionality to avoid having duplicate definitions of `Void _` when trying to feed the un-actorcompiled source through clang.	2018-08-14 15:50:26 -07:00
Evan Tschannen	9d0a07a400	fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states	2018-08-04 12:50:56 -07:00
Evan Tschannen	30b2f85020	fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs	2018-07-14 16:26:45 -07:00
Evan Tschannen	b9f2b80129	deleted spammy trace event	2018-07-09 22:02:15 -07:00
Evan Tschannen	6b40f2764d	fix: off by one error on popping missing tags	2018-07-09 15:43:22 -07:00
Evan Tschannen	da5a232d7e	fix: If we have not recruited the remote logs yet and detect a configuration change, we must fail the master to update the remote recruitment request	2018-07-05 12:17:41 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	866ccfe344	added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc.	2018-07-04 01:59:04 -04:00

111 Commits