foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	bb5799bd20	Merge pull request #2642 from xumengpanda/mengxu/new-backup-format-PR FastRestore:Integrate with new backup format	2020-03-25 15:47:55 -07:00
Jingyu Zhou	e2f317a0da	Fix a crash failure	2020-03-25 09:18:49 -07:00
Jingyu Zhou	243d078596	Fix off by one error Epoch end version is saved version + 1, so need +1 for minBackupVersion.	2020-03-23 20:44:31 -07:00
Jingyu Zhou	90b40e1d75	Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak Resolve Conflicts: fdbclient/BackupAgent.actor.h fdbserver/BackupWorker.actor.cpp fdbserver/RestoreMaster.actor.cpp fdbserver/masterserver.actor.cpp	2020-03-23 13:35:33 -07:00
Meng Xu	be67ab4d6a	Correct comment based on review	2020-03-23 12:53:40 -07:00
Andrew Noyes	fa8eaf9810	Assert recoverAndEndEpoch does not become ready	2020-03-23 12:40:00 -07:00
Meng Xu	3f31ebf659	New backup:Revise event name and explain code	2020-03-23 10:55:44 -07:00
Jingyu Zhou	97702d91c8	Skip recruiting backup workers for older epochs before min backup version When master starts recruiting backup workers, if there is no active backup job or the min version of the backup job is greater than old epoch's end version, then these old epochs can be skipped.	2020-03-21 13:44:02 -07:00
Jingyu Zhou	818072f3cb	Set oldest backup epoch if not recruiting backup workers Since tlog is not kept until backup worker has pulled mutations from it, the old tlogs can only be displaced after oldest backup epoch equals current epoch. So if master is not recruiting backup workers, it should set the oldest backup epoch as the current epoch.	2020-03-20 20:16:43 -07:00
Jingyu Zhou	5359528132	Reduce a call to getLogSystemConfig()	2020-03-20 20:15:09 -07:00
Jingyu Zhou	12ed8ad536	Fix backup worker start version when logset start version is lower The start version of tlog set can be smaller than the last epoch's end version. In this case, set backup worker's start version as last epoch's end version to avoid overlapping of version ranges among backup workers.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	80d3fa1222	Add delay for master to recruit backup workers This delay is to ensure old epoch's backup workers can save their progress in the database. Otherwise, the new master could attempts to recruit backup workers for the old epoch on version ranges that have already been popped. As a result, the logs will lose data.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	fda6c08640	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-20 20:13:38 -07:00
Jingyu Zhou	5bf62c8f85	Reduce a call to getLogSystemConfig()	2020-03-19 10:08:19 -07:00
Jingyu Zhou	89d8f13038	Fix backup worker start version when logset start version is lower The start version of tlog set can be smaller than the last epoch's end version. In this case, set backup worker's start version as last epoch's end version to avoid overlapping of version ranges among backup workers.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	15437ffb53	Add delay for master to recruit backup workers This delay is to ensure old epoch's backup workers can save their progress in the database. Otherwise, the new master could attempts to recruit backup workers for the old epoch on version ranges that have already been popped. As a result, the logs will lose data.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	d8c6bf585d	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-18 16:39:40 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	56dee89e6e	active generations should include the current one	2020-03-16 11:09:42 -07:00
Evan Tschannen	e5d53c863b	report in status the number of active generations	2020-03-16 10:29:17 -07:00
Evan Tschannen	818537ed2d	Update fdbserver/masterserver.actor.cpp Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2020-03-14 15:04:46 -07:00
Evan Tschannen	2f2f56020f	Update fdbserver/masterserver.actor.cpp Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2020-03-13 15:54:13 -07:00
Evan Tschannen	a39effa57d	delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state	2020-03-13 10:28:32 -07:00
negoyal	cd949eca71	Merge branch 'master' into fdb_cache_subfeature2	2020-02-26 11:22:08 -08:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
A.J. Beamon	df2b0452b4	Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef.	2020-02-06 13:19:24 -08:00
negoyal	85cc35e81e	Merge branch 'master' into HEAD	2020-02-05 14:59:55 -08:00
Jingyu Zhou	52c6737411	Rename backupLoggingEnabled as backupWorkerEnabled To highlight the changes for 7.0 backup changes. By default, backup_worker_enabled flag is set for 7.0 version.	2020-02-04 10:09:16 -08:00
Jingyu Zhou	0db03f1d3c	Use backup_logging_enabled flag The default is to enable new backup workers. Users can disable this flag to turn off the backup worker feature.	2020-02-03 20:03:22 -08:00
Jingyu Zhou	38aa1903fd	Add a DB configuration option for backup workers Right now, the default is to keep the old backup behavior, i.e., do NOT use backup workers. Specifically, if BackupType is not set (or is set to default), the master will not recruit backup workers and will not add pseudo locality for backup workers. The StartFullBackupTaskFunc is updated to check if backup worker is enabled. Only when it is not enabled, starting a backup will wait on all backup workers to be started.	2020-01-31 19:29:09 -08:00
mpilman	6cc827277f	delete dead code	2020-01-24 14:28:09 -08:00
mpilman	4c3afa4208	Merge branch 'features/cache-initialization' of github.com:mpilman/foundationdb into features/cache-initialization	2020-01-24 11:03:25 -08:00
mpilman	51717c970d	Fixed management api	2020-01-24 11:00:50 -08:00
Jingyu Zhou	8b67a89eed	More review comments fixed.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	1eaea91cb3	Address review comments	2020-01-22 19:42:13 -08:00
Jingyu Zhou	e14246ac16	Add more information for trace events	2020-01-22 19:42:13 -08:00
Jingyu Zhou	4bed33031f	Set backup worker start version to be savedVersion + 1 If no progress found, start version is set to epochBegin. So the start version is the one after the last saved (or from last epoch's saved) version.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	4ed75e37f3	BackupProgress uses old epoch's begin version if no progress found Get rid of the complex logic of choosing the largest saved version from previous epoch for the oldest epoch. Instead, use the begin version now available from log system.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	19eacac3ce	Add a unit test for BackupProgress	2020-01-22 19:38:46 -08:00
Jingyu Zhou	64052f6349	Check and fill backup gaps for old epochs and tags Sometimes the backup worker has not updated progress to the system space and a master recovery happens. As a result, next epoch doesn't know the progress of previous ones. This change is to check for such missing gaps and fill them with the whole range [startVersion, endVersion). The code is refactored into BackupProgress.actor.* to consolidate backup progress processing for the master server.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	ed54aaa09e	Fix a crash failure of empty backup interface	2020-01-22 19:38:46 -08:00
Jingyu Zhou	23985da6a0	Use backup worker failed error code during recovery And use override instead of virtual in TagPartitionedLogSystem.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	840e74d696	Allow storage server queue in consistency check The backup worker needs to update its progress even during consistency check by commit transactions to the database. Thus we can't really achieve zero storage server queue. So add a limit of 10,000 to pass the consistency check.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	9567bf730d	Fix a crash due to null log system When a master starts, backup worker from old epochs may send BackupWorkerDoneRequest to it. The master can be safely ignore it, since the checkRemoved logic of the backup worker can self exit then.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	0c08161d8e	Remove old backup workers when done For backup workers working on old epochs, once their work is done, they will notify the master. Then the master removes them from the log system and acknowledge back to the backup workers so that they can gracefully shut down. The popping of a backup worker is stalled if there are workers from older epochs still working. Otherwise, workers from old epochs will lost data. However, allowing newer epoch to start backup can cause holes in version ranges. The restore process must verify the backup progress to make sure there are no holes, otherwise it has to wait.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	85c4a4e422	Address review comments for PR #1625	2020-01-22 19:38:45 -08:00
Jingyu Zhou	22f4bef589	Fix a race that backup workers may not be registered After the backup worker recruitment is done, we need to force trigger the registration with cluster controller. Otherwise, the log system may not have the backup workers, which can stall backup workers from obtaining a cursor and resulting in mutations being kept in TLogs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	73824faf65	Track pseudo tags popping for individual IDs For each log router ID, we track the popped version of each pseudo tag so that the popping only applied to the minimum of these versions. Also add more tracing for popping and epochs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	580151e1d4	Refactor code using C++ 17 iterator	2020-01-22 19:38:45 -08:00
Jingyu Zhou	c2b8ee3b53	Small improvement	2020-01-22 19:38:45 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remains in the cluster until the next epoch, when it will remove itself.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	ac851619bb	Fix merge errors with master	2020-01-22 19:38:45 -08:00
Jingyu Zhou	11964733b7	WIP: should be divided into smaller commits.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	41f0cf2bb5	Add decode function for backup progress	2020-01-22 19:38:45 -08:00
Jingyu Zhou	a4d6ebe79e	Recruit backup worker in newEpoch	2020-01-22 19:37:48 -08:00
Jingyu Zhou	eac49bca04	Add backup worker recruitment in master.	2020-01-22 19:35:30 -08:00
negoyal	e8e5f2d118	Bug fixes in the cache role.	2020-01-07 16:51:40 -08:00
negoyal	cf2563f1c7	Mix of various things, a lot of which will change.	2019-12-05 17:10:32 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Jon Fu	d96a7b2c69	Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed	2019-10-03 09:47:45 -07:00
Evan Tschannen	3cc5d484a5	the include and exclude commands do not need to set the moveKeysLockOwnerKey, which will kill the data distribution algorithm	2019-09-27 18:33:56 -07:00
A.J. Beamon	1f8a157b35	Extend the length allowed for configuration fields. Log the config if recovery fails due to invalid config.	2019-09-05 15:36:37 -07:00
Andrew Noyes	6aa0ada7b1	Replace scalar root types with proper messages	2019-08-28 14:40:50 -07:00
Evan Tschannen	4c9a392f05	the master checks the popped version of the txsTag before recovering the txnStateStore, to avoid restoring data that is later found to be popped	2019-08-05 17:01:48 -07:00
Evan Tschannen	5c98dcce6d	revert the proxy forwarding path, because it is no longer necessary as clients keep a persistent connection open with coordinators	2019-07-27 16:46:22 -07:00
Evan Tschannen	b509a441e7	Merge branch 'master' into feature-skip-confirm # Conflicts: # bindings/flow/tester/Tester.actor.cpp # bindings/go/src/_stacktester/stacktester.go # bindings/java/src/test/com/apple/foundationdb/test/AsyncStackTester.java # bindings/java/src/test/com/apple/foundationdb/test/StackTester.java # bindings/python/tests/tester.py # bindings/ruby/tests/tester.rb # documentation/sphinx/source/api-c.rst # documentation/sphinx/source/api-python.rst # documentation/sphinx/source/api-ruby.rst # documentation/sphinx/source/data-modeling.rst # documentation/sphinx/source/developer-guide.rst # fdbclient/vexillographer/fdb.options # fdbserver/MasterProxyServer.actor.cpp	2019-07-27 15:08:13 -07:00
Evan Tschannen	02de53160d	only skip confirm epoch live if CAUSAL_READ_RISKY is enabled time checked on the proxy should be less than the time waited by the master to account for clock speed differences setting REQUIRED_MIN_RECOVERY_DURATION and ENFORCED_MIN_RECOVERY_DURATION to 0 will go back to the old behavior	2019-07-12 17:58:16 -07:00
Evan Tschannen	a63969afb3	enforce a minimum recovery duration, which allows proxies to avoid checking if the epoch is alive as long as its last commit has been less than MINIMUM_RECOVERY_DURATION ago	2019-07-12 13:10:21 -07:00
Evan Tschannen	d8948c8be1	Merge branch 'master' into feature-fast-txs-recovery # Conflicts: # fdbserver/TagPartitionedLogSystem.actor.cpp	2019-07-10 13:59:52 -07:00
Evan Tschannen	c348b3da51	After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies	2019-07-08 12:53:40 -07:00
Evan Tschannen	15e894c724	Merge in master	2019-07-05 15:49:24 -07:00
Alex Miller	ea6898144d	Merge remote-tracking branch 'upstream/master' into flowlock-api	2019-07-03 20:44:15 -07:00
Jingyu Zhou	b69d7adabc	Remove unused remoteRecovered from master server	2019-07-01 15:41:35 -07:00
Evan Tschannen	52efcfd136	fix: properly create the right number for txsTags when changing between different numbers of logs	2019-06-27 15:15:05 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Evan Tschannen	e0be631414	shard the txs tag so that more transaction logs are involved in its recovery	2019-06-19 18:15:09 -07:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Jingyu Zhou	8b5449e608	Fix review comments for PR #1473	2019-04-29 16:45:42 -07:00
Jingyu Zhou	966ec30fcc	Add pseudoLocalities for special tag consumers	2019-04-21 10:41:07 -07:00
mpilman	1c16f87a4e	Remove trace-calls to printable (in non-workloads)	2019-04-05 13:12:19 -07:00
Evan Tschannen	f5de52de91	fix: cancel the previous log system recruitment before calling newEpoch, to avoid multiple actors attempting to modify oldLogSystem at the same time	2019-04-01 16:38:25 -07:00
Evan Tschannen	b6008558d3	renamed BinaryWriter.toStringRef() to .toValue(), because the function now returns a Standalone<StringRef>() eliminated an unnecessary copy from the proxy commit path eliminated an unnecessary copy from buffered peek cursor	2019-03-28 11:52:50 -07:00
Evan Tschannen	6254a1a8e4	fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop	2019-03-22 18:37:39 -07:00
Evan Tschannen	2605257737	Merge branch 'master' of github.com:apple/foundationdb	2019-03-19 18:47:29 -07:00
Evan Tschannen	5b9c45ea0b	clients do not attempt to connect to provisional proxies	2019-03-19 13:37:50 -07:00
Balachandar Namasivayam	5471725db5	Support config where the primary and remote DC's can be used as satellites.	2019-03-18 12:17:59 -07:00
Evan Tschannen	a7e45cff91	Merge pull request #1176 from jzhou77/ratekeeper Make Ratekeeper a separate role	2019-03-12 15:58:59 -07:00
Evan Tschannen	2627bcd35e	Merge branch 'master' into feature-metadata-version	2019-03-10 21:13:28 -07:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
Alex Miller	c6a65389ae	Remove noexcept macro and replace with BOOST_NOEXCEPT. BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a way that is correctly maintained over time.	2019-03-05 22:06:12 -08:00
anoyes	981426bac9	More ide fixes	2019-03-05 18:03:57 -08:00
Evan Tschannen	3da85f3acd	implemented the \xff/metadataVersion key, which can be used by layers to help them cheaply cache metadata and know when their cache is invalid	2019-02-28 17:45:00 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	0e19b5a935	fix: allow the txnStateStore to be recovered from a process in a down datacenter, so that the cluster controller can know to switch to the other region	2019-02-21 16:52:27 -08:00
Evan Tschannen	3a572b010f	fix: a forced recovery needed to force the data distributor to restart	2019-02-19 16:04:52 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
Evan Tschannen	8ed89fd711	fixed review comments	2019-02-19 11:26:53 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	ccaa860ffc	fix: all storage servers must reboot during a forced recovery, because their rejoin commit might have been lost	2019-02-18 15:27:18 -08:00
Evan Tschannen	9cfadad41b	fix: if the tagPartitionedLogSystem cannot do a forced recovery, the master should not execute it forced recovery based modifications either	2019-02-18 15:13:18 -08:00
Evan Tschannen	8f2af8bed1	fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe to the client to re-send forced recovery commands to the cluster controller. fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date. fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited	2019-02-18 14:54:28 -08:00
Evan Tschannen	4c35ebdcc6	fix: because of forced recoveries, storage servers in remote regions cannot update their durable version to (lastLogVersion - 5e6), because the lastLogVersion might have jumped due to an epoch end and the recovery version after the forced recovery could be before the epoch end, causing the storage server to want to rollback to a version it does not have on disk	2019-02-18 14:40:30 -08:00
Evan Tschannen	05ca0a10d8	fix: kill all storage servers which are not in the safe locality after a forced recovery	2019-02-18 14:30:51 -08:00
Jingyu Zhou	6a655143e8	A follow-on fix for config key usage And some trace event cleanups.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	aea602d9c7	Remove getRecoveryInfo from master interface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let cluster controller to start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Evan Tschannen	e45952bc53	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/BackupContainer.actor.cpp # fdbclient/BlobStore.actor.cpp # fdbclient/HTTP.actor.cpp # tests/BlobStore.txt # versions.target	2018-11-13 16:06:39 -08:00
Evan Tschannen	1bd615f954	fix: remoteDcIds will not actually have transaction logs unless usable regions is > 1	2018-11-13 12:36:04 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	7892da032f	fix: Do not remove the locality entry for the current transaction logs when removing storage servers fix: dcId_locality map could be incorrect after restarting recruitEverything	2018-11-11 12:37:53 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	6bb283aebc	fix: dcId to Locality changes could be lost if an emergency transaction happened that did not change the configuration fix: master proxy was starting dcId’s at 1 number too large	2018-11-05 11:12:43 -08:00
Evan Tschannen	87295cc263	suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created	2018-11-04 23:07:56 -08:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	3922e477a5	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/LogSystemDiskQueueAdapter.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp	2018-10-03 16:57:18 -07:00
Evan Tschannen	cdaf5e1192	fix: forced recovery does not recover tags from any DC besides the surviving one	2018-10-02 17:46:22 -07:00
Evan Tschannen	e7e1c634e0	fix: we need to restart the peek cursor when the known committed version becomes available	2018-10-02 17:44:14 -07:00
Evan Tschannen	05e7f08b26	added a peek method which will attempt to read the txsTag from the local region as much as possible	2018-09-28 12:21:08 -07:00
Evan Tschannen	200e65fe61	added a workload which tests killing an entire region, and recovering from the failure with data loss. fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore fix: we have to modify all of history, we cannot stop after finding a local remote	2018-09-17 18:32:39 -07:00
Evan Tschannen	90301f497f	Merge branch 'release-6.0' # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbrpc/TLSConnection.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/Status.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # versions.target	2018-09-05 16:06:33 -07:00
Evan Tschannen	90bf277206	require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed	2018-08-31 13:07:48 -07:00
A.J. Beamon	2a97139d5d	This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.	2018-08-16 10:24:12 -07:00
Alex Miller	fb31a6999f	Rewrite all files to have #include actorcompiler.h as the last include.	2018-08-14 15:50:26 -07:00
Alex Miller	535b5701e5	Rewrite all `Void _ = wait(...)` -> `wait(...)`. This takes advantage of the new actorcompiler functionality to avoid having duplicate definitions of `Void _` when trying to feed the un-actorompiled source through clang.	2018-08-14 15:50:26 -07:00
Evan Tschannen	9d0a07a400	fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states	2018-08-04 12:50:56 -07:00
Evan Tschannen	30b2f85020	fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you will no remaining logs	2018-07-14 16:26:45 -07:00
Evan Tschannen	b9f2b80129	deleted spammy trace event	2018-07-09 22:02:15 -07:00
Evan Tschannen	6b40f2764d	fix: off by one error on popping missing tags	2018-07-09 15:43:22 -07:00
Evan Tschannen	da5a232d7e	fix: If we have not recruited the remote logs yet and detect a configuration change, we must fail the master to update the remote recruitment request	2018-07-05 12:17:41 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	866ccfe344	added the ability to allow the master to finish recovery before all storage servers in both regions have their mutations. This allows you to recover from scenarios where you lose all your tlogs in one dc.	2018-07-04 01:59:04 -04:00
Evan Tschannen	3c9f3da980	fix: usable regions cannot be changed during an emergency transaction, because it could lead to all storage servers dying if the previous primary is dead	2018-07-01 23:59:06 -04:00
Evan Tschannen	7a12d3e130	added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.	2018-07-01 09:39:04 -04:00
Evan Tschannen	8a8914f046	re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN	2018-06-22 00:04:00 -07:00
Evan Tschannen	0913368651	added usable_regions to specify if we will replicate into a remote region remote replication defaults to the primary replication removed remote_logs, because they should be specified as an override in the regions object	2018-06-17 19:31:15 -07:00
Evan Tschannen	284233baa1	added a key in the database with the locality of the current master	2018-06-14 19:36:02 -07:00
Evan Tschannen	fbb3f85c74	fix: logsKey was not being updated properly	2018-06-14 12:54:39 -07:00
Evan Tschannen	889889323e	The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there. The cluster controller will not switch to the other data center if remote logs are too far behind. We will not recruit in DCs with negative priority.	2018-06-13 18:14:14 -07:00
Alex Miller	fcfa00928b	Make RecoveryState an enum class. This means that all the == 7 or != 0 checks go away, and explicit names must be used.	2018-06-12 16:50:25 -07:00
A.J. Beamon	e5488419cc	Attempt to normalize trace events: * Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check. * Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase. * Use seconds instead of milliseconds in details. Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed. This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.	2018-06-08 11:11:08 -07:00
Evan Tschannen	b1935f1738	fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed within the known committed version and recovery version, they storage server will need see either the add or remove when it peeks	2018-05-05 18:16:28 -07:00
Evan Tschannen	35b2ca820a	fix: certain tlog errors during remote recovery could fail to kill the master, the master could have a reference counting cycle with its actor collection	2018-04-24 16:10:14 -07:00
Evan Tschannen	73597f190e	fix: new tlogs are initialized with exactly the tags which existed at the recovery version	2018-04-22 20:28:01 -07:00
Evan Tschannen	3018a7b1b3	fix: the known committed version of a newly initialized log is 1, since by definition the first commit must have succeeded	2018-04-16 10:42:48 -07:00
Evan Tschannen	a8662f8737	fix: remote recovered is does not need to wait for old logs to be removed	2018-04-16 10:14:39 -07:00
Evan Tschannen	3453a51d0f	remoteRecovery was still swallowing errors	2018-04-10 13:31:24 -07:00
Evan Tschannen	5fcedd2e98	fix: coordinated state errors were being eaten	2018-04-10 11:14:57 -07:00
Evan Tschannen	7af892f50b	first working version of non-copying recovery working with fearless configurations	2018-04-08 21:24:05 -07:00
Evan Tschannen	b36e08f08f	first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet	2018-03-29 15:12:38 -07:00

293 Commits