foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	12ed8ad536	Fix backup worker start version when logset start version is lower The start version of tlog set can be smaller than the last epoch's end version. In this case, set backup worker's start version as last epoch's end version to avoid overlapping of version ranges among backup workers.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	80d3fa1222	Add delay for master to recruit backup workers This delay is to ensure old epoch's backup workers can save their progress in the database. Otherwise, the new master could attempt to recruit backup workers for the old epoch on version ranges that have already been popped. As a result, the logs will lose data.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	fda6c08640	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-20 20:13:38 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	56dee89e6e	active generations should include the current one	2020-03-16 11:09:42 -07:00
Evan Tschannen	e5d53c863b	report in status the number of active generations	2020-03-16 10:29:17 -07:00
Evan Tschannen	818537ed2d	Update fdbserver/masterserver.actor.cpp Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2020-03-14 15:04:46 -07:00
Evan Tschannen	2f2f56020f	Update fdbserver/masterserver.actor.cpp Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2020-03-13 15:54:13 -07:00
Evan Tschannen	a39effa57d	delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state	2020-03-13 10:28:32 -07:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
A.J. Beamon	df2b0452b4	Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef.	2020-02-06 13:19:24 -08:00
Jingyu Zhou	52c6737411	Rename backupLoggingEnabled as backupWorkerEnabled To highlight the changes for 7.0 backup changes. By default, backup_worker_enabled flag is set for 7.0 version.	2020-02-04 10:09:16 -08:00
Jingyu Zhou	0db03f1d3c	Use backup_logging_enabled flag The default is to enable new backup workers. Users can disable this flag to turn off the backup worker feature.	2020-02-03 20:03:22 -08:00
Jingyu Zhou	38aa1903fd	Add a DB configuration option for backup workers Right now, the default is to keep the old backup behavior, i.e., do NOT use backup workers. Specifically, if BackupType is not set (or is set to default), the master will not recruit backup workers and will not add pseudo locality for backup workers. The StartFullBackupTaskFunc is updated to check if backup worker is enabled. Only when it is not enabled, starting a backup will wait on all backup workers to be started.	2020-01-31 19:29:09 -08:00
Jingyu Zhou	8b67a89eed	More review comments fixed.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	1eaea91cb3	Address review comments	2020-01-22 19:42:13 -08:00
Jingyu Zhou	e14246ac16	Add more information for trace events	2020-01-22 19:42:13 -08:00
Jingyu Zhou	4bed33031f	Set backup worker start version to be savedVersion + 1 If no progress found, start version is set to epochBegin. So the start version is the one after the last saved (or from last epoch's saved) version.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	4ed75e37f3	BackupProgress uses old epoch's begin version if no progress found Get rid of the complex logic of choosing the largest saved version from previous epoch for the oldest epoch. Instead, use the begin version now available from log system.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	19eacac3ce	Add a unit test for BackupProgress	2020-01-22 19:38:46 -08:00
Jingyu Zhou	64052f6349	Check and fill backup gaps for old epochs and tags Sometimes the backup worker has not updated progress to the system space and a master recovery happens. As a result, next epoch doesn't know the progress of previous ones. This change is to check for such missing gaps and fill them with the whole range [startVersion, endVersion). The code is refactored into BackupProgress.actor.* to consolidate backup progress processing for the master server.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	ed54aaa09e	Fix a crash failure of empty backup interface	2020-01-22 19:38:46 -08:00
Jingyu Zhou	23985da6a0	Use backup worker failed error code during recovery And use override instead of virtual in TagPartitionedLogSystem.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	840e74d696	Allow storage server queue in consistency check The backup worker needs to update its progress even during consistency check by committing transactions to the database. Thus we can't really achieve zero storage server queue. So add a limit of 10,000 to pass the consistency check.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	9567bf730d	Fix a crash due to null log system When a master starts, backup workers from old epochs may send BackupWorkerDoneRequest to it. The master can safely ignore these requests, since the checkRemoved logic of the backup worker lets it exit by itself.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	0c08161d8e	Remove old backup workers when done For backup workers working on old epochs, once their work is done, they will notify the master. Then the master removes them from the log system and acknowledges the backup workers so that they can gracefully shut down. The popping of a backup worker is stalled if there are workers from older epochs still working. Otherwise, workers from old epochs will lose data. However, allowing newer epoch to start backup can cause holes in version ranges. The restore process must verify the backup progress to make sure there are no holes, otherwise it has to wait.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	85c4a4e422	Address review comments for PR #1625	2020-01-22 19:38:45 -08:00
Jingyu Zhou	22f4bef589	Fix a race that backup workers may not be registered After the backup worker recruitment is done, we need to force trigger the registration with cluster controller. Otherwise, the log system may not have the backup workers, which can stall backup workers from obtaining a cursor and result in mutations being kept in TLogs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	73824faf65	Track pseudo tags popping for individual IDs For each log router ID, we track the popped version of each pseudo tag so that the popping is only applied to the minimum of these versions. Also add more tracing for popping and epochs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	580151e1d4	Refactor code using C++ 17 iterator	2020-01-22 19:38:45 -08:00
Jingyu Zhou	c2b8ee3b53	Small improvement	2020-01-22 19:38:45 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remain in the cluster until the next epoch, when they will remove themselves.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	ac851619bb	Fix merge errors with master	2020-01-22 19:38:45 -08:00
Jingyu Zhou	11964733b7	WIP: should be divided into smaller commits.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	41f0cf2bb5	Add decode function for backup progress	2020-01-22 19:38:45 -08:00
Jingyu Zhou	a4d6ebe79e	Recruit backup worker in newEpoch	2020-01-22 19:37:48 -08:00
Jingyu Zhou	eac49bca04	Add backup worker recruitment in master.	2020-01-22 19:35:30 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Jon Fu	d96a7b2c69	Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed	2019-10-03 09:47:45 -07:00
Evan Tschannen	3cc5d484a5	the include and exclude commands do not need to set the moveKeysLockOwnerKey, which will kill the data distribution algorithm	2019-09-27 18:33:56 -07:00
A.J. Beamon	1f8a157b35	Extend the length allowed for configuration fields. Log the config if recovery fails due to invalid config.	2019-09-05 15:36:37 -07:00
Andrew Noyes	6aa0ada7b1	Replace scalar root types with proper messages	2019-08-28 14:40:50 -07:00
Evan Tschannen	4c9a392f05	the master checks the popped version of the txsTag before recovering the txnStateStore, to avoid restoring data that is later found to be popped	2019-08-05 17:01:48 -07:00
Evan Tschannen	5c98dcce6d	revert the proxy forwarding path, because it is no longer necessary as clients keep a persistent connection open with coordinators	2019-07-27 16:46:22 -07:00
Evan Tschannen	b509a441e7	Merge branch 'master' into feature-skip-confirm # Conflicts: # bindings/flow/tester/Tester.actor.cpp # bindings/go/src/_stacktester/stacktester.go # bindings/java/src/test/com/apple/foundationdb/test/AsyncStackTester.java # bindings/java/src/test/com/apple/foundationdb/test/StackTester.java # bindings/python/tests/tester.py # bindings/ruby/tests/tester.rb # documentation/sphinx/source/api-c.rst # documentation/sphinx/source/api-python.rst # documentation/sphinx/source/api-ruby.rst # documentation/sphinx/source/data-modeling.rst # documentation/sphinx/source/developer-guide.rst # fdbclient/vexillographer/fdb.options # fdbserver/MasterProxyServer.actor.cpp	2019-07-27 15:08:13 -07:00
Evan Tschannen	02de53160d	only skip confirm epoch live if CAUSAL_READ_RISKY is enabled; time checked on the proxy should be less than the time waited by the master to account for clock speed differences; setting REQUIRED_MIN_RECOVERY_DURATION and ENFORCED_MIN_RECOVERY_DURATION to 0 will go back to the old behavior	2019-07-12 17:58:16 -07:00
Evan Tschannen	a63969afb3	enforce a minimum recovery duration, which allows proxies to avoid checking if the epoch is alive as long as its last commit has been less than MINIMUM_RECOVERY_DURATION ago	2019-07-12 13:10:21 -07:00
Evan Tschannen	d8948c8be1	Merge branch 'master' into feature-fast-txs-recovery # Conflicts: # fdbserver/TagPartitionedLogSystem.actor.cpp	2019-07-10 13:59:52 -07:00
Evan Tschannen	c348b3da51	After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies	2019-07-08 12:53:40 -07:00
Evan Tschannen	15e894c724	Merge in master	2019-07-05 15:49:24 -07:00

172 Commits