foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	5a602f58e8	Start backup with a wait on all backup workers running This wait is to make sure that backup workers are already saving mutations so that no mutations are missed. The idea is that the CLI sets a "backupStartedKey" in the database and waits for the allWorkerStarted() key of the backup to be set. Backup workers monitor the changes to the "backupStartedKey" and start logging mutations. Additionally, the backup worker for Tag(-2,0) monitors whether all other workers have started (checking that their saved progress version is larger than the backup's start version), and then sets the allWorkerStarted() key for the backup.	2020-01-31 19:29:09 -08:00
Jingyu Zhou	c08a192c75	Add a backup start key If the backup key is not set, do not recruit backup workers for old epochs.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remain in the cluster until the next epoch, when they remove themselves.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	41f0cf2bb5	Add decode function for backup progress	2020-01-22 19:38:45 -08:00
Jingyu Zhou	7da9f47f26	Enable pop from backup workers This is still WIP as some edge cases can trigger test failure, most likely due to not popping mutations by backup workers when epoch ends.	2020-01-22 19:38:45 -08:00
negoyal	d46c7ded59	Merge remote-tracking branch 'origin/master' into storage-cache-subfeature1	2019-11-14 17:52:22 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Meng Xu	58aa6711e4	FastRestore:ApplyToDB:BugFix:Serialize integer as bigEndian to ensure lexico order	2019-11-03 17:26:07 -08:00
Andrew Noyes	b7b5d2ead3	Remove several nonsensical const uses These seem to be all the ones that clang's -Wignored-qualifiers complains about	2019-10-26 14:30:34 -07:00
Andrew Noyes	de8921b660	Move RestoreWorkerInterface to fdbclient	2019-10-25 10:42:22 -07:00
Andrew Noyes	d4de608bb6	Fix OPEN_FOR_IDE build	2019-10-25 10:42:22 -07:00
Jon Fu	f4237ebfff	Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed	2019-10-16 11:32:16 -07:00
Meng Xu	84b5a5525f	FastRestore:Add restoreApplierKeys	2019-10-10 17:18:34 -07:00
Jon Fu	471e283128	Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed	2019-09-18 11:49:07 -07:00
Meng Xu	d160810662	FastRestore:Resolve review comments	2019-09-04 16:48:43 -07:00
Jon Fu	c908c6c1db	added command to fdbcli and changes to SystemData and ManagementAPI	2019-08-27 14:39:43 -07:00
Meng Xu	7ff46e6772	Merge branch 'master' into mengxu/performant-restore-PR	2019-08-07 20:31:56 -07:00
Evan Tschannen	ba54508c47	code cleanup	2019-08-06 16:30:30 -07:00
Meng Xu	9cc832cfd6	FastRestore:Fix Mac and Windows compilation error	2019-08-02 14:33:08 -07:00
Meng Xu	3b54363780	FastRestore:Apply Clang-format	2019-08-01 18:09:12 -07:00
Meng Xu	7ccaeddf05	Merge branch 'master' into mengxu/performant-restore-PR	2019-08-01 13:23:17 -07:00
Xin Dong	1922c39377	Resolve review comments. 100K run shows one suspicious ASSERT_WE_THINK failure which I think could be a race.	2019-07-30 22:24:30 -07:00
Xin Dong	ae11efcb0a	Made following changes: - Make sure the disabled data distribution won't be accidentally enabled by the 'maintenance' command - Make sure the status json reflects the status of DD accordingly - Make sure the CLI can play with the new DD states correctly, i.e. print out warnings when necessary	2019-07-30 22:20:45 -07:00
Xin Dong	4ecfc9830f	Added finer grained controls to DataDistribution in fdbcli. What's happening under the hood is: - Use pre-existing 'healthZone' key and write a special value to it in order to disable DD for all storage server failures - Use a new system key 'rebalanceDDIgnored' key to disable/enable DD for all rebalance reasons(MountainChopper and ValleyFiller) Kicked off two 200K correctness and showed no related errors.	2019-07-30 22:17:21 -07:00
Meng Xu	b0c31f28af	FastRestore:Fix bug that blocks restore 1) Should recruit only configured number of roles; 2) Should never register a restore master interface as a restore worker (loader or applier) interface.	2019-07-25 17:55:37 -07:00
Meng Xu	45083edf74	Merge branch 'master' into mengxu/performant-restore-PR Fix conflicts as well.	2019-07-25 10:46:11 -07:00
Balachandar Namasivayam	7489f83a7f	Disable/Re-enable consistency check through a database key. fdbcli has a new command 'consistencycheck' to disable/re-enable consistency check. cluster_healthy metric in status becomes false if consistencycheck is disabled.	2019-06-20 21:38:45 -07:00
Meng Xu	022b555b69	FastRestore:Fix bug in finish restore RestoreMaster may not receive all acks for the last command, i.e., finishRestore, because RestoreLoaders and RestoreAppliers exit immediately after sending the ack. If the ack is lost, it will not be resent. This commit also removes some unneeded code. This commit passes 50k random tests without errors.	2019-06-05 20:07:18 -07:00
Meng Xu	477fd152c0	FastRestore:Refactor code 1) Use the runRYWTransaction for simple DB access 2) Replace some printf with TraceEvent 3) Remove printf not used in debugging 4) Avoid wait inside the condition in loop-choose-when for the core routine of restore worker, loader and applier. 5) Rename Restore.actor.cpp to RestoreWorker.actor.cpp since the file only has functionalities related to restore worker. Passed correctness test	2019-06-04 11:22:47 -07:00
sramamoorthy	4083af0b01	Avoid using trackLatest for TLog pop test cases	2019-05-28 22:07:46 -07:00
sramamoorthy	69edefe68b	Snapshot based backup and restore implementation	2019-05-28 22:07:46 -07:00
Evan Tschannen	b451c2cd56	Merge pull request #1497 from alexmiller-apple/fastrecovery Add an \xff keyrange that is backed by the txnStateStore.	2019-05-23 10:52:35 -07:00
Meng Xu	fac63a83c4	FastRestore:Use NotifiedVersion to deduplicate requests Add a NotifiedVersion into an applier data which represents the smallest version the applier is at. When a loader sends mutation vector to appliers, it sends the request that contains prevVersion and commitVersion. This commit also puts actors into an actorCollector for loop-choose-when situation.	2019-05-22 22:09:54 -07:00
Meng Xu	f235bb7e0d	FastRestore:Use readVersion to trigger watch Use readVersion to trigger watch on the restoreRequestTriggerKey and restoreRequestDoneKey.	2019-05-22 13:20:59 -07:00
Evan Tschannen	f3897238f8	added the ability to add a read conflict range on the metadata version key without the READ_SYSTEM_KEYS option	2019-05-15 10:13:38 -07:00
Meng Xu	a08a6776f5	FastRestore: Refactor to smaller components The current code uses one restore interface to handle the work for all restore roles, i.e., master, loader and applier. This makes it harder to review or maintain or scale. This commit splits the restore into multiple roles by mimicking FDB transaction system: 1) It uses a RestoreWorker as the process to host restore roles; This commit assumes one restore role per RestoreWorker; but it should be easy to extend to support multiple roles per RestoreWorker; 2) It creates 3 restore roles: RestoreMaster: Coordinate the restore process and send commands to the other two roles; RestoreLoader: Parse backup files to mutations and send mutations to appliers; RestoreApplier: Sort received mutations and apply them to DB in order. Compilable version. To be tested in correctness.	2019-05-10 14:20:06 -07:00
Meng Xu	25c75f4222	FastRestore: Add new empty files for restore roles Add .h and .cpp files for RestoreLoader and RestoreApplier roles. We will split the code for each restore role into a separate file. This commit also fixes the bug in including RestoreCommon.actor.h, and removes the unused code.	2019-05-06 16:59:41 -07:00
Alex Miller	797d431934	Add an \xff keyrange that is backed by the txnStateStore.	2019-04-25 17:04:20 -07:00
Meng Xu	c4a8a80d6f	Merge branch 'apple/master' into mengxu/performant-restore-PR	2019-04-04 22:51:00 -07:00
Meng Xu	eb1e880fef	FastRestore: Rename RestoreCommandInterface Rename it to RestoreInterface. The new name is more general because we will have different types of RequestStreams for each type of command.	2019-04-04 13:52:24 -07:00
Evan Tschannen	781cf9b5a0	added the ability to make a zoneId for maintenance in fdbcli	2019-04-01 17:55:13 -07:00
Meng Xu	70d7c289f4	Merge branch 'master' into mengxu/restore/parallel-v7	2019-03-30 22:13:10 -07:00
Meng Xu	ee70bbf318	FastRestore: Correct running after refactor Test on one test case and passed.	2019-03-14 16:45:04 -07:00
Meng Xu	00d1e5e70a	FastRestore: Add command UID and code clean Change variable name to a shorter name Remove most unused code Compilable at this commit	2019-03-10 17:17:18 -07:00
Evan Tschannen	3da85f3acd	implemented the \xff/metadataVersion key, which can be used by layers to help them cheaply cache metadata and know when their cache is invalid	2019-02-28 17:45:00 -08:00
Evan Tschannen	3a572b010f	fix: a forced recovery needed to force the data distributor to restart	2019-02-19 16:04:52 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	4c35ebdcc6	fix: because of forced recoveries, storage servers in remote regions cannot update their durable version to (lastLogVersion - 5e6), because the lastLogVersion might have jumped due to an epoch end and the recovery version after the forced recovery could be before the epoch end, causing the storage server to want to rollback to a version it does not have on disk	2019-02-18 14:40:30 -08:00
Evan Tschannen	05ca0a10d8	fix: kill all storage servers which are not in the safe locality after a forced recovery	2019-02-18 14:30:51 -08:00
Evan Tschannen	107b361396	Update fdbclient/SystemData.h Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00

88 Commits