Commit Graph

3939 Commits

Author SHA1 Message Date
Jingyu Zhou 6be913a430 Add partitioned logs option to AtomicRestore workload 2020-03-26 13:04:00 -07:00
Jingyu Zhou aca458cd96 Set 50% chance to restore old backup files for fast restore 2020-03-26 13:04:00 -07:00
Jingyu Zhou 99f4ef6e0c Fix restore loader to handle mutation sub number
For the old backup format, give mutations a sub sequence number starting from 0
for each commit version.
2020-03-26 13:04:00 -07:00
Jingyu Zhou 40b17e1e9b Remove a no longer used knob 2020-03-26 13:04:00 -07:00
Jingyu Zhou 772ab70aee Add an option for fast restore to restore old backups
If "usePartitionedLogs" is set to false, then the workload uses old backups for
restore.
2020-03-26 13:04:00 -07:00
Meng Xu 1052b23ee1
Merge pull request #2370 from atn34/test-watch-outliving-transaction
Test watch outliving transaction
2020-03-26 12:40:38 -07:00
Andrew Noyes cdb6bbfc85 Test watch outliving transaction 2020-03-26 10:09:03 -07:00
Jingyu Zhou feedab02a0
Merge pull request #2855 from xumengpanda/mengxu/fr-api-atomicrestore-PR
Add ApiCorrectnessAtomicRestore workload for the new performant restore
2020-03-25 18:05:26 -07:00
Evan Tschannen bb5799bd20
Merge pull request #2642 from xumengpanda/mengxu/new-backup-format-PR
FastRestore: Integrate with new backup format
2020-03-25 15:47:55 -07:00
Jingyu Zhou 0f57bf9685 Remove a SevError event
The same mutation can be present in overlapping mutation logs. Thus we cannot
assert its absence. This can happen for multiple reasons. One possibility
is that new TLogs can copy mutations from old generation TLogs; another is
that a backup worker is recruited without knowing previously saved progress.
2020-03-25 15:23:21 -07:00
Meng Xu 1ba11dc74b Apply clang format 2020-03-25 11:20:17 -07:00
Meng Xu 120272f025 Change unlockDB from RestoreMaster to Agent 2020-03-25 11:04:49 -07:00
Jingyu Zhou 472f7bdd32 Rename a trace event to avoid confusion
Change from BackupRange to BackupVersionRange.
2020-03-25 11:03:05 -07:00
Evan Tschannen e0fbd9ecbe
Merge pull request #2847 from atn34/atn34/assert-no-return
Assert recoverAndEndEpoch does not become ready
2020-03-25 10:23:38 -07:00
Jingyu Zhou e2f317a0da Fix a crash failure 2020-03-25 09:18:49 -07:00
Jingyu Zhou 00fb4c1a35 Fix an off-by-one error
Backup worker's saved version should start from its startVersion - 1, i.e.,
the startVersion is not saved yet. Otherwise, if the version range is just
the startVersion itself and there is no data, then the range [startVersion,
startVersion + 1) will be missing. This causes non-continuous partitioned logs.
2020-03-24 23:40:36 -07:00
Meng Xu ca8966a28b Move lockDB into submitRestore request from restore worker
AtomicRestore needs to lock DB before we start the restore worker.
So we cannot lock DB in restore worker with a different randomUID.
2020-03-24 23:39:35 -07:00
Meng Xu 6a8d6ddb8e Introduce ParallelRestoreApiCorrectnessAtomicRestore.txt test
This covers ApiCorrectnessTest as a workload for parallel restore.
2020-03-24 22:30:51 -07:00
Jingyu Zhou 669916467e Add missing transaction reset call 2020-03-24 20:14:37 -07:00
Jingyu Zhou 5e729a5bcf Merge branch 'master' of https://github.com/apple/foundationdb into backup-worker-bak 2020-03-24 19:54:36 -07:00
Jingyu Zhou edcbeb8992 Address review comments
Move transaction object outside of the loop and rename trace events.
2020-03-24 18:22:20 -07:00
Meng Xu b173929316 Add atomicParallelRestore to AtomicRestore workload 2020-03-24 15:58:49 -07:00
Meng Xu 81f7181c9e Refactor submitParallelRestore function into FileBackupAgent 2020-03-24 14:44:55 -07:00
Meng Xu 5584884c12 Refactor parallelRestoreFinish function into FileBackupAgent 2020-03-24 14:15:15 -07:00
Jingyu Zhou a3058e7d96 Fix incorrectly marking a backup job as stopped
This causes missing version ranges for mutation logs.
2020-03-23 22:05:58 -07:00
Jingyu Zhou 1155304cd5 Remove a spurious assertion
It's possible that there is a gap between backup's contiguousLogEnd and snapshot
version.
2020-03-23 21:39:40 -07:00
Jingyu Zhou 82a1790776 Fix backup worker crash due to aborted backup job
If a backup job is aborted, the "startedBackupWorkers" key can be cleared, thus
triggering the assertion failure.
2020-03-23 21:11:25 -07:00
Jingyu Zhou 243d078596 Fix off-by-one error
Epoch end version is saved version + 1, so need +1 for minBackupVersion.
2020-03-23 20:44:31 -07:00
Jingyu Zhou f1d7fbafb4 Stop actors for displaced backup workers
If the worker is displaced, it should not update backup containers.
2020-03-23 18:48:06 -07:00
Jingyu Zhou dd90845277 Fix assert failure
Should be backup's contiguousLogEnd > maxRestorableVersion.
2020-03-23 14:49:05 -07:00
Jingyu Zhou 196127fb92 Address review comments 2020-03-23 14:15:36 -07:00
Jingyu Zhou fd7643c322 Remove a variable 2020-03-23 13:45:48 -07:00
Jingyu Zhou 90b40e1d75 Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak
Resolve Conflicts:
	fdbclient/BackupAgent.actor.h
	fdbserver/BackupWorker.actor.cpp
	fdbserver/RestoreMaster.actor.cpp
	fdbserver/masterserver.actor.cpp
2020-03-23 13:35:33 -07:00
Meng Xu be67ab4d6a Correct comment based on review 2020-03-23 12:53:40 -07:00
Jingyu Zhou f0f4e42a4c Add removal for backupWorkerCache 2020-03-23 12:47:42 -07:00
Andrew Noyes fa8eaf9810 Assert recoverAndEndEpoch does not become ready 2020-03-23 12:40:00 -07:00
Meng Xu 0fcd6c98d4 Include simulator.h to RestoreWorker 2020-03-23 11:34:02 -07:00
Meng Xu 48db54424f Add assassination workload to restore test workload
Add assert to ensure restore worker is reliable and not killed.
2020-03-23 11:11:13 -07:00
Meng Xu 51047a6c1d Protect restore worker from assassination in simulation 2020-03-23 11:06:40 -07:00
Meng Xu 3f31ebf659 New backup: Revise event name and explain code 2020-03-23 10:55:44 -07:00
Jingyu Zhou a8c2acdba0 Count the unique number of tags in startedBackupWorkers 2020-03-23 10:44:26 -07:00
Jingyu Zhou 658504bc66 Add a cache to handle repeated delivery of backup recruitment messages 2020-03-23 10:22:24 -07:00
Jingyu Zhou 1552653f1c Backup Worker: Cancel the actor when container is stopped 2020-03-22 21:08:11 -07:00
Jingyu Zhou 33ea027f84 Make sure only current epoch's backup workers update all workers
So that backup workers from old epochs don't mess with the list of all workers.
2020-03-22 18:28:22 -07:00
Jingyu Zhou 44c1996950 Change all worker started to be set after all workers updated a key
Previously, all worker started was set when saved log versions were higher.
However, saving the versions can be wrong, as the worker is not guaranteed to
write to the right container. For instance, if the watch is triggered later,
then mutation logs are written to previous containers. So we need to ensure the
right container is ready -- all workers have acknowledged seeing the container.
2020-03-22 16:40:12 -07:00
Jingyu Zhou 97702d91c8 Skip recruiting backup workers for older epochs before min backup version
When master starts recruiting backup workers, if there is no active backup job
or the min version of the backup job is greater than old epoch's end version,
then these old epochs can be skipped.
2020-03-21 13:44:02 -07:00
Jingyu Zhou 0eacf1cdab trackTlogRecovery listens on backup worker change events
Old TLogs can only be removed when backup workers no longer need them (i.e., the
oldest backup epoch == current epoch). As a result, the core state changes need
to include backup worker changes, which update the oldest backup epoch.
2020-03-20 20:17:32 -07:00
Jingyu Zhou 818072f3cb Set oldest backup epoch if not recruiting backup workers
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 0fe2810425 Fix repeated backup progress checking in backup worker
The delay is not used, which caused repeated progress checking in worker 0.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 4a499a3c97 Remove backup worker's first and last pop
The first pop of current epoch can pop old epoch's data before they are saved.
The last pop of a stopped backup worker should be skipped so that after
recovery, the data is still accessible in case the last epoch's progress saving
transaction is delayed.
2020-03-20 20:16:43 -07:00