Jingyu Zhou
6be913a430
Add partitioned logs option to AtomicRestore workload
2020-03-26 13:04:00 -07:00
Jingyu Zhou
aca458cd96
Set 50% chance to restore old backup files for fast restore
2020-03-26 13:04:00 -07:00
Jingyu Zhou
99f4ef6e0c
Fix restore loader to handle mutation sub number
...
For old backup format, give them a sub sequence number starting from 0 for each
commit version.
2020-03-26 13:04:00 -07:00
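Illustrative note on 99f4ef6e0c (not part of the commit): the numbering rule described above can be sketched as follows. This is a simplified model, not the restore loader's actual code; the function name `assign_sub_sequence` and the `(commit_version, payload)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def assign_sub_sequence(mutations):
    """Attach a per-version sub sequence number to each mutation.

    `mutations` is an iterable of (commit_version, payload) pairs in log
    order. The counter restarts at 0 for every distinct commit version.
    """
    next_sub = defaultdict(int)  # commit_version -> next sub number to hand out
    numbered = []
    for version, payload in mutations:
        sub = next_sub[version]
        next_sub[version] += 1
        numbered.append((version, sub, payload))
    return numbered
```

With such numbering, mutations from the old format can be ordered by the pair (commit version, sub number).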
Jingyu Zhou
40b17e1e9b
Remove a no longer used knob
2020-03-26 13:04:00 -07:00
Jingyu Zhou
772ab70aee
Add an option for fast restore to restore old backups
...
If "usePartitionedLogs" is set to false, then the workload uses old backups for
restore.
2020-03-26 13:04:00 -07:00
Meng Xu
1052b23ee1
Merge pull request #2370 from atn34/test-watch-outliving-transaction
...
Test watch outliving transaction
2020-03-26 12:40:38 -07:00
Andrew Noyes
cdb6bbfc85
Test watch outliving transaction
2020-03-26 10:09:03 -07:00
Jingyu Zhou
feedab02a0
Merge pull request #2855 from xumengpanda/mengxu/fr-api-atomicrestore-PR
...
Add ApiCorrectnessAtomicRestore workload for the new performant restore
2020-03-25 18:05:26 -07:00
Evan Tschannen
bb5799bd20
Merge pull request #2642 from xumengpanda/mengxu/new-backup-format-PR
...
FastRestore:Integrate with new backup format
2020-03-25 15:47:55 -07:00
Jingyu Zhou
0f57bf9685
Remove a SevError event
...
The same mutation can be present in overlapping mutation logs, so we cannot
assert its absence. This can happen for multiple reasons. One possibility
is that new TLogs can copy mutations from old generation TLogs; another
is that a backup worker is recruited without knowing previously saved progress.
2020-03-25 15:23:21 -07:00
Meng Xu
1ba11dc74b
Apply clang format
2020-03-25 11:20:17 -07:00
Meng Xu
120272f025
Change unlockDB from RestoreMaster to Agent
2020-03-25 11:04:49 -07:00
Jingyu Zhou
472f7bdd32
Rename a trace event to avoid confusion
...
Change from BackupRange to BackupVersionRange.
2020-03-25 11:03:05 -07:00
Evan Tschannen
e0fbd9ecbe
Merge pull request #2847 from atn34/atn34/assert-no-return
...
Assert recoverAndEndEpoch does not become ready
2020-03-25 10:23:38 -07:00
Jingyu Zhou
e2f317a0da
Fix a crash failure
2020-03-25 09:18:49 -07:00
Jingyu Zhou
00fb4c1a35
Fix an off by one error
...
Backup worker's saved version should start from its startVersion - 1, i.e.,
the startVersion is not saved yet. Otherwise, if the version range is just
the startVersion itself and there is no data, then the range [startVersion,
startVersion + 1) will be missing. This causes non-continuous partitioned logs.
2020-03-24 23:40:36 -07:00
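Illustrative note on 00fb4c1a35 (not part of the commit): a minimal sketch of why the initial saved version matters, under the assumption that a worker's log covers the half-open range [savedVersion + 1, lastVersion + 1). The helper `covered_range` is hypothetical and only models the arithmetic in the message above.

```python
def covered_range(initial_saved_version, last_version):
    """Return the half-open version range a worker's log covers, or None.

    Assumes coverage starts right after the last saved version and runs
    through `last_version` inclusive.
    """
    begin = initial_saved_version + 1
    end = last_version + 1
    return (begin, end) if begin < end else None


start_version = 100

# Saved version initialized to startVersion - 1: the single-version range
# [startVersion, startVersion + 1) is covered even when it holds no data.
fixed = covered_range(start_version - 1, start_version)

# Saved version initialized to startVersion: the same range is empty,
# leaving a gap in the partitioned logs.
buggy = covered_range(start_version, start_version)
```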
Meng Xu
ca8966a28b
Move lockDB into submitRestore request from restore worker
...
AtomicRestore needs to lock DB before we start the restore worker.
So we cannot lock DB in restore worker with a different randomUID.
2020-03-24 23:39:35 -07:00
Meng Xu
6a8d6ddb8e
Introduce ParallelRestoreApiCorrectnessAtomicRestore.txt test
...
This covers ApiCorrectnessTest as workload for parallel restore.
2020-03-24 22:30:51 -07:00
Jingyu Zhou
669916467e
Add missing transaction reset call
2020-03-24 20:14:37 -07:00
Jingyu Zhou
5e729a5bcf
Merge branch 'master' of https://github.com/apple/foundationdb into backup-worker-bak
2020-03-24 19:54:36 -07:00
Jingyu Zhou
edcbeb8992
Address review comments
...
Move transaction object outside of the loop and rename trace events.
2020-03-24 18:22:20 -07:00
Meng Xu
b173929316
Add atomicParallelRestore to AtomicRestore workload
2020-03-24 15:58:49 -07:00
Meng Xu
81f7181c9e
Refactor submitParallelRestore function into FileBackupAgent
2020-03-24 14:44:55 -07:00
Meng Xu
5584884c12
Refactor parallelRestoreFinish function into FileBackupAgent
2020-03-24 14:15:15 -07:00
Jingyu Zhou
a3058e7d96
Fix incorrectly marking a backup job as stopped
...
This causes missing version ranges for mutation logs.
2020-03-23 22:05:58 -07:00
Jingyu Zhou
1155304cd5
Remove a spurious assertion
...
It's possible that there is a gap between backup's contiguousLogEnd and snapshot
version.
2020-03-23 21:39:40 -07:00
Jingyu Zhou
82a1790776
Fix backup worker crash due to aborted backup job
...
If a backup job is aborted, the "startedBackupWorkers" key can be cleared, thus
triggering the assertion failure.
2020-03-23 21:11:25 -07:00
Jingyu Zhou
243d078596
Fix off by one error
...
Epoch end version is saved version + 1, so need +1 for minBackupVersion.
2020-03-23 20:44:31 -07:00
Jingyu Zhou
f1d7fbafb4
Stop actors for displaced backup workers
...
If the worker is displaced, it should not update backup containers.
2020-03-23 18:48:06 -07:00
Jingyu Zhou
dd90845277
Fix assert failure
...
Should be backup's contiguousLogEnd > maxRestorableVersion.
2020-03-23 14:49:05 -07:00
Jingyu Zhou
196127fb92
Address review comments
2020-03-23 14:15:36 -07:00
Jingyu Zhou
fd7643c322
Remove a variable
2020-03-23 13:45:48 -07:00
Jingyu Zhou
90b40e1d75
Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak
...
Resolve Conflicts:
fdbclient/BackupAgent.actor.h
fdbserver/BackupWorker.actor.cpp
fdbserver/RestoreMaster.actor.cpp
fdbserver/masterserver.actor.cpp
2020-03-23 13:35:33 -07:00
Meng Xu
be67ab4d6a
Correct comment based on review
2020-03-23 12:53:40 -07:00
Jingyu Zhou
f0f4e42a4c
Add removal for backupWorkerCache
2020-03-23 12:47:42 -07:00
Andrew Noyes
fa8eaf9810
Assert recoverAndEndEpoch does not become ready
2020-03-23 12:40:00 -07:00
Meng Xu
0fcd6c98d4
Include simulator.h to RestoreWorker
2020-03-23 11:34:02 -07:00
Meng Xu
48db54424f
Add assassination workload to restore test workload
...
Add assert to ensure restore worker is reliable and not killed.
2020-03-23 11:11:13 -07:00
Meng Xu
51047a6c1d
Protect restore worker from assassination in simulation
2020-03-23 11:06:40 -07:00
Meng Xu
3f31ebf659
New backup:Revise event name and explain code
2020-03-23 10:55:44 -07:00
Jingyu Zhou
a8c2acdba0
Count the unique number of tags in startedBackupWorkers
2020-03-23 10:44:26 -07:00
Jingyu Zhou
658504bc66
Add a cache to handle repeated delivery of backup recruitment messages
2020-03-23 10:22:24 -07:00
Jingyu Zhou
1552653f1c
Backup Worker: Cancel the actor when container is stopped
2020-03-22 21:08:11 -07:00
Jingyu Zhou
33ea027f84
Make sure only current epoch's backup workers update all workers
...
So that backup workers from old epochs don't mess with the list of all workers.
2020-03-22 18:28:22 -07:00
Jingyu Zhou
44c1996950
Change all worker started to be set after all workers updated a key
...
Previously, all worker started was set when saved log versions were higher.
However, relying on saved versions can be wrong, as the worker is not guaranteed
to write to the right container. For instance, if the watch is triggered later,
then mutation logs are written to previous containers. So we need to ensure the
right container is ready, i.e., all workers have acknowledged seeing the container.
2020-03-22 16:40:12 -07:00
Jingyu Zhou
97702d91c8
Skip recruiting backup workers for older epochs before min backup version
...
When master starts recruiting backup workers, if there is no active backup job
or the min version of the backup job is greater than old epoch's end version,
then these old epochs can be skipped.
2020-03-21 13:44:02 -07:00
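Illustrative note on 97702d91c8 (not part of the commit): the skip rule above can be sketched as a filter over old epochs. This follows the wording of the message literally and is not the master's actual recruitment code; `epochs_to_recruit`, the `(epoch, end_version)` pairs, and the use of `None` for "no active backup job" are assumptions for the example.

```python
def epochs_to_recruit(old_epochs, min_backup_version):
    """Select the old epochs that still need backup workers.

    `old_epochs` is a list of (epoch, end_version) pairs.
    `min_backup_version` is the minimum version of the active backup job,
    or None when there is no active backup job.
    """
    if min_backup_version is None:
        return []  # no active backup job: every old epoch can be skipped
    # An old epoch is skipped when the backup's min version is greater than
    # the epoch's end version; otherwise it is kept.
    return [epoch for epoch, end_version in old_epochs
            if min_backup_version <= end_version]
```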
Jingyu Zhou
0eacf1cdab
trackTlogRecovery listens on backup worker change events
...
Old TLogs can only be removed when backup workers no longer need them (i.e., the
oldest backup epoch == current epoch). As a result, the core state changes need
to include backup worker changes, which update the oldest backup epoch.
2020-03-20 20:17:32 -07:00
Jingyu Zhou
818072f3cb
Set oldest backup epoch if not recruiting backup workers
...
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
2020-03-20 20:16:43 -07:00
Jingyu Zhou
0fe2810425
Fix repeated backup progress checking in backup worker
...
The delay is not used, which caused repeated progress checking in worker 0.
2020-03-20 20:16:43 -07:00
Jingyu Zhou
4a499a3c97
Remove backup worker's first and last pop
...
The first pop of current epoch can pop old epoch's data before they are saved.
The last pop of a stopped backup worker should be skipped so that after
recovery, the data is still accessible in case the last epoch's progress saving
transaction is delayed.
2020-03-20 20:16:43 -07:00