Commit Graph

9368 Commits

Author SHA1 Message Date
Meng Xu 85a9f6ab96 Merge branch 'master' into mengxu/fr-enable-rollback-and-abort-test-PR 2020-03-23 16:32:00 -07:00
Meng Xu 14c641ce2b Enable rollback workload 2020-03-23 16:29:01 -07:00
Jingyu Zhou cede1500cd
Merge pull request #2848 from xumengpanda/mengxu/fr-multi-cycle-test-PR
Performant restore [22/xx]: Introduce multiple cycle tests for the restore
2020-03-23 16:27:15 -07:00
Meng Xu 4756447b74 Enable abortAndRestartAfter option for restore cycle test 2020-03-23 15:15:14 -07:00
Meng Xu f936adedb0 Change tab to space for test workload file 2020-03-23 15:10:35 -07:00
Jingyu Zhou dd90845277 Fix assert failure
Should be backup's contiguousLogEnd > maxRestorableVersion.
2020-03-23 14:49:05 -07:00
Jingyu Zhou 196127fb92 Address review comments 2020-03-23 14:15:36 -07:00
Meng Xu d0bce1a105 Add ParallelRestoreCorrectnessCycle test into CMakeList 2020-03-23 13:56:33 -07:00
Meng Xu 9c2c3b26d3 Merge branch 'master' into mengxu/fr-multi-cycle-test-PR 2020-03-23 13:54:20 -07:00
Jingyu Zhou 9a50458a64
Merge pull request #2846 from xumengpanda/mengxu/fr-add-attrition-to-test-PR
Performant restore [21/xx]: Enable assassination workload in restore test
2020-03-23 13:52:01 -07:00
Meng Xu 00ada7e086 Enable attrition workload for restore multi cycle tests 2020-03-23 13:50:05 -07:00
Jingyu Zhou fd7643c322 Remove a variable 2020-03-23 13:45:48 -07:00
Jingyu Zhou 90b40e1d75 Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak
Resolve Conflicts:
	fdbclient/BackupAgent.actor.h
	fdbserver/BackupWorker.actor.cpp
	fdbserver/RestoreMaster.actor.cpp
	fdbserver/masterserver.actor.cpp
2020-03-23 13:35:33 -07:00
Meng Xu ace19eefe4 Introduce multi cycle tests for fast restore 2020-03-23 12:59:43 -07:00
Meng Xu be67ab4d6a Correct comment based on review 2020-03-23 12:53:40 -07:00
Jingyu Zhou f0f4e42a4c Add removal for backupWorkerCache 2020-03-23 12:47:42 -07:00
Andrew Noyes fa8eaf9810 Assert recoverAndEndEpoch does not become ready 2020-03-23 12:40:00 -07:00
Meng Xu 5b2b2a8767 Enable assassination for restore cycle test 2020-03-23 12:28:04 -07:00
Meng Xu 0fcd6c98d4 Include simulator.h to RestoreWorker 2020-03-23 11:34:02 -07:00
Meng Xu 48db54424f Add assassination workload to restore test workload
Add assert to ensure restore worker is reliable and not killed.
2020-03-23 11:11:13 -07:00
Meng Xu 51047a6c1d Protect restore worker from assassination in simulation 2020-03-23 11:06:40 -07:00
Meng Xu 3f31ebf659 New backup: Revise event name and explain code 2020-03-23 10:55:44 -07:00
Jingyu Zhou a8c2acdba0 Count the unique number of tags in startedBackupWorkers 2020-03-23 10:44:26 -07:00
Jingyu Zhou 658504bc66 Add a cache to handle repeated delivery of backup recruitment messages 2020-03-23 10:22:24 -07:00
Jingyu Zhou 1552653f1c Backup Worker: Cancel the actor when container is stopped 2020-03-22 21:08:11 -07:00
Jingyu Zhou 33ea027f84 Make sure only current epoch's backup workers update all workers
So that backup workers from old epochs don't mess with the list of all workers.
2020-03-22 18:28:22 -07:00
Jingyu Zhou 44c1996950 Change all worker started to be set after all workers updated a key
Previously, "all workers started" was set once the saved log versions were
higher. However, relying on saved versions can be wrong, as a worker is not
guaranteed to write to the right container. For instance, if the watch is
triggered later, mutation logs are written to previous containers. So we need
to ensure the right container is ready -- all workers have acknowledged seeing
the container.
2020-03-22 16:40:12 -07:00
Jingyu Zhou 97702d91c8 Skip recruiting backup workers for older epochs before min backup version
When master starts recruiting backup workers, if there is no active backup job
or the min version of the backup job is greater than old epoch's end version,
then these old epochs can be skipped.
2020-03-21 13:44:02 -07:00
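The skip rule in the commit above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FoundationDB implementation: `epochs_needing_workers`, `old_epochs`, and `min_backup_version` are hypothetical names, and `None` stands in for "no active backup job".

```python
from typing import List, Optional, Tuple

def epochs_needing_workers(
    old_epochs: List[Tuple[int, int]],
    min_backup_version: Optional[int],
) -> List[int]:
    """Return the old epochs that still need backup workers.

    old_epochs holds (epoch, end_version) pairs (hypothetical layout).
    """
    # No active backup job: every old epoch can be skipped.
    if min_backup_version is None:
        return []
    # Skip an epoch when the backup starts after that epoch's end version.
    return [
        epoch
        for epoch, end_version in old_epochs
        if not min_backup_version > end_version
    ]
```

For example, with old epochs ending at versions 100, 200, and 300 and a minimum backup version of 150, only the first epoch is skipped.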
Jingyu Zhou 0eacf1cdab trackTlogRecovery listens on backup worker change events
Old TLogs can only be removed when backup workers no longer need them (i.e., the
oldest backup epoch == current epoch). As a result, the core state changes need
to include backup worker changes, which update the oldest backup epoch.
2020-03-20 20:17:32 -07:00
Jingyu Zhou 818072f3cb Set oldest backup epoch if not recruiting backup workers
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 0fe2810425 Fix repeated backup progress checking in backup worker
The delay is not used, which caused repeated progress checking in worker 0.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 4a499a3c97 Remove backup worker's first and last pop
The first pop of current epoch can pop old epoch's data before they are saved.
The last pop of a stopped backup worker should be skipped so that after
recovery, the data is still accessible in case the last epoch's progress saving
transaction is delayed.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 9d6de758a7 Backup Worker: Give a chance of saving progress before displaced
Move the exit loop after the saving of progress so that when doneTrigger is
active, we won't exit the loop immediately.
2020-03-20 20:16:10 -07:00
Jingyu Zhou 5359528132 Reduce a call to getLogSystemConfig() 2020-03-20 20:15:09 -07:00
Jingyu Zhou 6b0d2923e7 Add target version as the limit for version batches
If using partitioned logs, the mutations after the target version can be
included if this limit is not considered.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 08173951bc Add an exitEarly flag for backup worker
If a backup worker is on an old epoch, it can exit early if either of the
following is true:
- there are no backups
- all backups start at a version >= the endVersion

If this flag is set, the backup worker exits without doing any work, which
signals the master to update the oldest backup epoch.
2020-03-20 20:15:09 -07:00
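The exit-early condition described above can be sketched in a few lines. This is an illustrative Python sketch only; `should_exit_early`, `backup_start_versions`, and `end_version` are hypothetical names and do not come from the commit.

```python
from typing import Iterable

def should_exit_early(backup_start_versions: Iterable[int], end_version: int) -> bool:
    """Return True when an old-epoch backup worker has no work to do."""
    # all() over an empty iterable is True, which covers the "no backups" case;
    # otherwise every backup must start at or after the epoch's end version.
    return all(start >= end_version for start in backup_start_versions)
```

Both conditions from the commit message collapse into one check, because an empty set of backups trivially satisfies "all backups start at a version >= the endVersion".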
Jingyu Zhou e1737fc644 Skip setting backupStartedKey if using old mutation logs
For old submitBackup(), where partitionedLog is false, do not set the
backupStartedKey in BackupConfig, which signals backup workers to skip these
backups.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 5b36dcaad5 Fix oldest backup epoch for backup workers
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-20 20:15:09 -07:00
Jingyu Zhou fea6155714 StagingKey uses mutation instead of a vector of mutations for each log version
Because each log version contains commit version and subsequence number, each
key can only have one mutation for its log version. This simplifies
StagingKey::add() a lot.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 4bdb32be14 Batch sending all mutations of a version from RestoreLoader
This optimization reduces the number of messages sent from loader to applier.
That number grew unintentionally when sub sequence numbers were introduced
for mutations.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 4065ca2a65 Fix duplicated mutation in StagingKey
For a reason that is not yet understood, duplicated mutations can be added to
StagingKey and need to be filtered out. Otherwise, atomic operations can
result in corrupted data in the database.
2020-03-20 20:15:09 -07:00
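Why a duplicated mutation corrupts atomic operations can be shown with a small sketch. This is a simplified Python model, not the real StagingKey: `StagingKeySketch`, its methods, and the use of integer deltas as stand-ins for atomic-add mutations are all assumptions made for illustration.

```python
from typing import Dict, Tuple

class StagingKeySketch:
    """Holds at most one pending mutation per (commit version, subsequence)."""

    def __init__(self) -> None:
        # (commit_version, subsequence) -> atomic-add delta (hypothetical model)
        self.pending: Dict[Tuple[int, int], int] = {}

    def add(self, commit_version: int, subsequence: int, delta: int) -> bool:
        """Record a mutation; return False if it is a duplicate and was dropped."""
        log_version = (commit_version, subsequence)
        if log_version in self.pending:
            return False  # same log version seen before: filter it out
        self.pending[log_version] = delta
        return True

    def apply_adds(self, initial: int = 0) -> int:
        """Apply the pending atomic adds in log-version order."""
        total = initial
        for _, delta in sorted(self.pending.items()):
            total += delta
        return total
```

Without the duplicate check, delivering the same atomic add twice would add its delta twice, which is the kind of corruption the commit guards against.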
Jingyu Zhou 799f0b4b0e Small code refactor 2020-03-20 20:15:09 -07:00
Jingyu Zhou e40f937d3a Fix missing mutations in splitMutation
When a range mutation is larger than the last split point, this mutation can
go missing in the RestoreLoader, which is fixed in this commit.
2020-03-20 20:15:09 -07:00
Jingyu Zhou b18f192831 Fix decode bug of missing mutations
After reading a new block, all mutations are sorted by version again, which
can invalidate the previous tuple. As a result, the decoded file will miss some
of the mutations.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 9ea549ba7d Update latest backup worker progress after all previous epochs are done
If workers for previous epochs are still ongoing, we may end up with a
container that misses mutations from previous epochs. So the update only
happens once only the current epoch's backup workers remain.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 4c75c61f39 Fix duplicate file removal for subset version ranges
Partitioned logs can have strict subset version ranges, which was not properly
handled -- we used to assume overlapping only happens for the same begin
version.
2020-03-20 20:15:09 -07:00
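One way to drop version ranges that are covered by another range, regardless of whether they share a begin version, is sketched below. This is a Python illustration under assumptions: `remove_covered_ranges` is a hypothetical name, ranges are modelled as `(begin, end)` integer pairs, and the real code operates on log files rather than bare tuples.

```python
from typing import List, Tuple

def remove_covered_ranges(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Drop exact duplicates and ranges fully contained in another range."""
    kept: List[Tuple[int, int]] = []
    # Sort by begin ascending, then end descending, so that a covering range
    # is always visited before the ranges it contains.
    for begin, end in sorted(set(ranges), key=lambda r: (r[0], -r[1])):
        # kept[-1] has the largest end so far and a begin <= this begin,
        # so an end within it means this range is a subset.
        if kept and end <= kept[-1][1]:
            continue
        kept.append((begin, end))
    return kept
```

Ranges that only partially overlap are kept, since neither one makes the other redundant.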
Jingyu Zhou 1a1f572f29 Fix a time gap for monitoring backup keys
Backup worker starts by checking whether there are backup keys and then runs
the monitorBackupKeyOrPullData() loop, which does the check again. The second
check can be delayed, which causes the loop to perform NOOP pops. The fix
removes this second check and uses the result of the first check to decide
what to do in the loop.
2020-03-20 20:15:09 -07:00
Jingyu Zhou c63493c34f Allow overlapped versions in partitioned logs
The overlap can only happen between two generations, where the range from the
known committed version to the recovery version is copied from the old
generation to the new generation. Within a generation, there is no overlap.

The fix here is related to the calculation of continuous version ranges,
allowing the overlap to happen.
2020-03-20 20:15:09 -07:00
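A continuity check that tolerates overlapping ranges can be sketched as follows. This is a Python illustration only: `contiguous_end` is a hypothetical name, and treating each range as `(begin, end)` with an exclusive end is an assumption, not something stated in the commit.

```python
from typing import List, Tuple

def contiguous_end(ranges: List[Tuple[int, int]], start: int) -> int:
    """Return how far versions are covered without a gap, beginning at start."""
    end = start
    for begin, stop in sorted(ranges):
        if begin > end:
            break  # a gap: nothing covers the versions in [end, begin)
        # Overlap (begin < end) is allowed; only extend the covered end.
        end = max(end, stop)
    return end
```

A stricter check that required `begin == end` would reject the overlap between two generations, whereas `begin <= end` accepts it and still stops at the first real gap.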
Jingyu Zhou fa7c8d8bb3 Add done trigger so that backup progress can be set
Otherwise, when there are no mutations for the unfinished range, the empty file
may not be created when the worker is displaced, thus leaving holes in version
ranges.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 4f4ce93f8c Remove debug print out 2020-03-20 20:15:09 -07:00