Commit Graph

275 Commits

Author SHA1 Message Date
Jingyu Zhou 0823091423 Fix backup worker removal races with setting
The master waits for all backup worker recruitment to finish and then sets the
workers in a batch. However, a backup worker could remove itself before the
master sets it. As a result, the worker is not removed, the oldest backup epoch
can't advance, and the TLog can't be popped.
2020-04-20 11:06:46 -07:00
Jingyu Zhou 280bc94738 Do not recruit backup workers with wrong tags
In a rare scenario, the master can recruit backup workers with more tags than
the number of log router tags for an epoch. This can be caused by an
unsuccessful recovery, which uses more tags than the next epoch. When
recruiting for the next epoch, if no progress has been made yet, the recruiting
logic will look back at the previous epoch. If the previous epoch has saved past
this epoch's begin version, the current epoch's progress is updated with that
information, which can result in more tags being inserted into this epoch's
recruitment.
2020-03-28 21:19:41 -07:00
Jingyu Zhou 90b40e1d75 Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak
Resolve Conflicts:
	fdbclient/BackupAgent.actor.h
	fdbserver/BackupWorker.actor.cpp
	fdbserver/RestoreMaster.actor.cpp
	fdbserver/masterserver.actor.cpp
2020-03-23 13:35:33 -07:00
Meng Xu 3f31ebf659 New backup:Revise event name and explain code 2020-03-23 10:55:44 -07:00
Jingyu Zhou 0eacf1cdab trackTlogRecovery listens on backup worker change events
Old TLogs can only be removed when backup workers no longer need them (i.e., the
oldest backup epoch == current epoch). As a result, the core state changes need
to include backup worker changes, which update the oldest backup epoch.
2020-03-20 20:17:32 -07:00
Jingyu Zhou 818072f3cb Set oldest backup epoch if not recruiting backup workers
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
2020-03-20 20:16:43 -07:00
Jingyu Zhou 5b36dcaad5 Fix oldest backup epoch for backup workers
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-20 20:15:09 -07:00
Jingyu Zhou 12ed8ad536 Fix backup worker start version when logset start version is lower
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
2020-03-20 20:15:08 -07:00
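A minimal sketch of the start-version rule described in commit 12ed8ad536, not the actual FoundationDB code. The function name `backupStartVersion` and the `Version` alias are hypothetical; the sketch assumes versions are 64-bit integers and that the rule reduces to taking the larger of the two versions.

```cpp
#include <algorithm>
#include <cstdint>

using Version = int64_t; // assumption: versions are 64-bit integers

// Hypothetical helper: choose the start version for a new epoch's backup worker.
// If the log set starts before the previous epoch ended, clamp the start to the
// previous epoch's end version so that version ranges of backup workers do not overlap.
Version backupStartVersion(Version logSetStartVersion, Version lastEpochEndVersion) {
    return std::max(logSetStartVersion, lastEpochEndVersion);
}
```

With a log set start of 90 and a previous epoch end of 100, the sketch returns 100; with a log set start of 120 it returns 120 unchanged.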
Meng Xu 94276076de BackupWorker:Buggify upload delay
Add questions to code as well.
2020-03-18 19:04:45 -07:00
Jingyu Zhou 19f6394dc9 Fix oldest backup epoch for backup workers
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-18 16:44:17 -07:00
Jingyu Zhou 89d8f13038 Fix backup worker start version when logset start version is lower
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
2020-03-18 16:41:35 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen 04052226df reverting a change which causes data inconsistency between the primary and secondary 2020-03-17 09:41:44 -07:00
Evan Tschannen 0ca89547a5 make sure the number of logRouterTags is larger than the number of satelliteTLogs to avoid having satellites with no data. 2020-03-14 15:02:19 -07:00
Evan Tschannen 4640edf5d6 do not recruit satellite tlogs when usable regions=1 2020-03-13 10:24:52 -07:00
Jingyu Zhou 52c6737411 Rename backupLoggingEnabled as backupWorkerEnabled
To highlight the backup changes for 7.0. By default, the
backup_worker_enabled flag is set for version 7.0.
2020-02-04 10:09:16 -08:00
Jingyu Zhou 0db03f1d3c Use backup_logging_enabled flag
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Jingyu Zhou 38aa1903fd Add a DB configuration option for backup workers
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.

The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is enabled will starting a backup wait for all backup workers
to be started.
2020-01-31 19:29:09 -08:00
Jingyu Zhou e9c7ad82cc Comment out pseudo tag pop trace event 2020-01-31 19:29:09 -08:00
Jingyu Zhou 39fbacbc4f Address review comments 2020-01-22 19:43:40 -08:00
Jingyu Zhou 1eaea91cb3 Address review comments 2020-01-22 19:42:13 -08:00
Jingyu Zhou dcd0a46bc6 Fix a rare remote recovery bug
This bug was introduced when I added log router tags unconditionally to any
configurations. In newEpoch(), the wait for remote recovery is conditioned on
"logRouterTags == 0", which always becomes false. Thus remote recovery was not
performed and remote TLogs won't copy data from previous epoch's TLogs
(previous epoch is a single region configuration). As a result, storage servers
cannot peek/get the data, and won't pop tags. Thus, waitForFullReplication()
became stuck and the test eventually timed out.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 56a2c37071 Recruit backup workers for single region
Enable log router tags for single region, which are popped by backup workers.
Need to add noop for backup workers if there are no active backups.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 4ed75e37f3 BackupProgress uses old epoch's begin version if no progress found
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 42430e8f5e Add epochBegin version to OldTLogCoreData/OldLogData/OldTLogConf
This is to simplify the backup process so that whenever there is an old epoch
in the log system, we always know its begin version and can back up from that
version if no progress is known for that old epoch.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 64052f6349 Check and fill backup gaps for old epochs and tags
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).

The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
2020-01-22 19:38:46 -08:00
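The gap check described in commit 64052f6349 can be sketched as follows. This is an illustration under stated assumptions, not the code in BackupProgress.actor.*: the names `BackupRange` and `unfinishedRanges` are hypothetical, tags are modeled as integers `0..totalTags-1`, and recorded progress is modeled as "the first version not yet saved" per tag. A tag with no recorded progress gets the whole range [startVersion, endVersion).

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using Version = int64_t; // assumption: versions are 64-bit integers

// Hypothetical record of work still to be done for one tag: [begin, end).
struct BackupRange {
    int tag;
    Version begin;
    Version end;
};

// For an old epoch covering [startVersion, endVersion), compute what each tag
// still has to back up. nextUnsaved maps a tag to the first version it has not
// saved yet; tags missing from the map have no known progress and therefore
// get the whole epoch range.
std::vector<BackupRange> unfinishedRanges(int totalTags, Version startVersion, Version endVersion,
                                          const std::map<int, Version>& nextUnsaved) {
    std::vector<BackupRange> ranges;
    for (int tag = 0; tag < totalTags; ++tag) {
        auto it = nextUnsaved.find(tag);
        Version begin = (it == nextUnsaved.end()) ? startVersion : std::max(startVersion, it->second);
        if (begin < endVersion) {
            ranges.push_back({ tag, begin, endVersion });
        }
    }
    return ranges;
}
```

For an epoch [100, 200) with three tags, where tag 0 has saved up to 150, tag 1 has finished, and tag 2 has no recorded progress, the sketch yields [150, 200) for tag 0 and the full [100, 200) for tag 2.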
Jingyu Zhou 52bdaeee39 Do not save backup workers to core state and back
Each master starts from an empty set of backup workers and recruits a new set.
So there is no need to save current backup workers to DBCoreState. Note current
backup workers need to be serialized to LogSystemConfig (in ServerDBInfo) so
that backup workers can check if they have been displaced.
2020-01-22 19:38:46 -08:00
Jingyu Zhou ed54aaa09e Fix a crash failure of empty backup interface 2020-01-22 19:38:46 -08:00
Jingyu Zhou 23985da6a0 Use backup worker failed error code during recovery
And use override instead of virtual in TagPartitionedLogSystem.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 9d7a1a77d0 Small fixes. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 0c08161d8e Remove old backup workers when done
For backup workers working on old epochs, once their work is done, they will
notify the master. Then the master removes them from the log system and
acknowledges back to the backup workers so that they can gracefully shut down.

The popping of a backup worker is stalled if there are workers from older
epochs still working. Otherwise, workers from old epochs will lose data.

However, allowing newer epoch to start backup can cause holes in version ranges.
The restore process must verify the backup progress to make sure there are no
holes, otherwise it has to wait.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 85c4a4e422 Address review comments for PR #1625 2020-01-22 19:38:45 -08:00
Jingyu Zhou 116608a0a7 Set backup workers w.r.t. the correct epoch
For backup workers created for previous epoch, we need to associate them with
the correct epoch so that later peekLogRouter can get the correct peek cursor.
Otherwise, the workers can never peek the missing range of mutations.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 73824faf65 Track pseudo tags popping for individual IDs
For each log router ID, we track the popped version of each pseudo tag so that
the popping is only applied up to the minimum of these versions.

Also add more tracing for popping and epochs.
2020-01-22 19:38:45 -08:00
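The per-ID tracking described in commit 73824faf65 might look roughly like the sketch below. It is an assumption-laden illustration, not the TLog implementation: the class `PseudoPopTracker` and its method `recordPop` are hypothetical, log router IDs and pseudo tags are modeled as small integers, and unpopped pseudo tags are assumed to start at version 0.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using Version = int64_t; // assumption: versions are 64-bit integers

// Hypothetical tracker: for each log router ID, remember how far each pseudo
// tag has popped. Data may only be discarded up to the minimum of those
// versions, so a slow consumer is never overtaken by a fast one.
class PseudoPopTracker {
public:
    // numPseudoTags must be at least 1.
    explicit PseudoPopTracker(int numPseudoTags) : numPseudoTags_(numPseudoTags) {}

    // Record that pseudoTag (0 <= pseudoTag < numPseudoTags) popped routerId up
    // to upTo, and return the version that is safe to pop for that router ID.
    Version recordPop(int routerId, int pseudoTag, Version upTo) {
        std::vector<Version>& versions = popped_[routerId];
        if (versions.empty()) {
            versions.assign(numPseudoTags_, 0); // nothing popped yet
        }
        versions[pseudoTag] = std::max(versions[pseudoTag], upTo);
        return *std::min_element(versions.begin(), versions.end());
    }

private:
    int numPseudoTags_;
    std::map<int, std::vector<Version>> popped_;
};
```

With two pseudo tags on one router ID, a pop to version 50 by the first tag alone keeps the effective pop at 0 until the second tag also pops.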
Jingyu Zhou 3509209d3f Fix not setting epoch for old log system 2020-01-22 19:38:45 -08:00
Jingyu Zhou a1095c8250 Remove epoch from DBCoreState
Use existing recoveryCount if needed.
2020-01-22 19:38:45 -08:00
Jingyu Zhou d5a92e1805 Fix pseudo locality usage bug
Somehow pseudo localities are not saved to LogSystemConfig and getPseudoPopTag()
should translate LogRouter tag to pseudo tags.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 19d6a889ff Recruit backup workers for old epochs
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remain in
the cluster until the next epoch, when they will remove themselves.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 11964733b7 WIP: should be divided into smaller commits. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 17002740bb Add epoch and backup workers to DBCoreState
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to back up the unfinished version range.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 7da9f47f26 Enable pop from backup workers
This is still WIP as some edge cases can trigger test failure, most likely due
to not popping mutations by backup workers when epoch ends.
2020-01-22 19:38:45 -08:00
Jingyu Zhou a797958af6 Update peekLogRouter for backup workers to peek 2020-01-22 19:37:48 -08:00
Jingyu Zhou a4d6ebe79e Recruit backup worker in newEpoch 2020-01-22 19:37:48 -08:00
Jingyu Zhou eac49bca04 Add backup worker recruitment in master. 2020-01-22 19:35:30 -08:00
Jingyu Zhou 442738b6db Small code refactoring 2020-01-22 19:35:30 -08:00
Evan Tschannen 3f9d9d8b84 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	cmake/FlowCommands.cmake
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/StorageServerInterface.h
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/Knobs.h
#	flow/Platform.cpp
#	versions.target
2020-01-16 18:37:47 -08:00
Evan Tschannen d8c3c2fda4 Improved prioritization of commit path on the proxies 2019-12-18 16:56:35 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
Evan Tschannen afc9713005 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/FDBTypes.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/OldTLogServer_6_0.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	versions.target
2019-11-06 13:45:37 -08:00
Evan Tschannen a8ca47beff optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag> 2019-11-05 18:07:30 -08:00