foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	0823091423	Fix backup worker removal races with setting The master waits for all backup worker recruitment to finish and then sets them in a batch. However, a backup worker could remove itself before the master sets it. As a result, the worker is not removed, the oldest backup epoch can't advance, and the TLog can't be popped.	2020-04-20 11:06:46 -07:00
Jingyu Zhou	280bc94738	Do not recruit backup workers with wrong tags In a rare scenario, the master can recruit backup workers with more tags than the number of log router tags for an epoch. This can be caused by an unsuccessful recovery, which uses more tags than the next epoch. When recruiting for the next epoch, if no progress has been made yet, the recruiting logic will look back at the previous epoch. If previous epoch has saved past this epoch's begin version, current epoch's progress is updated with that information and can result in more tags being inserted to this epoch's recruitment.	2020-03-28 21:19:41 -07:00
Jingyu Zhou	90b40e1d75	Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak Resolve Conflicts: fdbclient/BackupAgent.actor.h fdbserver/BackupWorker.actor.cpp fdbserver/RestoreMaster.actor.cpp fdbserver/masterserver.actor.cpp	2020-03-23 13:35:33 -07:00
Meng Xu	3f31ebf659	New backup:Revise event name and explain code	2020-03-23 10:55:44 -07:00
Jingyu Zhou	0eacf1cdab	trackTlogRecovery listens on backup worker change events Old TLogs can only be removed when backup workers no longer need them (i.e., the oldest backup epoch == current epoch). As a result, the core state changes need to include backup worker changes, which update the oldest backup epoch.	2020-03-20 20:17:32 -07:00
Jingyu Zhou	818072f3cb	Set oldest backup epoch if not recruiting backup workers Since tlog is not kept until backup worker has pulled mutations from it, the old tlogs can only be displaced after oldest backup epoch equals current epoch. So if master is not recruiting backup workers, it should set the oldest backup epoch as the current epoch.	2020-03-20 20:16:43 -07:00
Jingyu Zhou	5b36dcaad5	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch was set to the current master epoch, which is wrong.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	12ed8ad536	Fix backup worker start version when logset start version is lower The start version of tlog set can be smaller than the last epoch's end version. In this case, set backup worker's start version as last epoch's end version to avoid overlapping of version ranges among backup workers.	2020-03-20 20:15:08 -07:00
Meng Xu	94276076de	BackupWorker:Buggify upload delay Add questions to code as well.	2020-03-18 19:04:45 -07:00
Jingyu Zhou	19f6394dc9	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch was set to the current master epoch, which is wrong.	2020-03-18 16:44:17 -07:00
Jingyu Zhou	89d8f13038	Fix backup worker start version when logset start version is lower The start version of tlog set can be smaller than the last epoch's end version. In this case, set backup worker's start version as last epoch's end version to avoid overlapping of version ranges among backup workers.	2020-03-18 16:41:35 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	04052226df	reverting a change which causes data inconsistency between the primary and secondary	2020-03-17 09:41:44 -07:00
Evan Tschannen	0ca89547a5	make sure the number of logRouterTags is larger than the number of satelliteTLogs to avoid having satellites with no data.	2020-03-14 15:02:19 -07:00
Evan Tschannen	4640edf5d6	do not recruit satellite tlogs when usable regions=1	2020-03-13 10:24:52 -07:00
Jingyu Zhou	52c6737411	Rename backupLoggingEnabled as backupWorkerEnabled To highlight the changes for 7.0 backup changes. By default, backup_worker_enabled flag is set for 7.0 version.	2020-02-04 10:09:16 -08:00
Jingyu Zhou	0db03f1d3c	Use backup_logging_enabled flag The default is to enable new backup workers. Users can disable this flag to turn off the backup worker feature.	2020-02-03 20:03:22 -08:00
Jingyu Zhou	38aa1903fd	Add a DB configuration option for backup workers Right now, the default is to keep the old backup behavior, i.e., do NOT use backup workers. Specifically, if BackupType is not set (or is set to default), the master will not recruit backup workers and will not add pseudo locality for backup workers. The StartFullBackupTaskFunc is updated to check if backup worker is enabled. Only when it is not enabled, starting a backup will wait on all backup workers to be started.	2020-01-31 19:29:09 -08:00
Jingyu Zhou	e9c7ad82cc	Comment out pseudo tag pop trace event	2020-01-31 19:29:09 -08:00
Jingyu Zhou	39fbacbc4f	Address review comments	2020-01-22 19:43:40 -08:00
Jingyu Zhou	1eaea91cb3	Address review comments	2020-01-22 19:42:13 -08:00
Jingyu Zhou	dcd0a46bc6	Fix a rare remote recovery bug This bug was introduced when I added log router tags unconditionally to any configurations. In newEpoch(), the wait for remote recovery is conditioned on "logRouterTags == 0", which always becomes false. Thus remote recovery was not performed and remote TLogs won't copy data from previous epoch's TLogs (previous epoch is a single region configuration). As a result, storage servers cannot peek/get the data, and won't pop tags. Thus, waitForFullReplication() became stuck and the test eventually timed out.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	56a2c37071	Recruit backup workers for single region Enable log router tags for single region, which are popped by backup workers. Need to add noop for backup workers if there are no active backups.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	4ed75e37f3	BackupProgress uses old epoch's begin version if no progress found Get rid of the complex logic of choosing the largest saved version from previous epoch for the oldest epoch. Instead, use the begin version now available from log system.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	42430e8f5e	Add epochBegin version to OldTLogCoreData/OldLogData/OldTLogConf This is to simplify the backup process so that whenever there is an old epoch in the log system, we always know its begin version and can backup from that version if no progress is known for that old epoch.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	64052f6349	Check and fill backup gaps for old epochs and tags Sometimes the backup worker has not updated progress to the system space and a master recovery happens. As a result, next epoch doesn't know the progress of previous ones. This change is to check for such missing gaps and fill them with the whole range [startVersion, endVersion). The code is refactored into BackupProgress.actor.* to consolidate backup progress processing for the master server.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	52bdaeee39	Do not save backup workers to core state and back Each master starts from an empty set of backup workers and recruits a new set. So there is no need to save current backup workers to DBCoreState. Note current backup workers need to be serialized to LogSystemConfig (in ServerDBInfo) so that backup workers can check if they have been displaced.	2020-01-22 19:38:46 -08:00
Jingyu Zhou	ed54aaa09e	Fix a crash failure of empty backup interface	2020-01-22 19:38:46 -08:00
Jingyu Zhou	23985da6a0	Use backup worker failed error code during recovery And use override instead of virtual in TagPartitionedLogSystem.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	9d7a1a77d0	Small fixes.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	0c08161d8e	Remove old backup workers when done For backup workers working on old epochs, once their work is done, they will notify the master. Then the master removes them from the log system and acknowledges back to the backup workers so that they can gracefully shut down. The popping of a backup worker is stalled if there are workers from older epochs still working. Otherwise, workers from old epochs will lose data. However, allowing newer epoch to start backup can cause holes in version ranges. The restore process must verify the backup progress to make sure there are no holes, otherwise it has to wait.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	85c4a4e422	Address review comments for PR #1625	2020-01-22 19:38:45 -08:00
Jingyu Zhou	116608a0a7	Set backup workers w.r.t. the correct epoch For backup workers created for previous epoch, we need to associate them with the correct epoch so that later peekLogRouter can get the correct peek cursor. Otherwise, the workers can never peek the missing range of mutations.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	73824faf65	Track pseudo tags popping for individual IDs For each log router ID, we track the popped version of each pseudo tag so that popping is only applied to the minimum of these versions. Also add more tracing for popping and epochs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	3509209d3f	Fix not setting epoch for old log system	2020-01-22 19:38:45 -08:00
Jingyu Zhou	a1095c8250	Remove epoch from DBCoreState Use existing recoveryCount if needed.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	d5a92e1805	Fix pseudo locality usage bug Somehow pseudo localities are not saved to LogSystemConfig and getPseudoPopTag() should translate LogRouter tag to pseudo tags.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remain in the cluster until the next epoch, when they will remove themselves.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	11964733b7	WIP: should be divided into smaller commits.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	17002740bb	Add epoch and backup workers to DBCoreState This enables backup workers to know the end version of the epoch. Additionally, the master recovery only needs to deal with crashed backup workers by recruiting new workers to backup the unfinished version range.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	7da9f47f26	Enable pop from backup workers This is still WIP as some edge cases can trigger test failure, most likely due to not popping mutations by backup workers when epoch ends.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	a797958af6	Update peekLogRouter for backup workers to peek	2020-01-22 19:37:48 -08:00
Jingyu Zhou	a4d6ebe79e	Recruit backup worker in newEpoch	2020-01-22 19:37:48 -08:00
Jingyu Zhou	eac49bca04	Add backup worker recruitment in master.	2020-01-22 19:35:30 -08:00
Jingyu Zhou	442738b6db	Small code refactoring	2020-01-22 19:35:30 -08:00
Evan Tschannen	3f9d9d8b84	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # cmake/FlowCommands.cmake # documentation/sphinx/source/release-notes.rst # fdbclient/StorageServerInterface.h # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/fdbserver.actor.cpp # flow/Knobs.h # flow/Platform.cpp # versions.target	2020-01-16 18:37:47 -08:00
Evan Tschannen	d8c3c2fda4	Improved prioritization of commit path on the proxies	2019-12-18 16:56:35 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	afc9713005	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/FDBTypes.h # fdbserver/LogSystem.h # fdbserver/LogSystemPeekCursor.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # versions.target	2019-11-06 13:45:37 -08:00
Evan Tschannen	a8ca47beff	optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag>	2019-11-05 18:07:30 -08:00

275 Commits