Commit Graph

3359 Commits

Author SHA1 Message Date
Xin Dong c901fc269b Changed to use 'rate' instead of 'limit' after some discussion with Evan and AJ 2020-01-30 14:13:56 -08:00
Xin Dong 65c607bc13 Fix the error after the rebase 2020-01-30 14:13:56 -08:00
Xin Dong 1b313a4f7e Address review comments. Rebased with latest master 2020-01-30 14:13:56 -08:00
Xin Dong 9aaf4bc107 Add code coverage mark when sending out the throttled error. 2020-01-30 14:13:56 -08:00
Xin Dong e21426d12a Send error back to the GRV requests with batch priority when the cluster is saturated, instead of blindly enqueuing the requests and letting the client time out. 2020-01-30 14:13:56 -08:00
A.J. Beamon fa51a1abc5
Merge pull request #2604 from xumengpanda/mengxu/fast-restore-valgrind-fix-PR14
Performant restore [14/XX Add-on]: Fix initialized field in VersionBatch struct
2020-01-27 15:22:31 -08:00
Meng Xu 76f30e71dc FastRestore:Init VersionBatch explicitly
Built-in variable may not be zero initialized by
compiler provided default constructor.
2020-01-26 13:15:45 -08:00
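
The initialization concern in the commit above can be sketched as follows. This is a minimal illustration, not the actual VersionBatch definition: the struct name and its fields are hypothetical. In C++, a built-in member without an initializer has an indeterminate value after default-initialization (for example `VersionBatchSketch b;` with no initializers), so giving each field a default member initializer is one way to make the starting state explicit.

```cpp
#include <cstdint>

// Hypothetical stand-in for a version-batch record; the field names are
// illustrative and are not taken from the FoundationDB source.
struct VersionBatchSketch {
    int64_t beginVersion = 0; // default member initializers give every
    int64_t endVersion = 0;   // built-in field a defined starting value
    double sizeInBytes = 0.0;
};

// Returns a batch whose fields are all explicitly initialized.
VersionBatchSketch makeEmptyBatch() {
    return VersionBatchSketch{};
}
```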
Alvin Moore d03e49b4a1 Fixed the location of crc32c.h from fdbrpc to flow 2020-01-26 07:01:25 -08:00
Alex Miller 6945a6ea01
Merge pull request #2345 from zjuLcg/add-consistency-verification-in-mako-workload
Add consistency verification in mako workload
2020-01-24 17:07:49 -08:00
Evan Tschannen 8f599e9d15 fix: backupWorker would crash when run outside of simulation 2020-01-23 19:06:39 -08:00
Evan Tschannen 76e192d490
Merge pull request #2538 from alexmiller-apple/hashlittle2-to-crc32c
Convert more hashlittle{,2} uses to crc32c_append
2020-01-23 17:54:38 -08:00
Evan Tschannen 6c0b934dda
Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again
Fix the 10min multi-region recovery stall again
2020-01-23 17:53:02 -08:00
A.J. Beamon b2c8a4a34c
Merge pull request #2519 from xumengpanda/mengxu/fast-restore-versionBatch-fixSize-PR
Performant restore [14/XX]: Ensure each version-batch not exceed a configured size
2020-01-23 16:49:01 -08:00
A.J. Beamon 8a065b9da4
Merge pull request #2557 from alexmiller-apple/reduce-versionstamp-conflictranges
Narrow the unreadable range of keys after a versionstamped key operation
2020-01-23 11:14:47 -08:00
Jingyu Zhou 6ddf73e26a Remove code introduced when resolving merge conflicts 2020-01-22 21:23:38 -08:00
Jingyu Zhou 39fbacbc4f Address review comments 2020-01-22 19:43:40 -08:00
Jingyu Zhou acebfdc67b Restore storage queue limit to 0 in consistency check
The storage queue is no longer going to be a problem failing tests. Now the
backup worker life cycle is tied with backup. So consistency check only happens
after the backup workload is done. Thus, we no longer need to save backup
progress when consistency check is running.
2020-01-22 19:43:40 -08:00
Jingyu Zhou c6c39ca99d Update better master exist with backup workers
During recruitment, if there is no desired log router count, use tlog size
instead, because the number of backup workers has to be larger than 0.
2020-01-22 19:43:40 -08:00
Jingyu Zhou 8b67a89eed More review comments fixed. 2020-01-22 19:42:13 -08:00
Jingyu Zhou 1eaea91cb3 Address review comments 2020-01-22 19:42:13 -08:00
Jingyu Zhou 1311fec45a Add an option to get minKnownCommittedVersion from Proxies
The backup worker needs to use this version for popping when running in a NOOP
mode. This option is added to GetReadVersionRequest and proxies will send back
minKnownCommittedVersion if the option is set.

Also add a couple of knobs for backup workers.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 7989f3f015 Add NOOP to backup worker
The backup worker just blindly pops tags if the "backupStartedKey" is not set.
Note the commit version from TLog cannot be used as the pop version, because
for a single region, during a recovery the log router tags are used to recover
mutations. The backup worker can potentially pop mutations that are needed for
recovery, causing consistency errors. So the solution for now is to use commit
version - 5,000,000, which is a version guaranteed to be persisted on all
replicas.
2020-01-22 19:42:13 -08:00
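
The pop-version rule described in the commit above can be sketched as below. The 5,000,000-version lag is taken from the commit message; the constant name, the function name, and the clamp at zero are assumptions added for illustration and may not match the real code.

```cpp
#include <algorithm>
#include <cstdint>

using Version = int64_t;

// Lag from the commit message; the constant name is hypothetical.
constexpr Version kNoopPopLag = 5000000;

// Version a NOOP backup worker could pop up to, given the latest commit
// version it has seen. Clamping at 0 is an assumption of this sketch so
// that early versions never produce a negative result.
Version noopPopVersion(Version commitVersion) {
    return std::max<Version>(0, commitVersion - kNoopPopLag);
}
```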
Jingyu Zhou c08a192c75 Add a backup start key
If the backup key is not set, do not recruit backup workers for old epochs.
2020-01-22 19:42:13 -08:00
Jingyu Zhou e14246ac16 Add more information for trace events 2020-01-22 19:42:13 -08:00
Jingyu Zhou 4bed33031f Set backup worker start version to be savedVersion + 1
If no progress found, start version is set to epochBegin. So the start version
is the one after the last saved (or from last epoch's saved) version.
2020-01-22 19:42:13 -08:00
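
A minimal sketch of the start-version rule from the commit above, assuming the saved progress is represented as an optional value. The function name and signature are hypothetical; only the rule itself (savedVersion + 1, else epochBegin) comes from the commit message.

```cpp
#include <cstdint>
#include <optional>

using Version = int64_t;

// Returns the first version a backup worker should pull: one past the last
// saved version if progress exists, otherwise the beginning of the epoch.
Version backupStartVersion(std::optional<Version> savedVersion, Version epochBegin) {
    return savedVersion ? *savedVersion + 1 : epochBegin;
}
```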
Jingyu Zhou dcd0a46bc6 Fix a rare remote recovery bug
This bug was introduced when I added log router tags unconditionally to any
configurations. In newEpoch(), the wait for remote recovery is conditioned on
"logRouterTags == 0", which always became false. Thus remote recovery was not
performed and remote TLogs did not copy data from the previous epoch's TLogs
(the previous epoch was a single region configuration). As a result, storage
servers could not peek/get the data and did not pop tags. Thus,
waitForFullReplication() became stuck and the test eventually timed out.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 56a2c37071 Recruit backup workers for single region
Enable log router tags for single region, which are popped by backup workers.
Need to add a noop for backup workers if there are no active backups.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 0e5f5b50f0 Remove unused backup worker knobs 2020-01-22 19:38:46 -08:00
Jingyu Zhou 60f360c954 Log oldest backup epoch in the backup worker 2020-01-22 19:38:46 -08:00
Jingyu Zhou 568a8a8e77 Use big endian for mutation log files
For each mutation, its version, sub-version, and size are prefixed with big
endian representation. This is required, especially for the first version
variable, because we use 0xFF for padding purposes. A little endian version
number can easily collide with 0xFF, while big endian is guaranteed to have
0x00 as the first byte.
2020-01-22 19:38:46 -08:00
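
The header layout described in the commit above can be sketched as follows. The field widths (64-bit version, 32-bit sub-version, 32-bit size) and all function names are assumptions for illustration; the commit only states that the three fields are written in big endian. The sketch shows why byte order matters here: a small version such as 0x01FF starts with 0x00 in big endian, whereas its little endian form would start with 0xFF and be indistinguishable from padding.

```cpp
#include <cstdint>
#include <string>

// Appends v to out, most significant byte first.
void appendBigEndian64(std::string& out, uint64_t v) {
    for (int shift = 56; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Appends v to out, most significant byte first.
void appendBigEndian32(std::string& out, uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Builds a 16-byte mutation header. The widths are assumed, not confirmed
// by the commit message.
std::string encodeMutationHeader(uint64_t version, uint32_t subVersion, uint32_t size) {
    std::string out;
    appendBigEndian64(out, version);
    appendBigEndian32(out, subVersion);
    appendBigEndian32(out, size);
    return out;
}
```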
Jingyu Zhou 954743977b Add paddings to a block in mutation log files
This is needed; otherwise decoding cannot be performed.
2020-01-22 19:38:46 -08:00
Jingyu Zhou e4aea9b66d Use VectorRef<Tag> for VersionedMessage 2020-01-22 19:38:46 -08:00
Jingyu Zhou 7f7ec99170 Serialize and deserialize new backup files
The BackupWorker writes files that can be read by FileConverter. Move
StringRefReader to the header file for reuse in FileConverter.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 56f40a978e Backport changes to OldTLogServer_6_2 2020-01-22 19:38:46 -08:00
Jingyu Zhou f21d7ca44c Add tag ID to backup log file names 2020-01-22 19:38:46 -08:00
Jingyu Zhou 2c83fbfe6c Rename to BackupWorker.actor.cpp to be explicit
There is already one file named backup.actor.cpp in "fdbbackup/".
2020-01-22 19:38:46 -08:00
Jingyu Zhou 2b2325036a Fix compiler error of using override 2020-01-22 19:38:46 -08:00
Jingyu Zhou 4ed75e37f3 BackupProgress uses old epoch's begin version if no progress found
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 42430e8f5e Add epochBegin version to OldTLogCoreData/OldLogData/OldTLogConf
This is to simplify the backup process so that whenever there is an old epoch
in the log system, we always know its begin version and can backup from that
version if no progress is known for that old epoch.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 250137a52f Change BackupProgress to be a class
Struct doesn't need addref() or delref() members, though.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 1e0753a327 Remove backup workers from DBCoreState
This is no longer needed.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 19eacac3ce Add a unit test for BackupProgress 2020-01-22 19:38:46 -08:00
Jingyu Zhou 64052f6349 Check and fill backup gaps for old epochs and tags
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).

The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
2020-01-22 19:38:46 -08:00
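
The gap-filling idea in the commit above might look roughly like the sketch below. All type and function names are hypothetical, tags are modeled as plain integers, and the saved progress is modeled as a map from (epoch, tag) to the last saved version; the real BackupProgress code may be organized quite differently. For every (epoch, tag) pair with no recorded progress, the whole range [begin, end) of the epoch is returned as remaining work.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Version = int64_t;
using Tag = int; // simplified stand-in for a log router tag

struct EpochInfoSketch {
    Version begin;  // first version of the epoch
    Version end;    // one past the last version of the epoch
    int tagCount;   // number of tags used in the epoch
};

struct BackupRangeSketch {
    int64_t epoch;
    Tag tag;
    Version begin;  // inclusive
    Version end;    // exclusive
};

// Lists the version ranges still to be backed up. A missing progress entry
// means the whole epoch range for that tag is returned.
std::vector<BackupRangeSketch> unfinishedRanges(
    const std::map<int64_t, EpochInfoSketch>& epochs,
    const std::map<std::pair<int64_t, Tag>, Version>& savedProgress) {
    std::vector<BackupRangeSketch> result;
    for (const auto& entry : epochs) {
        const int64_t epoch = entry.first;
        const EpochInfoSketch& info = entry.second;
        for (Tag tag = 0; tag < info.tagCount; ++tag) {
            auto it = savedProgress.find(std::make_pair(epoch, tag));
            Version begin = (it == savedProgress.end()) ? info.begin : it->second + 1;
            if (begin < info.end)
                result.push_back(BackupRangeSketch{ epoch, tag, begin, info.end });
        }
    }
    return result;
}
```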
Jingyu Zhou 08d9f36071 Add tags for backup worker trace events 2020-01-22 19:38:46 -08:00
Jingyu Zhou 52bdaeee39 Do not save backup workers to core state and back
Each master starts from an empty set of backup workers and recruits a new set.
So there is no need to save current backup workers to DBCoreState. Note current
backup workers need to be serialized to LogSystemConfig (in ServerDBInfo) so
that backup workers can check if they have been displaced.
2020-01-22 19:38:46 -08:00
Jingyu Zhou ed54aaa09e Fix a crash failure of empty backup interface 2020-01-22 19:38:46 -08:00
Jingyu Zhou 67ad260b9e Fix OOM in backup worker
For a backup worker working on old epochs, make it a contract that the worker
won't pull messages after the end version. This potentially saves memory and
simplifies the saving logic.

Fix the wrong backup epoch when sending BackupWorkerDoneRequest.
2020-01-22 19:38:46 -08:00
Jingyu Zhou 297da14aba Fix backup worker not popping up to end version
Previously, the pop version is the min of minKnownCommittedVersion and
endVersion. In the case of backup worker for previous epoch, the endVersion
should be used.
2020-01-22 19:38:46 -08:00
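
One possible reading of the pop-version fix above is sketched here. The function name and the isOldEpoch flag are hypothetical, and keeping the earlier min() rule for the current epoch is an assumption; the commit message only says that a worker for a previous epoch should use the end version.

```cpp
#include <algorithm>
#include <cstdint>

using Version = int64_t;

// Chooses the version to pop up to. A worker for a previous epoch pops all
// the way to the end version; otherwise the earlier min() rule is kept
// (an assumption of this sketch).
Version choosePopVersion(Version minKnownCommittedVersion, Version endVersion, bool isOldEpoch) {
    return isOldEpoch ? endVersion : std::min(minKnownCommittedVersion, endVersion);
}
```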
Jingyu Zhou 40436a4e78 Filter out non-backup related mutations 2020-01-22 19:38:45 -08:00
Jingyu Zhou ff512b0c93 Fix memory corruption due to invalid Arena
For an ILogPeekCursor, the arena becomes invalid if hasMessage() is false.
So the backup worker needs to keep a reference to the arena so that the message
refers to a memory area that is still valid.
2020-01-22 19:38:45 -08:00