The storage queue is no longer going to be a problem failing tests. Now the
backup worker life cycle is tied with backup. So consistency check only happens
after the backup workload is done. Thus, we no longer need to save backup
progress when consistency check is running.
The backup worker needs to use this version for popping when running in a NOOP
mode. This option is added to GetReadVersionRequest and proxies will send back
minKnownCommittedVersion if the option is set.
Also add a couple of knobs for backup workers.
The backup worker just blindly pop tags if the "backupStartedKey" is not set.
Note the commit version from TLog cannot be used as the pop version, because
for a single region, during a recovery the log router tags are used to recover
mutations. The backup worker can potentially pop mutations that are needed for
recovery, causing consistency errors. So the solution for now is to use commit
version - 5,000,000, which is a version guaranteed to be persisted on all
replicas.
This bug was introduced when I added log router tags unconditionally to any
configurations. In newEpoch(), the wait for remote recovery is conditioned on
"logRouterTags == 0", which always becomes false. Thus remote recovery was not
performed and remote TLogs won't copy data from previous epoch's TLogs
(previous epoch is a single region configuration). As a result, storage servers
cannot peek/get the data, and won't pop tags. Thus, waitForFullReplication()
became stuck and eventually test timeout.
For each mutation, its version, sub-version, and size are prefixed with big
endian representation. This is required, especially for the first version
variable, because we use 0xFF for padding purpose. A little endian version
number can easily collide with 0xFF, while big endian is guaranteed to have
0x00 as the first byte.
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
This is to simplify the backup process so that whenever there is an old epoch
in the log system, we always know its begin version and can backup from that
version if no progress is known for that old epoch.
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).
The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
Each master starts from an empty set of backup workers and recruits a new set.
So there is no need to save current backup workers to DBCoreState. Note current
backup workers need to be serialized to LogSystemConfig (in ServerDBInfo) so
that backup workers can check if they have been displaced.
For backup worker working on old epochs, make it a contract that the worker
won't pull messages after the end version. This potentially saves memory and
simplify the saving logic.
Fix the wrong backup epoch when sending BackupWorkerDoneRequest.
Previously, the pop version is the min of minKnownCommittedVersion and
endVersion. In the case of backup worker for previous epoch, the endVersion
should be used.
For an ILogPeekCursor, the arena becomes invalid if hasMessage() is false.
So the backup worker needs to keep a reference to the arena so that the message
refers to memory area that is still valid.