Mutations are only logged when they are within the backup ranges, which means a
range clear mutation has to calculate the intersection ranges and divide a clear
into potentially multiple clear mutations. This part of the code is modeled
after how the proxy handles backup mutations.
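
A minimal sketch of the range-clear splitting described above, written in Python for illustration rather than taken from the actual code. The function name split_clear and the representation of ranges as (begin, end) byte-string pairs with an exclusive end are assumptions.

```python
def split_clear(clear_begin, clear_end, backup_ranges):
    """Return the parts of [clear_begin, clear_end) that lie inside backup ranges.

    Hypothetical helper: each returned pair stands for one clear mutation
    that would be logged. Ranges are half-open, [begin, end).
    """
    pieces = []
    for range_begin, range_end in sorted(backup_ranges):
        begin = max(clear_begin, range_begin)
        end = min(clear_end, range_end)
        if begin < end:  # non-empty intersection with this backup range
            pieces.append((begin, end))
    return pieces


# One clear over [b, y) against three backup ranges yields two logged clears.
clears = split_clear(b"b", b"y", [(b"a", b"c"), (b"m", b"p"), (b"z", b"zz")])
```

A clear that overlaps no backup range produces an empty list, so nothing is logged for it.
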
The mutation logs of backup workers are saved into the "mlogs" directory under the
container directory. The backup worker has been restructured to handle multiple
backups, where each one is stored in a separate backup container.
In the backup worker, mutations pulled from TLogs are buffered in a message
queue. When writing out to different containers, their corresponding mutation
ranges are used to check if a mutation should be written. When a new backup
is submitted by the client, "backupStartedKey" is updated. The worker monitors
this key and updates its internal map of backups, and the next pull from the
TLog waits until the new backup is ready. This ensures that by the time worker 0
marks the backup as started, all workers have already been logging mutations
for the backup.
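
The per-container check can be pictured roughly as follows. This is an illustrative Python sketch, not the worker's implementation: the Mutation type, the route_mutations name, and the dictionary from backup ID to ranges are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    # Hypothetical, simplified stand-in for a mutation pulled from a TLog.
    version: int
    key: bytes


def in_ranges(key, ranges):
    """True if key falls inside any half-open [begin, end) range."""
    return any(begin <= key < end for begin, end in ranges)


def route_mutations(queue, backups):
    """Map each backup ID to the buffered mutations inside its ranges."""
    return {
        backup_id: [m for m in queue if in_ranges(m.key, ranges)]
        for backup_id, ranges in backups.items()
    }


queue = [Mutation(10, b"apple"), Mutation(11, b"melon"), Mutation(12, b"zebra")]
backups = {"backup1": [(b"a", b"n")], "backup2": [(b"m", b"n")]}
routed = route_mutations(queue, backups)
```

The same buffered queue is scanned once per backup, so a mutation covered by two backups is written to both containers.
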
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add a pseudo locality
for backup workers.
The StartFullBackupTaskFunc is updated to check if the backup worker is enabled.
Only when it is enabled will starting a backup wait for all backup workers to
be started.
The backup task may be restarted multiple times so the started key for the
backup task may already be set. In this case, the wait on the watch should be
skipped.
The backup worker ID was changed to be the interface ID, not the request ID.
The lastSeenVersion is replaced with minKnownCommittedVersion, which is
incremented even when there are no mutations. This also avoids the problem of
setting popVersion higher than the actual committed version (no harm here
though, as there are no mutations).
The "backupStartedKey" handling is also fixed to correctly handle cases where
we wait for backups to stop.
TaskBucket::keepRunning() needs to be called in backup transactions to verify
that the task has not been cancelled. If it has, the task stops. Without this
check, a cancelled task can continue to run, causing multiple runs of the same
task.
Another subtle issue is that the beginVersion is persisted in backupStartedKey,
so when reading it back from that key, we should set the task's beginVersion to
the value persisted earlier.
This wait is to make sure that backup workers are already saving mutations so
that no mutations are missed. The idea is that the CLI sets a "backupStartedKey"
in the database and waits for the allWorkerStarted() key of the backup to be set.
Backup workers monitor changes to the "backupStartedKey" and start logging
mutations. Additionally, the backup worker for Tag(-2,0) monitors whether all
other workers have started (by checking that their saved progress version is
larger than the backup's start version), and then sets the allWorkerStarted()
key for the backup.
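
The readiness check performed by the Tag(-2,0) worker could look roughly like this Python sketch. The function name, the list of expected worker tags, and the dictionary of saved progress versions are assumptions made for illustration; a worker with no saved progress is treated as not started.

```python
def all_workers_started(expected_tags, saved_versions, start_version):
    """True once every expected worker has saved progress past start_version.

    Hypothetical helper: expected_tags lists the workers to wait for and
    saved_versions maps a worker tag to its saved progress version.
    """
    return all(
        saved_versions.get(tag, -1) > start_version for tag in expected_tags
    )


ready = all_workers_started([0, 1], {0: 120, 1: 101}, start_version=100)
lagging = all_workers_started([0, 1], {0: 120, 1: 100}, start_version=100)
```

Only after such a check passes would the allWorkerStarted() key be set for the backup.
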
The monitoring loop watches the system key "backupStartedKey" and decides to be
in one of two modes: NOOP and backup. In NOOP mode, the worker just pops TLogs.
In backup mode, the worker pulls mutations from TLogs and saves them into logs.
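
A rough Python sketch of the two modes, for illustration only. The Mode enum, the function names, and in particular the rule that the worker is in backup mode whenever "backupStartedKey" lists at least one backup are assumptions, not statements about the actual code.

```python
from enum import Enum


class Mode(Enum):
    NOOP = "noop"
    BACKUP = "backup"


def choose_mode(started_backups):
    """Assumed rule: back up only while at least one backup is listed."""
    return Mode.BACKUP if started_backups else Mode.NOOP


def handle_batch(mode, batch, saved_log):
    """Save a pulled batch in backup mode; drop it in NOOP mode.

    Dropping the batch stands in for popping the TLogs without saving.
    """
    if mode is Mode.BACKUP:
        saved_log.extend(batch)
    return len(batch)  # number of messages consumed in either mode


saved_log = []
handle_batch(choose_mode([]), ["m1"], saved_log)           # NOOP: nothing saved
handle_batch(choose_mode(["backup1"]), ["m2"], saved_log)  # backup: m2 saved
```
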