Sometimes, an epoch's begin version is lower than the previous epoch's end
version. In some rare casse, the master ends up recruiting backup workers for
both epoch and have overlapping ranges of [epochBeginVersion, prevEpochEndVersion].
Since the popping order is by epoch. Previous epoch can pop the mutation and
save to a log file. Then this epoch will miss these popped mutation in the
overlapping range, causing corrupted mutation logs.
A partial recovery can result in empty epoch that copies previous epoch's
version range. In this case, getOldEpochTagsVersionsInfo() will not return
previous epoch's information. To correctly compute the start version for a
backup worker, we need to check previous epoch's saved version. If they are
larger than this epoch's begin version, use previously saved version as the
start version.
This wait is to make sure that backup workers are already saving mutations so
that no mutations are missed. The idea is that the CLI sets a "backupStartedKey"
in the database and waits for allWorkerStarted() key of the backup to be set.
Backup workers monitor the changes to the "backupStartedKey" and start logging
mutations. Additionally, backup worker for Tag(-2,0) monitors all other workers
have started (checking their saved progress version is larger than the backup's
start version), and then sets the allWorkerStarted() key for the backup.
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).
The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.