Sometimes the true-up has to go back multiple epochs to find saved versions,
because a tag's progress can be missing in an epoch. In other words, we need to
check progress for all tags.
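
The look-back can be sketched as follows. This is only an illustration, not
the actual code: the SavedProgress layout (epoch -> tag -> saved version) and
the findSavedVersion() helper are hypothetical names made up for this sketch.

```cpp
#include <cstdint>
#include <map>
#include <optional>

using Version = int64_t;
using Epoch = int64_t;
using Tag = int;

// Hypothetical layout for this sketch: epoch -> (tag -> last saved version).
using SavedProgress = std::map<Epoch, std::map<Tag, Version>>;

// Walks backward from `epoch` (inclusive) and returns the saved version of
// `tag` from the newest epoch that has one, or std::nullopt if none does.
std::optional<Version> findSavedVersion(const SavedProgress& progress, Epoch epoch, Tag tag) {
    // upper_bound() points at the first epoch greater than `epoch`, so
    // stepping back once lands on `epoch` itself or the closest older epoch.
    for (auto it = progress.upper_bound(epoch); it != progress.begin();) {
        --it;
        auto found = it->second.find(tag);
        if (found != it->second.end()) {
            return found->second;
        }
        // This epoch has no progress for the tag; keep looking further back.
    }
    return std::nullopt;
}
```

Each tag is looked up separately, so a tag missing in the latest epoch can
still pick up a version saved several epochs earlier.
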
The oldest epoch the master gets may be assumed to begin at version 1, which
can be wrong. In this case, we use the saved backup progress to "true-up" the
real begin version.
Sometimes, an epoch's begin version is lower than the previous epoch's end
version. In some rare cases, the master ends up recruiting backup workers for
both epochs, and they have overlapping ranges of
[epochBeginVersion, prevEpochEndVersion]. Since the popping order is by epoch,
the previous epoch can pop the mutations and save them to a log file. Then this
epoch will miss the popped mutations in the overlapping range, causing
corrupted mutation logs.
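
A minimal sketch of detecting the overlap described above. The function name
overlappingRange() is hypothetical, and treating equal versions as "no
overlap" is an assumption taken from the "lower than" wording.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

using Version = int64_t;

// Returns the range [epochBeginVersion, prevEpochEndVersion] covered by both
// epochs, or std::nullopt when this epoch begins at or after the previous
// epoch's end version (assumption: equal versions do not count as overlap).
std::optional<std::pair<Version, Version>> overlappingRange(Version prevEpochEndVersion,
                                                            Version epochBeginVersion) {
    if (epochBeginVersion < prevEpochEndVersion) {
        return std::make_pair(epochBeginVersion, prevEpochEndVersion);
    }
    return std::nullopt;
}
```
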
In a rare scenario, the master can recruit backup workers with more tags than
the number of log router tags for an epoch. This can be caused by an
unsuccessful recovery, which uses more tags than the next epoch. When
recruiting for the next epoch, if no progress has been made yet, the recruiting
logic will look back at the previous epoch. If the previous epoch has saved past
this epoch's begin version, the current epoch's progress is updated with that
information, which can result in more tags being inserted into this epoch's
recruitment.
The backup worker starts by checking whether there are backup keys and then runs
the monitorBackupKeyOrPullData() loop, which does the check again. The second
check can be delayed, which causes the loop to perform NOOP pops. The fix removes
this second check and uses the result of the first check to decide what to do
in the loop.
A partial recovery can result in an empty epoch that copies the previous epoch's
version range. In this case, getOldEpochTagsVersionsInfo() will not return
the previous epoch's information. To correctly compute the start version for a
backup worker, we need to check the previous epoch's saved versions. If a saved
version is larger than this epoch's begin version, use that saved version as the
start version.
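
The start-version rule above can be sketched as follows. The helper name
computeStartVersions(), the dense tag numbering 0..tagCount-1, and the
tag -> saved version map are assumptions made for this illustration.

```cpp
#include <cstdint>
#include <map>

using Version = int64_t;
using Tag = int;

// For every tag in [0, tagCount), picks the start version of its backup
// worker: the previous epoch's saved version when that is larger than this
// epoch's begin version, otherwise the begin version itself.
std::map<Tag, Version> computeStartVersions(Version epochBeginVersion, int tagCount,
                                            const std::map<Tag, Version>& prevEpochSaved) {
    std::map<Tag, Version> starts;
    for (Tag tag = 0; tag < tagCount; ++tag) {
        Version start = epochBeginVersion;
        auto it = prevEpochSaved.find(tag);
        if (it != prevEpochSaved.end() && it->second > epochBeginVersion) {
            start = it->second; // previous epoch already saved past our begin version
        }
        starts[tag] = start;
    }
    return starts;
}
```
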
Get rid of the complex logic of choosing the largest saved version from the
previous epoch for the oldest epoch. Instead, use the begin version now
available from the log system.
Sometimes a master recovery happens before the backup worker has updated its
progress in the system space. As a result, the next epoch doesn't know the
progress of previous ones. This change checks for such gaps and fills them with
the whole range [startVersion, endVersion).
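
A rough sketch of the gap filling, not the actual implementation. The names
VersionRange and unfinishedRanges() are hypothetical, and resuming a tag at
savedVersion + 1 is an assumption. Only the "whole range for a tag with no
progress" part comes from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

using Version = int64_t;
using Tag = int;

// Half-open version range [begin, end).
struct VersionRange {
    Version begin;
    Version end;
};

// Computes the work left for each tag in [0, tagCount) of one epoch.
// A tag with no saved progress gets the whole range [startVersion, endVersion).
// A tag with saved progress resumes right after its saved version (assumption).
// Tags that already reached the end of the epoch are omitted from the result.
std::map<Tag, VersionRange> unfinishedRanges(Version startVersion, Version endVersion, int tagCount,
                                             const std::map<Tag, Version>& savedVersions) {
    std::map<Tag, VersionRange> result;
    for (Tag tag = 0; tag < tagCount; ++tag) {
        Version begin = startVersion; // default: nothing saved, fill the gap
        auto it = savedVersions.find(tag);
        if (it != savedVersions.end()) {
            begin = std::max(startVersion, it->second + 1);
        }
        if (begin < endVersion) {
            result[tag] = VersionRange{ begin, endVersion };
        }
    }
    return result;
}
```

For example, with the range [100, 200), three tags, and saved versions
{0: 149, 1: 199}, tag 0 is left with [150, 200), tag 1 is finished, and tag 2,
which has no saved progress, gets the whole range [100, 200).
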
The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.