When master recovery is triggered by a failed tlog, proxy, resolver, or log
router, emit a trace event that tells which address the master thinks is
dead.
We currently emit Role transition traces when a role starts and when it
ends. While this is useful for debugging, it doesn't work well with tools
that inject data and might miss some trace lines. We do decorate each
trace line with the roles assigned to that particular process; however,
this is not sufficient for tools that can make use of the UID -> Role
mapping.
When master starts recruiting backup workers, if there is no active backup job
or the min version of the backup job is greater than old epoch's end version,
then these old epochs can be skipped.
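
The skip rule above can be sketched as follows. This is a minimal illustration,
not the actual master code: the `OldEpoch` struct, the function name, and the
`hasActiveBackup` flag are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Version = int64_t;

// Hypothetical summary of an old epoch; not the real FoundationDB type.
struct OldEpoch {
    int64_t epoch;
    Version endVersion;
};

// Returns the old epochs that still need backup workers recruited.
// Assumption: minBackupVersion is only meaningful when hasActiveBackup is true.
std::vector<OldEpoch> epochsNeedingBackup(const std::vector<OldEpoch>& oldEpochs,
                                          bool hasActiveBackup,
                                          Version minBackupVersion) {
    std::vector<OldEpoch> needed;
    if (!hasActiveBackup) return needed; // no active job: every old epoch is skipped
    for (const OldEpoch& e : oldEpochs) {
        assert(e.endVersion >= 0); // versions are non-negative in this sketch
        // Skip epochs that ended before the smallest version any job needs.
        if (minBackupVersion > e.endVersion) continue;
        needed.push_back(e);
    }
    return needed;
}
```

With no active backup job the result is empty, so no backup workers are
recruited for any old epoch.
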
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
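
As a rough sketch, the adjustment amounts to taking the larger of the two
versions. The helper name and signature below are hypothetical and only
illustrate the rule; they are not taken from the actual change.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using Version = int64_t;

// Hypothetical helper, for illustration only.
// tlogSetStart: start version of the current tlog set.
// lastEpochEnd: end version of the previous epoch.
Version backupWorkerStartVersion(Version tlogSetStart, Version lastEpochEnd) {
    assert(tlogSetStart >= 0 && lastEpochEnd >= 0);
    // Never start below the previous epoch's end version, so that version
    // ranges of backup workers in different epochs do not overlap.
    return std::max(tlogSetStart, lastEpochEnd);
}
```

For example, a tlog set starting at version 90 after an epoch that ended at
100 yields a backup worker start version of 100.
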
This delay is to ensure old epoch's backup workers can save their progress in
the database. Otherwise, the new master could attempt to recruit backup
workers for the old epoch on version ranges that have already been popped. As
a result, the logs will lose data.
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.
The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is enabled will starting a backup wait on all backup workers to
be started.
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).
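
The gap filling can be sketched as below. This is an illustration under stated
assumptions, not the code in BackupProgress.actor.*: the type and function
names are hypothetical, and the saved version is assumed to be the last
version already backed up (inclusive), so remaining work starts at saved + 1.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Version = int64_t;

// Hypothetical types for illustration only.
struct EpochRange { Version begin; Version end; };              // [begin, end)
struct Unfinished { int64_t epoch; Version begin; Version end; }; // [begin, end)

// epochs: version range of each old epoch, keyed by epoch number.
// saved:  progress found in the system space; an epoch may be missing here
//         if its backup worker never saved progress before the recovery.
std::vector<Unfinished> unfinishedRanges(const std::map<int64_t, EpochRange>& epochs,
                                         const std::map<int64_t, Version>& saved) {
    std::vector<Unfinished> out;
    for (const auto& [epoch, range] : epochs) {
        assert(range.begin <= range.end);
        auto it = saved.find(epoch);
        if (it == saved.end()) {
            // No progress recorded: fill the gap with the whole range.
            out.push_back({ epoch, range.begin, range.end });
        } else if (it->second + 1 < range.end) {
            // Assumption: saved version is inclusive, so resume right after it.
            out.push_back({ epoch, it->second + 1, range.end });
        }
    }
    return out;
}
```

An epoch with no saved progress contributes its full `[begin, end)` range,
while an epoch whose saved version already reaches its end contributes nothing.
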
The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
The backup worker needs to update its progress even during consistency check by
committing transactions to the database. Thus we can't really achieve a zero
storage server queue, so add a limit of 10,000 to pass the consistency check.