Previously, the pop version was the min of minKnownCommittedVersion and
endVersion. For a backup worker serving a previous epoch, the endVersion
should be used instead.
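The rule above can be sketched roughly as follows. This is an illustration only: the function name `pop_version` and the `is_old_epoch` flag are hypothetical, and the behavior for current-epoch workers is assumed to stay with the earlier min() rule.

```python
def pop_version(min_known_committed_version, end_version, is_old_epoch):
    """Hypothetical helper: choose the version a backup worker pops to."""
    if is_old_epoch:
        # A worker for a previous epoch pops up to that epoch's end version.
        return end_version
    # Assumption: workers for the current epoch keep the earlier min() rule.
    return min(min_known_committed_version, end_version)
```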
For an ILogPeekCursor, the arena becomes invalid if hasMessage() is false.
So the backup worker needs to keep a reference to the arena so that the message
refers to memory that is still valid.
If a mutation has txsTag, then it is a change to the in-memory key-value store,
i.e., the transaction state store, and should be ignored by the backup worker.
The only exception is for the "metadataVersionKey", which needs to be stored in
the backup.
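A minimal sketch of this filtering rule, assuming a simplified mutation record. The `Mutation` class, the `TXS_TAG` value, and the byte string used for the metadata version key are hypothetical stand-ins; the real tag and key representations are not given here.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real tag and key values.
TXS_TAG = "txs"
METADATA_VERSION_KEY = b"metadataVersionKey"


@dataclass(frozen=True)
class Mutation:
    tags: frozenset
    key: bytes


def should_back_up(mutation):
    """Return True if the backup worker should keep this mutation."""
    if TXS_TAG not in mutation.tags:
        # Ordinary mutation: always backed up.
        return True
    # Transaction state store change: ignored, except for the metadata version key.
    return mutation.key == METADATA_VERSION_KEY
```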
This is the first step in the new backup's data pipeline. Verification of file
content is needed in future commits. Clear documentation of the file format is
a work in progress.
The backup worker needs to update its progress even during consistency check by
committing transactions to the database. Thus we can't really achieve a zero
storage server queue, so add a limit of 10,000 for the consistency check to pass.
When a master starts, backup workers from old epochs may send
BackupWorkerDoneRequest to it. The master can safely ignore these requests,
since the checkRemoved logic of the backup worker lets the worker exit on its own.
For backup workers working on old epochs, once their work is done, they will
notify the master. The master then removes them from the log system and
acknowledges the backup workers so that they can shut down gracefully.
The popping of a backup worker is stalled if there are workers from older
epochs still working. Otherwise, workers from old epochs will lose data.
However, allowing a newer epoch to start backup can cause holes in version ranges.
The restore process must verify the backup progress to make sure there are no
holes; otherwise it has to wait.
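The hole check a restore process needs could look something like the sketch below. It assumes backup progress is available as a list of half-open (start, stop) version ranges; the function name `find_holes` and that representation are hypothetical, not the actual restore code.

```python
def find_holes(ranges, begin, end):
    """Return the gaps in [begin, end) not covered by half-open (start, stop) ranges."""
    holes = []
    cursor = begin  # first version not yet known to be covered
    for start, stop in sorted(ranges):
        if stop <= cursor:
            continue  # range lies entirely before the cursor
        if start >= end:
            break  # remaining ranges start past the target range
        if start > cursor:
            holes.append((cursor, start))
        cursor = max(cursor, stop)
    if cursor < end:
        holes.append((cursor, end))
    return holes
```

A restore would proceed only when the returned list is empty, and wait otherwise.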
For backup workers created for a previous epoch, we need to associate them with
the correct epoch so that later peekLogRouter can get the correct peek cursor.
Otherwise, the workers can never peek the missing range of mutations.
After the backup worker recruitment is done, we need to force a
registration with the cluster controller. Otherwise, the log system may not have
the backup workers, which can stall backup workers from obtaining a cursor and
result in mutations being kept in TLogs.
Separate the popping logic into an actor with a shorter interval than the upload
interval. More critically, even if there are no mutations (e.g., during a quiet
database period), the popped version should still be advanced.
It's possible that after the tester obtains the data distributor, the distributor
dies and a new one is recruited. Because the tester is still contacting the old
one, it becomes stuck.
For each log router ID, we track the popped version of each pseudo tag so that
popping is only applied up to the minimum of these versions.
Also add more tracing for popping and epochs.
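The per-router tracking of popped versions can be illustrated with the sketch below. The class `PopTracker`, its method names, and the choice to treat a pseudo tag that has not popped yet as version 0 are assumptions made for this example.

```python
from collections import defaultdict


class PopTracker:
    """Hypothetical tracker of popped versions per log router and pseudo tag."""

    def __init__(self, pseudo_tags):
        self.pseudo_tags = tuple(pseudo_tags)
        # router_id -> {pseudo_tag: popped_version}; unpopped tags start at 0.
        self.popped = defaultdict(lambda: {tag: 0 for tag in self.pseudo_tags})

    def pop(self, router_id, pseudo_tag, version):
        """Record a pop and return the version that is safe to pop for real."""
        versions = self.popped[router_id]
        # Popped versions never move backwards.
        versions[pseudo_tag] = max(versions[pseudo_tag], version)
        # Data may only be discarded once every pseudo tag has popped past it.
        return min(versions.values())
```

With two pseudo tags, a pop to version 100 on one tag has no effect until the other tag also pops; the effective pop is always the smaller of the two.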
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remain in
the cluster until the next epoch, when they remove themselves.
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to back up the unfinished version range.
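As a rough sketch of how an unfinished range might be derived during recovery: the helper below assumes the last saved version of an old epoch is known and that the epoch's end version is exclusive. The name `unfinished_range` and both conventions are hypothetical.

```python
def unfinished_range(saved_version, epoch_begin, epoch_end):
    """Return the half-open [start, end) range still to back up, or None if done."""
    # Resume right after the last saved version, but never before the epoch begins.
    start = max(saved_version + 1, epoch_begin)
    if start >= epoch_end:
        return None  # the epoch is fully backed up; no new worker is needed
    return (start, epoch_end)
```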