Previously, if multiple metadata messages were written without clearing
transaction state between them, a later message could be missing its
LogProtocolMessage prefix when deserialized. Without that prefix the protocol
version was never set, which caused an assertion failure when deserializing
MutationRefs.
This introduces unhygienic macro variants that inline an `ENABLED &&`
before the TraceEvent. This way, they are compiled out entirely unless
enabled.
Then rewrite all debugMutation uses via sed.
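A minimal sketch of the `ENABLED &&` trick, assuming a constexpr flag. The names `MUTATION_TRACKING_ENABLED`, `DEBUG_MUTATION` and `debugMutationImpl` are hypothetical stand-ins, not necessarily the identifiers in the codebase. The macro is "unhygienic" because it picks up the flag from the surrounding scope; because `&&` short-circuits on a false constant, neither the call nor its arguments are evaluated, so the compiler can drop them.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical compile-time switch; the real flag name may differ.
constexpr bool MUTATION_TRACKING_ENABLED = false;

// Counts how many times the (expensive) trace body actually ran.
static int traceBodyCalls = 0;

// Hypothetical stand-in for building and emitting a TraceEvent.
// Returns true so the macro can be used as an expression.
inline bool debugMutationImpl(const char* context, long version, const std::string& key) {
    ++traceBodyCalls;
    std::printf("MutationTracking %s version=%ld key=%s\n", context, version, key.c_str());
    return true;
}

// Unhygienic variant: `MUTATION_TRACKING_ENABLED &&` is textually inlined
// before the call. With the flag false, the right-hand side is never
// evaluated, so argument construction is skipped as well.
#define DEBUG_MUTATION(context, version, key) \
    (MUTATION_TRACKING_ENABLED && debugMutationImpl((context), (version), (key)))
```

Because the arguments sit on the unevaluated side of `&&`, even costly argument expressions (such as formatting a key) cost nothing when tracking is off.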
This adds mutation tracing at more points, such as:
* Leaving the proxy
* Entering the TLog
* Leaving the TLog
* Being read on a cursor
All of this brought to you by TagsAndMessage!
This also slips in a minor optimization to how mutations are serialized for each target log.
Since a tlog is kept until the backup worker has pulled its mutations, the
old tlogs can only be displaced after the oldest backup epoch equals the
current epoch. So if the master is not recruiting backup workers, it should
set the oldest backup epoch to the current epoch.
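The displacement rule above can be sketched as follows. This is a simplified model under stated assumptions: `LogSystemState`, `canDisplaceOldTLogs` and `onRecruitment` are hypothetical names, and real recovery state carries far more than two epoch numbers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of the master's recovery state.
struct LogSystemState {
    int64_t currentEpoch = 0;
    int64_t oldestBackupEpoch = 0;  // oldest epoch whose tlogs backup still needs
};

// Old tlogs may be displaced only once no backup worker still needs them,
// i.e. the oldest backup epoch has caught up with the current epoch.
bool canDisplaceOldTLogs(const LogSystemState& s) {
    assert(s.oldestBackupEpoch <= s.currentEpoch);
    return s.oldestBackupEpoch == s.currentEpoch;
}

// Without backup workers nothing will ever pull from old tlogs, so the
// master advances oldestBackupEpoch to the current epoch right away.
void onRecruitment(LogSystemState& s, bool recruitingBackupWorkers) {
    if (!recruitingBackupWorkers) {
        s.oldestBackupEpoch = s.currentEpoch;
    }
}
```

If the master skipped this step while not recruiting backup workers, `canDisplaceOldTLogs` would stay false forever and old tlogs would never be released.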
The start version of a tlog set can be smaller than the last epoch's end version.
In this case, set the backup worker's start version to the last epoch's end
version to avoid overlapping version ranges among backup workers.
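A one-line sketch of the clamping described above; `backupWorkerStartVersion` is a hypothetical helper name. Clamping to the previous epoch's end keeps each worker's range `[start, end)` disjoint from the prior epoch's range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using Version = int64_t;

// Hypothetical helper: the tlog set may begin before the previous epoch
// ended, so start the backup worker no earlier than that end version.
Version backupWorkerStartVersion(Version tlogSetBegin, Version lastEpochEnd) {
    return std::max(tlogSetBegin, lastEpochEnd);
}
```

For example, a tlog set beginning at version 90 after an epoch that ended at 100 yields a start version of 100, so versions 90 to 99 are not backed up twice.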
Get rid of the complex logic of choosing the largest saved version from the
previous epoch for the oldest epoch. Instead, use the begin version that is
now available from the log system.
Sometimes a backup worker has not yet saved its progress to the system space
when a master recovery happens. As a result, the next epoch doesn't know the
progress of previous ones. This change checks for such missing gaps and fills
them with the whole range [startVersion, endVersion).
The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
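The gap check can be sketched as below, under simplifying assumptions: each old epoch is summarized by a hypothetical `EpochInfo` with a tag count, and saved progress is a plain nested map rather than keys in the system space. This is an illustration, not the BackupProgress.actor.* code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

using Version = int64_t;
using Epoch = int64_t;
using Tag = int;  // hypothetical: a backup tag is reduced to an id here

// Hypothetical summary of one old epoch from the log system.
struct EpochInfo {
    Version startVersion;
    Version endVersion;  // exclusive
    int numTags;         // tags 0 .. numTags-1 each need a backup worker
};

// saved[epoch][tag] = progress the tag's worker saved before recovery.
using SavedProgress = std::map<Epoch, std::map<Tag, Version>>;

// Finds (epoch, tag) pairs with no saved progress at all and assigns each
// the whole epoch range [startVersion, endVersion), since nothing is known
// about how far that worker got.
std::map<std::pair<Epoch, Tag>, std::pair<Version, Version>>
fillMissingProgress(const std::map<Epoch, EpochInfo>& epochs, const SavedProgress& saved) {
    std::map<std::pair<Epoch, Tag>, std::pair<Version, Version>> missing;
    for (const auto& [epoch, info] : epochs) {
        auto it = saved.find(epoch);
        for (Tag tag = 0; tag < info.numTags; ++tag) {
            bool hasProgress = it != saved.end() && it->second.count(tag) > 0;
            if (!hasProgress) missing[{epoch, tag}] = {info.startVersion, info.endVersion};
        }
    }
    return missing;
}
```

Re-running a whole range may duplicate some work the lost worker already did, but it never leaves a hole, which is the property restore depends on.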
For backup workers working on old epochs, once their work is done, they
notify the master. The master then removes them from the log system and
acknowledges back to the backup workers so that they can shut down gracefully.
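A toy model of this handshake, assuming workers are identified by an (epoch, tag id) pair. `Master`, `BackupWorker`, `onWorkerDone` and `finishOldEpoch` are hypothetical names, and the real exchange is asynchronous messaging rather than a direct call. The point it shows is the ordering: removal from the log system happens before the acknowledgement, and the worker exits only after receiving it.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical model: a backup worker is identified by (epoch, tag id).
using WorkerId = std::pair<int64_t, int>;

struct Master {
    std::set<WorkerId> oldEpochWorkers;  // still registered in the log system

    // Removes the worker from the log system, then acknowledges. Erasing is
    // idempotent, so a retried notification is still acknowledged.
    bool onWorkerDone(const WorkerId& w) {
        oldEpochWorkers.erase(w);
        return true;  // the acknowledgement
    }
};

struct BackupWorker {
    WorkerId id;
    bool shutDown = false;

    // Notify the master and shut down only after the acknowledgement arrives.
    void finishOldEpoch(Master& master) {
        if (master.onWorkerDone(id)) shutDown = true;
    }
};
```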
The popping of a backup worker is stalled if workers from older epochs are
still working; otherwise, workers from old epochs would lose data. However,
allowing a newer epoch to start backup can cause holes in version ranges.
The restore process must verify the backup progress to make sure there are no
holes; otherwise it has to wait.
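The restore-side check can be sketched as an interval-coverage test, assuming backup progress can be reduced to a list of saved half-open version ranges. `coversWithoutHoles` is a hypothetical name; sorting by begin version and sweeping forward finds the first uncovered version, if any.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Version = int64_t;
using Range = std::pair<Version, Version>;  // [begin, end)

// Returns true only if the saved ranges cover [target.first, target.second)
// with no holes. If it returns false, the restore has to wait.
bool coversWithoutHoles(std::vector<Range> ranges, Range target) {
    std::sort(ranges.begin(), ranges.end());
    Version covered = target.first;  // every version below this is covered
    for (const auto& [begin, end] : ranges) {
        if (covered >= target.second) break;
        if (begin > covered) return false;  // hole at [covered, begin)
        covered = std::max(covered, end);
    }
    return covered >= target.second;
}
```

Overlapping ranges are tolerated here; only a gap between the covered prefix and the next range counts as a hole.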
For each log router ID, we track the popped version of each pseudo tag so that
popping is applied only up to the minimum of these versions.
Also add more tracing for popping and epochs.
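A sketch of the per-router bookkeeping, assuming pseudo tags are numbered `0 .. n-1` for a single log router. `RouterPopTracker` is a hypothetical name. Each pop raises one pseudo tag's version monotonically, and the version actually popped on the router is the minimum across all pseudo tags, so no consumer loses data it has not yet read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

using Version = int64_t;

// Hypothetical tracker for one log router ID.
class RouterPopTracker {
public:
    explicit RouterPopTracker(int numPseudoTags) {
        assert(numPseudoTags > 0);
        for (int t = 0; t < numPseudoTags; ++t) popped_[t] = 0;
    }

    // Records a pop for one pseudo tag and returns the version that can be
    // safely popped on the router: the minimum across all pseudo tags.
    Version pop(int pseudoTag, Version upTo) {
        assert(popped_.count(pseudoTag) == 1);
        Version& v = popped_[pseudoTag];
        v = std::max(v, upTo);  // pops never move backwards
        Version minPopped = popped_.begin()->second;
        for (const auto& entry : popped_) minPopped = std::min(minPopped, entry.second);
        return minPopped;
    }

private:
    std::map<int, Version> popped_;
};
```

With two pseudo tags, popping tag 0 to 100 leaves the router at 0 until tag 1 also advances; popping tag 1 to 150 then lets the router pop to 100.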