<meta charset="utf-8">

# Forward Compatibility for Transaction Logs

## Background

A repeated concern with adopting FoundationDB has been that upgrades are one
way, with no supported rollback. If one were to upgrade a cluster running 6.0
to 6.1, there is no way to roll back to 6.0 if the new version results in
worse client application performance or unavailability. In the interest of
increasing adoption, work has begun on supporting on-disk forward
compatibility, which allows upgrades to be rolled back.

The traditional way of allowing rollbacks is to have one version, `N`, that
introduces a feature but leaves it disabled. `N+1` enables the feature, and
then `N+2` removes whatever was deprecated in `N`. However, FDB currently has
a 6 month release cadence, and waiting 6 months to be able to use a new feature
in production is unacceptably long. Thus, the goal is to provide a sane,
user-friendly upgrade path that supports rollback, but still allows features
to be used immediately if desired.

This document also carries two specific restrictions to the scope of what it covers:

1. This document specifically is **not** a discussion of network protocol
   compatibility nor supporting rolling upgrades. Rolling upgrades of FDB are
   still discouraged, and minor versions are still protocol incompatible with
   each other.
2. This only covers the proposed design of how forward compatibility for
   transaction logs will be handled, and not forward compatibility for
   FoundationDB as a whole. There are other parts of the system that durably
   store data, the coordinators and storage servers, that will not be discussed.

## Overview

A new configuration option, `log_version`, will be introduced to allow a user
to control which on-disk format the transaction logs are allowed to use. Not
every release will affect the on-disk format of the transaction logs, so
`log_version` is an opaque integer that is incremented by one whenever the
on-disk format of the transaction log is changed.

`log_version` is set from `fdbcli`, with an invocation looking like
`$ fdbcli -C cluster.file --exec "configure log_version:=2"`. Note that `:=`
is used instead of `=`, to keep the convention in `fdbcli` that configuration
options that users aren't expected to need (or wish) to modify are set with
`:=`.

Right now, FDB releases and `log_version` values are as follows:

| Release | Log Version |
| ------- | ----------- |
| pre-5.2 | 1           |
| 5.2-6.0 | 2           |
| 6.1+    | 3           |
| 6.2     | 4           |
| 6.3     | 5           |

If a user does not specify any configuration for `log_version`, then
`log_version` will be set so that rolling back to the previous minor version of
FDB will be possible. FDB will always support loading files generated by
default from the next minor version. It will be possible to configure
`log_version` to a higher value on the release that introduces it, if the user
is willing to sacrifice the ability to roll back.

This means FDB's releases will work like the following:

|              | 6.0 | 6.1 | 6.2   | 6.3     |
|--------------|-----|-----|-------|---------|
| Configurable | 2   | 2,3 | 3,4   | 4,5     |
| Default      | 2   | 2   | 3     | 4       |
| Recoverable  | 2   | 2,3 | 2,3,4 | 2,3,4,5 |

Where...

* "configurable" means values considered an acceptable configuration setting for `fdbcli> configure log_version:=N`.
* "default" means what `log_version` will be if you don't configure it.
* "recoverable" means that FDB can load files that were generated from the specified `log_version`.

Configuring to a `log_version` will cause FDB to use the maximum of that
`log_version` and the default `log_version`. The default `log_version` will
always be the minimum configurable log version. This is done so that manually
setting `log_version` once, and then upgrading FDB multiple times, will
eventually cause a low `log_version` left in the database configuration to act
as a request for the default. Configuring `log_version` to a very high number
(e.g. 9999) will cause FDB to always use the highest available log version.

As a concrete example, 6.1 will introduce a new transaction log feature with
on-disk format implications. If you wish to use it, you'll first have to
`configure log_version:=3`. Otherwise, after upgrading to FDB 6.2, it will
become the default. If problems are discovered when upgrading to FDB 6.2, then
roll back to FDB 6.1. (Theoretically. See scope restrictions above.)

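As a rough illustration of the selection rule described in this section, the sketch below computes which `log_version` a given release would end up using. The `RELEASES` dictionary is copied from the compatibility table above. The function name `effective_log_version` and the explicit cap at the highest configurable value are assumptions made for illustration (the cap is one reading of the "very high number" behavior); this is not FDB's actual code.

```python
from typing import Optional

# Values copied from the "Configurable" and "Default" rows of the table above.
RELEASES = {
    "6.0": {"configurable": [2], "default": 2},
    "6.1": {"configurable": [2, 3], "default": 2},
    "6.2": {"configurable": [3, 4], "default": 3},
    "6.3": {"configurable": [4, 5], "default": 4},
}


def effective_log_version(release: str, requested: Optional[int]) -> int:
    """Hypothetical helper: the log_version a release would actually use."""
    info = RELEASES[release]
    default = info["default"]
    highest = max(info["configurable"])
    if requested is None:
        # Nothing configured: fall back to the release's default.
        return default
    # A stale low value acts as a request for the default; an overly
    # high value (e.g. 9999) is assumed to be capped at the highest
    # version this release can write.
    return min(max(requested, default), highest)


print(effective_log_version("6.1", 3))     # opt in to the new format early
print(effective_log_version("6.2", 2))     # low value left over from 6.0/6.1
print(effective_log_version("6.3", 9999))  # always use the newest format
```

Under these assumptions, a cluster that once ran `configure log_version:=2` keeps following the default as it is upgraded, which matches the intent stated above.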
## Detailed Implementation

`fdbcli> configure log_version:=3` sets `\xff/conf/log_version` to `3`. This
version is also persisted as part of the `LogSystemConfig` and thus
`DBCoreState`, so that any code handling the log system will have access to the
`log_version` that was used to create it.

Changing `log_version` will result in a recovery, and FoundationDB will recover
into the requested transaction log implementation. This involves locking the
previous generation of transaction logs, and then recruiting a new generation
of transaction logs. FDB will load `\xff/conf/log_version` as the requested
`log_version`, and when sending an `InitializeTLogRequest` to recruit a new
transaction log, it uses the maximum of the requested log version and the
default `log_version`.

A worker, when receiving an `InitializeTLogRequest`, will initialize a
transaction log corresponding to the requested `log_version`. Transaction logs
can pack multiple generations of transaction logs into the same shared entity,
a `SharedTLog`. `SharedTLog` instances correspond to one set of files, and
will only contain transaction log generations of the same `log_version`.

This allows us to have multiple generations of transaction logs running within
one worker that have different `log_version`s, and if the worker crashes and
restarts, we need to be able to recreate those transaction log instances.

Transaction logs maintain two types of files: one is a pair of files prefixed
with `logqueue-` that are the DiskQueue, and the other is the metadata store,
which is normally a mini `ssd-2` storage engine running within the transaction
log.

When a worker first starts, it scans its data directory for any files that were
instances of a transaction log. It then needs to construct a transaction log
instance that can read the format of the file to be able to reconnect the data
in the files back to the FDB cluster, so that it can be used in a recovery if
needed.

This presents a problem: the worker needs to know all the configuration
options that were used to decide the file format of the transaction log
*before* it can rejoin a cluster and get far enough through a recovery to find
out what that configuration was. To get around this, the relevant
configuration options have been added to the file name so that they're
available when scanning the list of files.

Currently, FDB identifies a transaction log instance by seeing a file that
starts with `log-`, which represents the metadata store. This filename has the
format `log-<UUID>.<SUFFIX>`, where UUID is the `logId` and SUFFIX tells us if
the metadata store is a memory or ssd storage engine file.

This format is being changed to `log2-<KV PAIRS>-<UUID>.<SUFFIX>`, where KV
PAIRS is a small amount of information encoded into the file name to give us
the metadata *about* the file that is required. According to POSIX, the
characters allowed for "fully portable filenames" are `A–Z a–z 0–9 . _ -`, and
the filename length should stay under 255 characters. This leaves `_` as
the only character not already used. Therefore, the KV pair encoding is
`K1_V1_K2_V2_...`, so keys and values are separated by an `_`, and KV pairs are
also separated by an `_`.

The currently supported keys are:

V
: A copy of `log_version`

LS
: `log_spill`, a new configuration option in 6.1

Any unrecognized keys are ignored, which will likely help forward compatibility.

An example file name is `log2-V_3_LS_2-46a5f353ac18d787852d44c3a2e51527-0.fdq`.

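To make the file name encoding concrete, here is a minimal sketch of how such a name could be decoded. It is not FDB's parser: the regular expression, the `TLogFileName` type, and the `parse_tlog_file_name` function are hypothetical, and the sketch assumes the UUID is written as 32 lowercase hex digits (as in the example above) and simply keeps whatever follows the UUID as an opaque remainder.

```python
import re
from typing import Dict, NamedTuple, Optional

# Assumed layout: "log2-", the KV pairs, "-", a 32-hex-digit UUID, then the rest.
_LOG2_NAME = re.compile(
    r"^log2-(?P<kv>[A-Za-z0-9_]*)-(?P<uuid>[0-9a-f]{32})(?P<rest>.*)$"
)

# Keys listed in this document, mapped to the option they carry.
KNOWN_KEYS = {"V": "log_version", "LS": "log_spill"}


class TLogFileName(NamedTuple):
    options: Dict[str, str]  # recognized options, values kept as strings
    log_id: str              # the UUID portion of the name
    rest: str                # everything after the UUID, left uninterpreted


def parse_tlog_file_name(name: str) -> Optional[TLogFileName]:
    """Return the decoded name, or None if it is not in the log2- format."""
    match = _LOG2_NAME.match(name)
    if match is None:
        return None
    kv = match.group("kv")
    fields = kv.split("_") if kv else []
    if len(fields) % 2 != 0:
        return None  # a key without a value: treat the name as malformed
    options = {}
    for key, value in zip(fields[0::2], fields[1::2]):
        if key in KNOWN_KEYS:  # unrecognized keys are skipped, not rejected
            options[KNOWN_KEYS[key]] = value
    return TLogFileName(options, match.group("uuid"), match.group("rest"))


parsed = parse_tlog_file_name(
    "log2-V_3_LS_2-46a5f353ac18d787852d44c3a2e51527-0.fdq"
)
print(parsed)
```

Because `_` is the only separator, the decoder relies on keys and values never containing `_` themselves, which is the constraint the encoding above implies.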
### Testing

`SimulationConfig` has been changed to randomly set `log_version` according to
what is supported. This means that with restarting upgrade tests that simulate
upgrading from `N` to `N+1`, the `N+1` version will see files that came from an
FDB running with any `log_version` value that was previously supported. If
`N+1` can't handle the files correctly, then the simulation test will fail.

`ConfigureTest` tries randomly toggling `log_version` up and down in a live
database, along with all the other log related options. Some are valid, some
are invalid and should be rejected, or will cause ASSERTs in later parts of the
code.

I've added a new test, `ConfigureTestRestart`, that tests changing
configurations and then upgrading FDB, to cover testing that upgrades still
happen correctly when `log_version` has been changed. This also verifies that
on-disk formats for those `log_version`s are still loadable by future FDB
versions.

There are no tests that mix the `ConfigureDatabase` and `Attrition` workloads.
It would be good to do so, to cover the case of `log_version` changes in the
presence of failures, but one cannot be added easily. The simulator calculates
what processes/machines are safe to kill by looking at the current
configuration. For `ConfigureTest`, this isn't good enough, because `triple`
could mean that there are three replicas, or that the FDB cluster just changed
from `single` to `triple` and only has one replica of data until data
distribution finishes. It would be good to add a `ConfigureKillTest` sometime
in the future.

For FDB to actually announce that rolling back from `N+1` to `N` is supported,
there will need to be downgrade tests from `N+1` to `N` also. The default in
`N+1` should always be recoverable within `N`. As FDB isn't promising forward
compatibility yet, these tests haven't been implemented.

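The invariant that the default in `N+1` is recoverable within `N` can at least be checked against the planned values on paper. The sketch below does that for the releases in the Overview table. The `COMPAT` data is copied from that table, while the function name `rollback_violations` is hypothetical. This is only a table-level sanity check under those assumptions, not a substitute for the downgrade tests described above.

```python
# Values copied from the "Default" and "Recoverable" rows of the Overview table.
COMPAT = {
    "6.0": {"default": 2, "recoverable": {2}},
    "6.1": {"default": 2, "recoverable": {2, 3}},
    "6.2": {"default": 3, "recoverable": {2, 3, 4}},
    "6.3": {"default": 4, "recoverable": {2, 3, 4, 5}},
}
RELEASE_ORDER = ["6.0", "6.1", "6.2", "6.3"]


def rollback_violations():
    """List (newer, older) pairs where the newer default is not loadable by the older release."""
    violations = []
    for older, newer in zip(RELEASE_ORDER, RELEASE_ORDER[1:]):
        if COMPAT[newer]["default"] not in COMPAT[older]["recoverable"]:
            violations.append((newer, older))
    return violations


# An empty list means every default can be rolled back one minor version.
print(rollback_violations())
```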
# Transaction Log Forward Compatibility Operational Guide

## Notable Behavior Changes

When release notes mention that a new `log_version` is available, it's worth
considering upgrading `log_version` after deploying that release. Doing so will
allow a controlled upgrade, and reduce the number of new changes that will
take effect when upgrading to the next release.

## Observability

* When running with a non-default `log_version`, the setting will appear in `fdbcli> status`.

## Monitoring and Alerting

Anything that relies on the file names the transaction log uses will need updating, because those file names are changing.


<!-- Force long-style table of contents -->
<script>window.markdeepOptions={}; window.markdeepOptions.tocStyle="long";</script>
<!-- When printed, top level section headers should force page breaks -->
<style>.md h1, .md .nonumberh1 {page-break-before:always}</style>
<!-- Markdeep: -->
<style class="fallback">body{visibility:hidden;white-space:pre;font-family:monospace}</style><script src="markdeep.min.js" charset="utf-8"></script><script src="https://casual-effects.com/markdeep/latest/markdeep.min.js" charset="utf-8"></script><script>window.alreadyProcessedMarkdeep||(document.body.style.visibility="visible")</script>