foundationdb/design/dynamic-knobs.md

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

421 lines
21 KiB
Markdown
Raw Normal View History

# Dynamic Knobs
This document is largely adapted from original design documents by Markus
Pilman and Trevor Clinkenbeard.
## Background
FoundationDB parameters control the behavior of the database, including whether
certain features are available and the value of internal constants. Parameters
will be referred to as knobs for the remainder of this document. Currently,
these knobs are configured through arguments passed to `fdbserver` processes,
often controlled by `fdbmonitor`. This has a number of problems:
1. Updating knobs involves updating `foundationdb.conf` files on each host in a
cluster. This has a lot of overhead and typically requires external tooling
for large scale changes.
2. All knob changes require a process restart.
3. We can't easily track the history of knob changes.
## Overview
The dynamic knobs project creates a strictly serializable quorum-based
configuration database stored on the coordinators. Each `fdbserver` process
specifies a configuration path and applies knob overrides from the
configuration database for its specified classes.
### Caveats
The configuration database explicitly does not support the following:
1. A high load. The update rate, while not specified, should be relatively low.
2. A large amount of data. The database is meant to be relatively small (under
one megabyte). Data is not sharded and every coordinator stores a complete
copy.
3. Concurrent writes. At most one write can succeed at a time, and clients must
retry their failed writes.
## Design
### Configuration Path
Each `fdbserver` process can now include a `--config_path` argument specifying
its configuration path. A configuration path is a hierarchical list of
configuration classes specifying which knob overrides the `fdbserver` process
should apply from the configuration database. For example:
```bash
$ fdbserver --config_path classA/classB/classC ...
```
Knob overrides follow descending priority:
1. Manually specified command line knobs.
2. Individual configuration class overrides.
* Subdirectories override parent directories. For example, if the
configuration path is `az-1/storage/gp3`, the `gp3` configuration takes
priority over the `storage` configuration, which takes priority over the
`az-1` configuration.
3. Global configuration knobs.
4. Default knob values.
#### Example
For example, imagine an `fdbserver` process run as follows:
```bash
$ fdbserver --datadir /mnt/fdb/storage/4500 --logdir /var/log/foundationdb --public_address auto:4500 --config_path az-1/storage/gp3 --knob_disable_asserts false
```
And the configuration database contains:
| ConfigClass | KnobName | KnobValue |
|-------------|---------------------|-----------|
| az-2 | page_cache_4k | 8e9 |
| storage | min_trace_severity | 20 |
| az-1 | compaction_interval | 280 |
| storage | compaction_interval | 350 |
| az-1 | disable_asserts | true |
| \<global\> | max_metric_size | 5000 |
| gp3 | max_metric_size | 1000 |
The final configuration for the process will be:
| KnobName | KnobValue | Explanation |
|---------------------|-------------|-------------|
| page_cache_4k | \<default\> | The configuration database knob override for `az-2` is ignored, so the compiled default is used |
| min_trace_severity | 20 | Because the `storage` configuration class is part of the processs configuration path, the corresponding knob override is applied from the configuration database |
| compaction_interval | 350 | The `storage` knob override takes precedence over the `az-1` knob override |
| disable_asserts | false | This knob is manually overridden, so all other overrides are ignored |
| max_metric_size | 1000 | Knob overrides for specific configuration classes take precedence over global knob overrides, so the global override is ignored |
### Clients
Clients can write to the configuration database using transactions.
Configuration database transactions are differentiated from regular
transactions through specification of the `USE_CONFIG_DATABASE` database
option.
In configuration transactions, the client uses the tuple layer to interact with
the configuration database. Keys are tuples of size two, where the first item
is the configuration class being written, and the second item is the knob name.
The value should be specified as a string. It will be converted to the
appropriate type based on the declared type of the knob being set.
Below is a sample Python script to write to the configuration database.
```python
import fdb
fdb.api_version(720)
@fdb.transactional
def set_knob(tr, knob_name, knob_value, config_class, description):
tr['\xff\xff/description'] = description
tr[fdb.tuple.pack((config_class, knob_name,))] = knob_value
# This function performs two knob changes transactionally.
@fdb.transactional
def set_multiple_knobs(tr):
tr['\xff\xff/description'] = 'description'
tr[fdb.tuple.pack((None, 'min_trace_severity',))] = '10'
tr[fdb.tuple.pack(('az-1', 'min_trace_severity',))] = '20'
db = fdb.open()
db.options.set_use_config_database()
set_knob(db, 'min_trace_severity', '10', None, 'description')
set_knob(db, 'min_trace_severity', '20', 'az-1', 'description')
```
### Disable the Configuration Database
The configuration database includes both client and server changes and is
enabled by default. Thus, to disable the configuration database, changes must
be made to both.
#### Server
The configuration database can be disabled by specifying the ``fdbserver``
command line option ``--no-config-db``. Note that this option must be specified
for *every* ``fdbserver`` process.
#### Client
The only client change from the configuration database is as part of the change
coordinators command. The change coordinators command is not considered
successful until the configuration database is readable on the new
coordinators. This will cause the change coordinators command to hang if run
against a database with dynamic knobs disabled. To disable the client side
configuration database liveness check, specify the ``--no-config-db`` flag when
changing coordinators. For example:
```
fdbcli> coordinators auto --no-config-db
```
## Status
The current state of the configuration database is output as part of `status
json`. The configuration path for each process can be determined from the
``command_line`` key associated with each process.
Sample from ``status json``:
```
"configuration_database" : {
"commits" : [
{
"description" : "set some knobs",
"timestamp" : 1659570000,
"version" : 1
},
{
"description" : "make some other changes",
"timestamp" : 1659570000,
"version" : 2
}
],
"last_compacted_version" : 0,
"most_recent_version" : 2,
"mutations" : [
{
"config_class" : "<global>",
"knob_name" : "min_trace_severity",
"knob_value" : "int:5",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"knob_value" : "double:30.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "az-1",
"knob_name" : "compaction_interval",
"knob_value" : "double:60.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"type" : "clear",
"version" : 2
},
{
"config_class" : "<global>",
"knob_name" : "update_node_timeout",
"knob_value" : "double:4.000000",
"type" : "set",
"version" : 2
}
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
```
After compaction, ``status json`` would show:
```
"configuration_database" : {
"commits" : [
],
"last_compacted_version" : 2,
"most_recent_version" : 2,
"mutations" : [
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
```
## Detailed Implementation
The configuration database is implemented as a replicated state machine living
on the coordinators. This allows configuration database transactions to
continue to function in the event of a catastrophic loss of the transaction
subsystem.
To commit a transaction, clients run the two phase Paxos protocol. First, the
client asks for a live version from a quorum of coordinators. When a
coordinator receives a request for its live version, it increments its local
live version by one and returns it to the client. Then, the client submits its
writes at the live version it received in the previous step. A coordinator will
accept the commit if it is still on the same live version. If a majority of
coordinators accept the commit, it is considered committed.
### Coordinator
Each coordinator runs a ``ConfigNode`` which serves as a replica storing one
full copy of the configuration database. Coordinators never communicate with
other coordinators while processing configuration database transactions.
Instead, the client runs the transaction and determines when it has quorum
agreement.
Coordinators serve the following ``ConfigTransactionInterface`` to allow
clients to read from and write to the configuration database.
#### ``ConfigTransactionInterface``
| Request | Request fields | Reply fields | Explanation |
|------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| GetGeneration | (coordinatorsHash) | (generation) or (coordinators_changed error) | Get a new read version. This read version is used for all future requests in the transaction |
| Get | (configuration class, knob name, coordinatorsHash, generation) | (knob value or empty) or (coordinators_changed error) or (transaction_too_old error) | Returns the current value stored at the specified configuration class and knob name, or empty if no value exists |
| GetConfigClasses | (coordinatorsHash, generation) | (configuration classes) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all configuration classes stored in the configuration database |
| GetKnobs | (configuration class, coordinatorsHash, generation) | (knob names) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all knob names stored for the provided configuration class |
| Commit | (mutation list, coordinatorsHash, generation) | ack or (coordinators_changed error) or (commit_unknown_result error) or (not_committed error) | Commit mutations set by the transaction |
Coordinators also serve the following ``ConfigFollowerInterface`` to provide
access to (and modification of) their current state. Most interaction through
this interface is done by the cluster controller through its
``IConfigConsumer`` implementation living on the ``ConfigBroadcaster``.
#### ``ConfigFollowerInterface``
| Request | Request fields | Reply fields | Explanation |
|-----------------------|----------------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| GetChanges | (lastSeenVersion, mostRecentVersion) | (mutation list, version) or (version_already_compacted error) or (process_behind error) | Request changes since the last seen version, receive a new most recent version, as well as recent mutations |
| GetSnapshotAndChanges | (mostRecentVersion) | (snapshot, snapshotVersion, changes) | Request the full configuration database, in the form of a base snapshot and changes to apply on top of the snapshot |
| Compact | (version) | ack | Compact mutations up to the provided version |
| Rollforward | (rollbackTo, lastKnownCommitted, target, changes, specialZeroQuorum) | ack or (version_already_compacted error) or (transaction_too_old error) | Rollback/rollforward mutations on a node to catch it up with the majority |
| GetCommittedVersion | () | (registered, lastCompacted, lastLive, lastCommitted) | Request version information from a ``ConfigNode`` |
| Lock | (coordinatorsHash) | ack | Lock a ``ConfigNode`` to prevent it from serving requests during a coordinator change |
### Cluster Controller
The cluster controller runs a singleton ``ConfigBroadcaster`` which is
responsible for periodically polling the ``ConfigNode``s for updates, then
broadcasting these updates to workers through the ``ConfigBroadcastInterface``.
When workers join the cluster, they register themselves and their
``ConfigBroadcastInterface`` with the broadcaster. The broadcaster then pushes
new updates to registered workers.
The ``ConfigBroadcastInterface`` is also used by ``ConfigNode``s to register
with the ``ConfigBroadcaster``. ``ConfigNode``s need to register with the
broadcaster because the broadcaster decides when the ``ConfigNode`` may begin
serving requests, based on global information about status of other
``ConfigNode``s. For example, if a system with three ``ConfigNode``s suffers a
fault where one ``ConfigNode`` loses data, the faulty ``ConfigNode`` should
not be allowed to begin serving requests again until it has been rolled forward
and is up to date with the latest state of the configuration database.
#### ``ConfigBroadcastInterface``
| Request | Request fields | Reply fields | Explanation |
|------------|------------------------------------------------------------|-------------------------------|---------------------------------------------------------------------------------------------|
| Snapshot | (snapshot, version, restartDelay) | ack | A snapshot of the configuration database sent by the broadcaster to workers |
| Changes | (changes, mostRecentVersion, restartDelay) | ack | A list of changes up to and including mostRecentVersion, sent by the broadcaster to workers |
| Registered | () | (registered, lastSeenVersion) | Sent by the broadcaster to new ``ConfigNode``s to determine their registration status |
| Ready | (snapshot, snapshotVersion, liveVersion, coordinatorsHash) | ack | Sent by the broadcaster to new ``ConfigNode``s to allow them to start serving requests |
### Worker
Each worker runs a ``LocalConfiguration`` instance which receives and applies
knob updates from the ``ConfigBroadcaster``. The local configuration maintains
a durable ``KeyValueStoreMemory`` containing the following:
* The latest known configuration version
* The most recently used configuration path
* All knob overrides corresponding to the configuration path at the latest known version
Once a worker starts, it will:
* Apply manually set knobs
* Read its local configuration file
* If the stored configuration path does not match the configuration path
specified on the command line, delete the local configuration file
* Otherwise, apply knob updates from the local configuration file. Manually
specified knobs will not be overridden
* Register with the broadcaster to receive new updates for its configuration
classes
* Persist these updates when received and restart if necessary
### Knob Atomicity
All knobs are classified as either atomic or non-atomic. Atomic knobs require a
process restart when changed, while non-atomic knobs do not.
### Compaction
``ConfigNode``s store individual mutations in order to be able to update other,
out of date ``ConfigNode``s without needing to send a full snapshot. Each
configuration database commit also contains additional metadata such as a
timestamp and a text description of the changes being made. To keep the size of
the configuration database manageable, a compaction process runs periodically
(defaulting to every five minutes) which compacts individual mutations into a
simplified snapshot of key-value pairs. Compaction is controlled by the
``ConfigBroadcaster``, using information it peridiodically requests from
``ConfigNode``s. Compaction will only compact up to the minimum known version
across *all* ``ConfigNode``s. This means that if one ``ConfigNode`` is
permanently partitioned from the ``ConfigBroadcaster`` or from clients, no
compaction will ever take place.
### Rollback / Rollforward
It is necessary to be able to roll ``ConfigNode``s backward and forward with
respect to their committed versions due to the nature of quorum logic and
unreliable networks.
Consider a case where a client commit gets persisted durably on one out of
three ``ConfigNode``s (assume commit messages to the other two nodes are lost).
Since the value is not committed on a majority of ``ConfigNode``s, it cannot be
considered committed. But it is also incorrect to have the value persist on one
out of three nodes as future commits are made. In this case, the most common
result is that the ``ConfigNode`` will be rolled back when the next commit from
a different client is made, and then rolled forward to contain the data from
the commit. ``PaxosConfigConsumer`` contains logic to recognize ``ConfigNode``
minorities and update them to match the quorum.
### Changing Coordinators
Since the configuration database lives on the coordinators and the
[coordinators can be
changed](https://apple.github.io/foundationdb/configuration.html#configuration-changing-coordination-servers),
it is necessary to copy the configuration database from the old to the new
coordinators during such an event. A coordinator change performs the following
steps in regards to the configuration database:
1. Write ``\xff/coordinatorsKey`` with the new coordinators string. The key
``\xff/previousCoordinators`` contains the current (old) set of
coordinators.
2. Lock the old ``ConfigNode``s so they can no longer serve client requests.
3. Start a recovery, causing a new cluster controller (and therefore
``ConfigBroadcaster``) to be selected.
4. Read ``\xff/previousCoordinators`` on the ``ConfigBroadcaster`` and, if
present, read an up-to-date snapshot of the configuration database on the
old coordinators.
5. Determine if each registering ``ConfigNode`` needs an up-to-date snapshot of
the configuration database sent to it, based on its reported version and the
snapshot version of the database received from the old coordinators.
* Some new coordinators which were also coordinators in the previous
configuration may not need a snapshot.
6. Send ready requests to new ``ConfigNode``s, including an up-to-date snapshot
if necessary. This allows the new coordinators to begin serving
configuration database requests from clients.
## Testing
The ``ConfigDatabaseUnitTests`` class unit test a number of different
configuration database dimensions.
The ``ConfigIncrement`` workload tests contention between clients attempting to
write to the configuration database, paired with machine failure and
coordinator changes.