421 lines
21 KiB
Markdown
421 lines
21 KiB
Markdown
|
# Dynamic Knobs
|
|||
|
|
|||
|
This document is largely adapted from original design documents by Markus
|
|||
|
Pilman and Trevor Clinkenbeard.
|
|||
|
|
|||
|
## Background
|
|||
|
|
|||
|
FoundationDB parameters control the behavior of the database, including whether
|
|||
|
certain features are available and the value of internal constants. Parameters
|
|||
|
will be referred to as knobs for the remainder of this document. Currently,
|
|||
|
these knobs are configured through arguments passed to `fdbserver` processes,
|
|||
|
often controlled by `fdbmonitor`. This has a number of problems:
|
|||
|
|
|||
|
1. Updating knobs involves updating `foundationdb.conf` files on each host in a
|
|||
|
cluster. This has a lot of overhead and typically requires external tooling
|
|||
|
for large scale changes.
|
|||
|
2. All knob changes require a process restart.
|
|||
|
3. We can't easily track the history of knob changes.
|
|||
|
|
|||
|
## Overview
|
|||
|
|
|||
|
The dynamic knobs project creates a strictly serializable quorum-based
|
|||
|
configuration database stored on the coordinators. Each `fdbserver` process
|
|||
|
specifies a configuration path and applies knob overrides from the
|
|||
|
configuration database for its specified classes.
|
|||
|
|
|||
|
### Caveats
|
|||
|
|
|||
|
The configuration database explicitly does not support the following:
|
|||
|
|
|||
|
1. A high load. The update rate, while not specified, should be relatively low.
|
|||
|
2. A large amount of data. The database is meant to be relatively small (under
|
|||
|
one megabyte). Data is not sharded and every coordinator stores a complete
|
|||
|
copy.
|
|||
|
3. Concurrent writes. At most one write can succeed at a time, and clients must
|
|||
|
retry their failed writes.
|
|||
|
|
|||
|
## Design
|
|||
|
|
|||
|
### Configuration Path
|
|||
|
|
|||
|
Each `fdbserver` process can now include a `--config_path` argument specifying
|
|||
|
its configuration path. A configuration path is a hierarchical list of
|
|||
|
configuration classes specifying which knob overrides the `fdbserver` process
|
|||
|
should apply from the configuration database. For example:
|
|||
|
|
|||
|
```bash
|
|||
|
$ fdbserver --config_path classA/classB/classC ...
|
|||
|
```
|
|||
|
|
|||
|
Knob overrides follow descending priority:
|
|||
|
|
|||
|
1. Manually specified command line knobs.
|
|||
|
2. Individual configuration class overrides.
|
|||
|
* Subdirectories override parent directories. For example, if the
|
|||
|
configuration path is `az-1/storage/gp3`, the `gp3` configuration takes
|
|||
|
priority over the `storage` configuration, which takes priority over the
|
|||
|
`az-1` configuration.
|
|||
|
3. Global configuration knobs.
|
|||
|
4. Default knob values.
|
|||
|
|
|||
|
#### Example
|
|||
|
|
|||
|
For example, imagine an `fdbserver` process run as follows:
|
|||
|
|
|||
|
```bash
|
|||
|
$ fdbserver --datadir /mnt/fdb/storage/4500 --logdir /var/log/foundationdb --public_address auto:4500 --config_path az-1/storage/gp3 --knob_disable_asserts false
|
|||
|
```
|
|||
|
|
|||
|
And the configuration database contains:
|
|||
|
|
|||
|
| ConfigClass | KnobName | KnobValue |
|
|||
|
|-------------|---------------------|-----------|
|
|||
|
| az-2 | page_cache_4k | 8e9 |
|
|||
|
| storage | min_trace_severity | 20 |
|
|||
|
| az-1 | compaction_interval | 280 |
|
|||
|
| storage | compaction_interval | 350 |
|
|||
|
| az-1 | disable_asserts | true |
|
|||
|
| \<global\> | max_metric_size | 5000 |
|
|||
|
| gp3 | max_metric_size | 1000 |
|
|||
|
|
|||
|
The final configuration for the process will be:
|
|||
|
|
|||
|
| KnobName | KnobValue | Explanation |
|
|||
|
|---------------------|-------------|-------------|
|
|||
|
| page_cache_4k | \<default\> | The configuration database knob override for `az-2` is ignored, so the compiled default is used |
|
|||
|
| min_trace_severity | 20 | Because the `storage` configuration class is part of the process’s configuration path, the corresponding knob override is applied from the configuration database |
|
|||
|
| compaction_interval | 350 | The `storage` knob override takes precedence over the `az-1` knob override |
|
|||
|
| disable_asserts | false | This knob is manually overridden, so all other overrides are ignored |
|
|||
|
| max_metric_size | 1000 | Knob overrides for specific configuration classes take precedence over global knob overrides, so the global override is ignored |
|
|||
|
|
|||
|
### Clients
|
|||
|
|
|||
|
Clients can write to the configuration database using transactions.
|
|||
|
Configuration database transactions are differentiated from regular
|
|||
|
transactions through specification of the `USE_CONFIG_DATABASE` database
|
|||
|
option.
|
|||
|
|
|||
|
In configuration transactions, the client uses the tuple layer to interact with
|
|||
|
the configuration database. Keys are tuples of size two, where the first item
|
|||
|
is the configuration class being written, and the second item is the knob name.
|
|||
|
The value should be specified as a string. It will be converted to the
|
|||
|
appropriate type based on the declared type of the knob being set.
|
|||
|
|
|||
|
Below is a sample Python script to write to the configuration database.
|
|||
|
|
|||
|
```python
|
|||
|
import fdb
|
|||
|
|
|||
|
fdb.api_version(720)
|
|||
|
|
|||
|
@fdb.transactional
|
|||
|
def set_knob(tr, knob_name, knob_value, config_class, description):
|
|||
|
tr['\xff\xff/description'] = description
|
|||
|
tr[fdb.tuple.pack((config_class, knob_name,))] = knob_value
|
|||
|
|
|||
|
# This function performs two knob changes transactionally.
|
|||
|
@fdb.transactional
|
|||
|
def set_multiple_knobs(tr):
|
|||
|
tr['\xff\xff/description'] = 'description'
|
|||
|
tr[fdb.tuple.pack((None, 'min_trace_severity',))] = '10'
|
|||
|
tr[fdb.tuple.pack(('az-1', 'min_trace_severity',))] = '20'
|
|||
|
|
|||
|
db = fdb.open()
|
|||
|
db.options.set_use_config_database()
|
|||
|
|
|||
|
set_knob(db, 'min_trace_severity', '10', None, 'description')
|
|||
|
set_knob(db, 'min_trace_severity', '20', 'az-1', 'description')
|
|||
|
```
|
|||
|
|
|||
|
### Disable the Configuration Database
|
|||
|
|
|||
|
The configuration database includes both client and server changes and is
|
|||
|
enabled by default. Thus, to disable the configuration database, changes must
|
|||
|
be made to both.
|
|||
|
|
|||
|
#### Server
|
|||
|
|
|||
|
The configuration database can be disabled by specifying the ``fdbserver``
|
|||
|
command line option ``--no-config-db``. Note that this option must be specified
|
|||
|
for *every* ``fdbserver`` process.
|
|||
|
|
|||
|
#### Client
|
|||
|
|
|||
|
The only client change from the configuration database is as part of the change
|
|||
|
coordinators command. The change coordinators command is not considered
|
|||
|
successful until the configuration database is readable on the new
|
|||
|
coordinators. This will cause the change coordinators command to hang if run
|
|||
|
against a database with dynamic knobs disabled. To disable the client side
|
|||
|
configuration database liveness check, specify the ``--no-config-db`` flag when
|
|||
|
changing coordinators. For example:
|
|||
|
|
|||
|
```
|
|||
|
fdbcli> coordinators auto --no-config-db
|
|||
|
```
|
|||
|
|
|||
|
## Status
|
|||
|
|
|||
|
The current state of the configuration database is output as part of `status
|
|||
|
json`. The configuration path for each process can be determined from the
|
|||
|
``command_line`` key associated with each process.
|
|||
|
|
|||
|
Sample from ``status json``:
|
|||
|
|
|||
|
```
|
|||
|
"configuration_database" : {
|
|||
|
"commits" : [
|
|||
|
{
|
|||
|
"description" : "set some knobs",
|
|||
|
"timestamp" : 1659570000,
|
|||
|
"version" : 1
|
|||
|
},
|
|||
|
{
|
|||
|
"description" : "make some other changes",
|
|||
|
"timestamp" : 1659570000,
|
|||
|
"version" : 2
|
|||
|
}
|
|||
|
],
|
|||
|
"last_compacted_version" : 0,
|
|||
|
"most_recent_version" : 2,
|
|||
|
"mutations" : [
|
|||
|
{
|
|||
|
"config_class" : "<global>",
|
|||
|
"knob_name" : "min_trace_severity",
|
|||
|
"knob_value" : "int:5",
|
|||
|
"type" : "set",
|
|||
|
"version" : 1
|
|||
|
},
|
|||
|
{
|
|||
|
"config_class" : "<global>",
|
|||
|
"knob_name" : "compaction_interval",
|
|||
|
"knob_value" : "double:30.000000",
|
|||
|
"type" : "set",
|
|||
|
"version" : 1
|
|||
|
},
|
|||
|
{
|
|||
|
"config_class" : "az-1",
|
|||
|
"knob_name" : "compaction_interval",
|
|||
|
"knob_value" : "double:60.000000",
|
|||
|
"type" : "set",
|
|||
|
"version" : 1
|
|||
|
},
|
|||
|
{
|
|||
|
"config_class" : "<global>",
|
|||
|
"knob_name" : "compaction_interval",
|
|||
|
"type" : "clear",
|
|||
|
"version" : 2
|
|||
|
},
|
|||
|
{
|
|||
|
"config_class" : "<global>",
|
|||
|
"knob_name" : "update_node_timeout",
|
|||
|
"knob_value" : "double:4.000000",
|
|||
|
"type" : "set",
|
|||
|
"version" : 2
|
|||
|
}
|
|||
|
],
|
|||
|
"snapshot" : {
|
|||
|
"<global>" : {
|
|||
|
"min_trace_severity" : "int:5",
|
|||
|
"update_node_timeout" : "double:4.000000"
|
|||
|
},
|
|||
|
"az-1" : {
|
|||
|
"compaction_interval" : "double:60.000000"
|
|||
|
}
|
|||
|
}
|
|||
|
}
|
|||
|
```
|
|||
|
|
|||
|
After compaction, ``status json`` would show:
|
|||
|
|
|||
|
```
|
|||
|
"configuration_database" : {
|
|||
|
"commits" : [
|
|||
|
],
|
|||
|
"last_compacted_version" : 2,
|
|||
|
"most_recent_version" : 2,
|
|||
|
"mutations" : [
|
|||
|
],
|
|||
|
"snapshot" : {
|
|||
|
"<global>" : {
|
|||
|
"min_trace_severity" : "int:5",
|
|||
|
"update_node_timeout" : "double:4.000000"
|
|||
|
},
|
|||
|
"az-1" : {
|
|||
|
"compaction_interval" : "double:60.000000"
|
|||
|
}
|
|||
|
}
|
|||
|
}
|
|||
|
```
|
|||
|
|
|||
|
## Detailed Implementation
|
|||
|
|
|||
|
The configuration database is implemented as a replicated state machine living
|
|||
|
on the coordinators. This allows configuration database transactions to
|
|||
|
continue to function in the event of a catastrophic loss of the transaction
|
|||
|
subsystem.
|
|||
|
|
|||
|
To commit a transaction, clients run the two phase Paxos protocol. First, the
|
|||
|
client asks for a live version from a quorum of coordinators. When a
|
|||
|
coordinator receives a request for its live version, it increments its local
|
|||
|
live version by one and returns it to the client. Then, the client submits its
|
|||
|
writes at the live version it received in the previous step. A coordinator will
|
|||
|
accept the commit if it is still on the same live version. If a majority of
|
|||
|
coordinators accept the commit, it is considered committed.
|
|||
|
|
|||
|
### Coordinator
|
|||
|
|
|||
|
Each coordinator runs a ``ConfigNode`` which serves as a replica storing one
|
|||
|
full copy of the configuration database. Coordinators never communicate with
|
|||
|
other coordinators while processing configuration database transactions.
|
|||
|
Instead, the client runs the transaction and determines when it has quorum
|
|||
|
agreement.
|
|||
|
|
|||
|
Coordinators serve the following ``ConfigTransactionInterface`` to allow
|
|||
|
clients to read from and write to the configuration database.
|
|||
|
|
|||
|
#### ``ConfigTransactionInterface``
|
|||
|
| Request | Request fields | Reply fields | Explanation |
|
|||
|
|------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
|
|||
|
| GetGeneration | (coordinatorsHash) | (generation) or (coordinators_changed error) | Get a new read version. This read version is used for all future requests in the transaction |
|
|||
|
| Get | (configuration class, knob name, coordinatorsHash, generation) | (knob value or empty) or (coordinators_changed error) or (transaction_too_old error) | Returns the current value stored at the specified configuration class and knob name, or empty if no value exists |
|
|||
|
| GetConfigClasses | (coordinatorsHash, generation) | (configuration classes) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all configuration classes stored in the configuration database |
|
|||
|
| GetKnobs | (configuration class, coordinatorsHash, generation) | (knob names) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all knob names stored for the provided configuration class |
|
|||
|
| Commit | (mutation list, coordinatorsHash, generation) | ack or (coordinators_changed error) or (commit_unknown_result error) or (not_committed error) | Commit mutations set by the transaction |
|
|||
|
|
|||
|
Coordinators also serve the following ``ConfigFollowerInterface`` to provide
|
|||
|
access to (and modification of) their current state. Most interaction through
|
|||
|
this interface is done by the cluster controller through its
|
|||
|
``IConfigConsumer`` implementation living on the ``ConfigBroadcaster``.
|
|||
|
|
|||
|
#### ``ConfigFollowerInterface``
|
|||
|
| Request | Request fields | Reply fields | Explanation |
|
|||
|
|-----------------------|----------------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
|
|||
|
| GetChanges | (lastSeenVersion, mostRecentVersion) | (mutation list, version) or (version_already_compacted error) or (process_behind error) | Request changes since the last seen version, receive a new most recent version, as well as recent mutations |
|
|||
|
| GetSnapshotAndChanges | (mostRecentVersion) | (snapshot, snapshotVersion, changes) | Request the full configuration database, in the form of a base snapshot and changes to apply on top of the snapshot |
|
|||
|
| Compact | (version) | ack | Compact mutations up to the provided version |
|
|||
|
| Rollforward | (rollbackTo, lastKnownCommitted, target, changes, specialZeroQuorum) | ack or (version_already_compacted error) or (transaction_too_old error) | Rollback/rollforward mutations on a node to catch it up with the majority |
|
|||
|
| GetCommittedVersion | () | (registered, lastCompacted, lastLive, lastCommitted) | Request version information from a ``ConfigNode`` |
|
|||
|
| Lock | (coordinatorsHash) | ack | Lock a ``ConfigNode`` to prevent it from serving requests during a coordinator change |
|
|||
|
|
|||
|
### Cluster Controller
|
|||
|
|
|||
|
The cluster controller runs a singleton ``ConfigBroadcaster`` which is
|
|||
|
responsible for periodically polling the ``ConfigNode``s for updates, then
|
|||
|
broadcasting these updates to workers through the ``ConfigBroadcastInterface``.
|
|||
|
When workers join the cluster, they register themselves and their
|
|||
|
``ConfigBroadcastInterface`` with the broadcaster. The broadcaster then pushes
|
|||
|
new updates to registered workers.
|
|||
|
|
|||
|
The ``ConfigBroadcastInterface`` is also used by ``ConfigNode``s to register
|
|||
|
with the ``ConfigBroadcaster``. ``ConfigNode``s need to register with the
|
|||
|
broadcaster because the broadcaster decides when the ``ConfigNode`` may begin
|
|||
|
serving requests, based on global information about status of other
|
|||
|
``ConfigNode``s. For example, if a system with three ``ConfigNode``s suffers a
|
|||
|
fault where one ``ConfigNode`` loses data, the faulty ``ConfigNode`` should
|
|||
|
not be allowed to begin serving requests again until it has been rolled forward
|
|||
|
and is up to date with the latest state of the configuration database.
|
|||
|
|
|||
|
#### ``ConfigBroadcastInterface``
|
|||
|
|
|||
|
| Request | Request fields | Reply fields | Explanation |
|
|||
|
|------------|------------------------------------------------------------|-------------------------------|---------------------------------------------------------------------------------------------|
|
|||
|
| Snapshot | (snapshot, version, restartDelay) | ack | A snapshot of the configuration database sent by the broadcaster to workers |
|
|||
|
| Changes | (changes, mostRecentVersion, restartDelay) | ack | A list of changes up to and including mostRecentVersion, sent by the broadcaster to workers |
|
|||
|
| Registered | () | (registered, lastSeenVersion) | Sent by the broadcaster to new ``ConfigNode``s to determine their registration status |
|
|||
|
| Ready | (snapshot, snapshotVersion, liveVersion, coordinatorsHash) | ack | Sent by the broadcaster to new ``ConfigNode``s to allow them to start serving requests |
|
|||
|
|
|||
|
### Worker
|
|||
|
|
|||
|
Each worker runs a ``LocalConfiguration`` instance which receives and applies
|
|||
|
knob updates from the ``ConfigBroadcaster``. The local configuration maintains
|
|||
|
a durable ``KeyValueStoreMemory`` containing the following:
|
|||
|
|
|||
|
* The latest known configuration version
|
|||
|
* The most recently used configuration path
|
|||
|
* All knob overrides corresponding to the configuration path at the latest known version
|
|||
|
|
|||
|
Once a worker starts, it will:
|
|||
|
|
|||
|
* Apply manually set knobs
|
|||
|
* Read its local configuration file
|
|||
|
* If the stored configuration path does not match the configuration path
|
|||
|
specified on the command line, delete the local configuration file
|
|||
|
* Otherwise, apply knob updates from the local configuration file. Manually
|
|||
|
specified knobs will not be overridden
|
|||
|
* Register with the broadcaster to receive new updates for its configuration
|
|||
|
classes
|
|||
|
* Persist these updates when received and restart if necessary
|
|||
|
|
|||
|
### Knob Atomicity
|
|||
|
|
|||
|
All knobs are classified as either atomic or non-atomic. Atomic knobs require a
|
|||
|
process restart when changed, while non-atomic knobs do not.
|
|||
|
|
|||
|
### Compaction
|
|||
|
|
|||
|
``ConfigNode``s store individual mutations in order to be able to update other,
|
|||
|
out of date ``ConfigNode``s without needing to send a full snapshot. Each
|
|||
|
configuration database commit also contains additional metadata such as a
|
|||
|
timestamp and a text description of the changes being made. To keep the size of
|
|||
|
the configuration database manageable, a compaction process runs periodically
|
|||
|
(defaulting to every five minutes) which compacts individual mutations into a
|
|||
|
simplified snapshot of key-value pairs. Compaction is controlled by the
|
|||
|
``ConfigBroadcaster``, using information it peridiodically requests from
|
|||
|
``ConfigNode``s. Compaction will only compact up to the minimum known version
|
|||
|
across *all* ``ConfigNode``s. This means that if one ``ConfigNode`` is
|
|||
|
permanently partitioned from the ``ConfigBroadcaster`` or from clients, no
|
|||
|
compaction will ever take place.
|
|||
|
|
|||
|
### Rollback / Rollforward
|
|||
|
|
|||
|
It is necessary to be able to roll ``ConfigNode``s backward and forward with
|
|||
|
respect to their committed versions due to the nature of quorum logic and
|
|||
|
unreliable networks.
|
|||
|
|
|||
|
Consider a case where a client commit gets persisted durably on one out of
|
|||
|
three ``ConfigNode``s (assume commit messages to the other two nodes are lost).
|
|||
|
Since the value is not committed on a majority of ``ConfigNode``s, it cannot be
|
|||
|
considered committed. But it is also incorrect to have the value persist on one
|
|||
|
out of three nodes as future commits are made. In this case, the most common
|
|||
|
result is that the ``ConfigNode`` will be rolled back when the next commit from
|
|||
|
a different client is made, and then rolled forward to contain the data from
|
|||
|
the commit. ``PaxosConfigConsumer`` contains logic to recognize ``ConfigNode``
|
|||
|
minorities and update them to match the quorum.
|
|||
|
|
|||
|
### Changing Coordinators
|
|||
|
|
|||
|
Since the configuration database lives on the coordinators and the
|
|||
|
[coordinators can be
|
|||
|
changed](https://apple.github.io/foundationdb/configuration.html#configuration-changing-coordination-servers),
|
|||
|
it is necessary to copy the configuration database from the old to the new
|
|||
|
coordinators during such an event. A coordinator change performs the following
|
|||
|
steps in regards to the configuration database:
|
|||
|
|
|||
|
1. Write ``\xff/coordinatorsKey`` with the new coordinators string. The key
|
|||
|
``\xff/previousCoordinators`` contains the current (old) set of
|
|||
|
coordinators.
|
|||
|
2. Lock the old ``ConfigNode``s so they can no longer serve client requests.
|
|||
|
3. Start a recovery, causing a new cluster controller (and therefore
|
|||
|
``ConfigBroadcaster``) to be selected.
|
|||
|
4. Read ``\xff/previousCoordinators`` on the ``ConfigBroadcaster`` and, if
|
|||
|
present, read an up-to-date snapshot of the configuration database on the
|
|||
|
old coordinators.
|
|||
|
5. Determine if each registering ``ConfigNode`` needs an up-to-date snapshot of
|
|||
|
the configuration database sent to it, based on its reported version and the
|
|||
|
snapshot version of the database received from the old coordinators.
|
|||
|
* Some new coordinators which were also coordinators in the previous
|
|||
|
configuration may not need a snapshot.
|
|||
|
6. Send ready requests to new ``ConfigNode``s, including an up-to-date snapshot
|
|||
|
if necessary. This allows the new coordinators to begin serving
|
|||
|
configuration database requests from clients.
|
|||
|
|
|||
|
## Testing
|
|||
|
|
|||
|
The ``ConfigDatabaseUnitTests`` class unit test a number of different
|
|||
|
configuration database dimensions.
|
|||
|
|
|||
|
The ``ConfigIncrement`` workload tests contention between clients attempting to
|
|||
|
write to the configuration database, paired with machine failure and
|
|||
|
coordinator changes.
|