Add dynamic knobs design doc
Co-authored-by: Markus Pilman <markus.pilman@snowflake.com> Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
Parent: 05e463f79f
Commit: d710c1fce5
# Dynamic Knobs

This document is largely adapted from original design documents by Markus
Pilman and Trevor Clinkenbeard.

## Background

FoundationDB parameters control the behavior of the database, including whether
certain features are available and the value of internal constants. Parameters
are referred to as knobs for the remainder of this document. Currently,
these knobs are configured through arguments passed to `fdbserver` processes,
often controlled by `fdbmonitor`. This approach has several problems:

1. Updating knobs involves updating `foundationdb.conf` files on each host in a
   cluster. This carries significant overhead and typically requires external
   tooling for large-scale changes.
2. All knob changes require a process restart.
3. We can't easily track the history of knob changes.

## Overview

The dynamic knobs project creates a strictly serializable quorum-based
configuration database stored on the coordinators. Each `fdbserver` process
specifies a configuration path and applies knob overrides from the
configuration database for its specified classes.

### Caveats

The configuration database explicitly does not support the following:

1. A high load. The update rate, while not specified, should be relatively low.
2. A large amount of data. The database is meant to be relatively small (under
   one megabyte). Data is not sharded and every coordinator stores a complete
   copy.
3. Concurrent writes. At most one write can succeed at a time, and clients must
   retry their failed writes.

## Design

### Configuration Path

Each `fdbserver` process can now include a `--config_path` argument specifying
its configuration path. A configuration path is a hierarchical list of
configuration classes specifying which knob overrides the `fdbserver` process
should apply from the configuration database. For example:

```bash
$ fdbserver --config_path classA/classB/classC ...
```

Knob overrides are applied in descending order of priority:

1. Manually specified command line knobs.
2. Individual configuration class overrides.
   * Subdirectories override parent directories. For example, if the
     configuration path is `az-1/storage/gp3`, the `gp3` configuration takes
     priority over the `storage` configuration, which takes priority over the
     `az-1` configuration.
3. Global configuration knobs.
4. Default knob values.

#### Example

For example, imagine an `fdbserver` process run as follows:

```bash
$ fdbserver --datadir /mnt/fdb/storage/4500 --logdir /var/log/foundationdb --public_address auto:4500 --config_path az-1/storage/gp3 --knob_disable_asserts false
```

And the configuration database contains:

| ConfigClass | KnobName            | KnobValue |
|-------------|---------------------|-----------|
| az-2        | page_cache_4k       | 8e9       |
| storage     | min_trace_severity  | 20        |
| az-1        | compaction_interval | 280       |
| storage     | compaction_interval | 350       |
| az-1        | disable_asserts     | true      |
| \<global\>  | max_metric_size     | 5000      |
| gp3         | max_metric_size     | 1000      |

The final configuration for the process will be:

| KnobName            | KnobValue   | Explanation |
|---------------------|-------------|-------------|
| page_cache_4k       | \<default\> | The configuration database knob override for `az-2` is ignored, so the compiled default is used |
| min_trace_severity  | 20          | Because the `storage` configuration class is part of the process’s configuration path, the corresponding knob override is applied from the configuration database |
| compaction_interval | 350         | The `storage` knob override takes precedence over the `az-1` knob override |
| disable_asserts     | false       | This knob is manually overridden, so all other overrides are ignored |
| max_metric_size     | 1000        | Knob overrides for specific configuration classes take precedence over global knob overrides, so the global override is ignored |
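The priority rules can be made concrete with a small resolver. The following is a minimal sketch, not FoundationDB code: `resolve_knobs`, the `GLOBAL` marker, and the `"<default>"` placeholder values are hypothetical names chosen for illustration. It layers the overrides from lowest to highest priority, so later layers win, and it reproduces the table above.

```python
# Hypothetical stand-in for the <global> configuration class.
GLOBAL = None


def resolve_knobs(config_path, manual_knobs, database, defaults):
    """Return the effective knob values for one process.

    Layers are applied from lowest to highest priority, so each later
    update overwrites the earlier ones.
    """
    effective = dict(defaults)                    # 4. default knob values
    effective.update(database.get(GLOBAL, {}))    # 3. global configuration knobs
    for config_class in config_path.split("/"):   # 2. parents first, so subdirectories win
        effective.update(database.get(config_class, {}))
    effective.update(manual_knobs)                # 1. command line knobs win over everything
    return effective


# Contents of the configuration database from the example above.
database = {
    "az-2": {"page_cache_4k": "8e9"},
    "storage": {"min_trace_severity": "20", "compaction_interval": "350"},
    "az-1": {"compaction_interval": "280", "disable_asserts": "true"},
    GLOBAL: {"max_metric_size": "5000"},
    "gp3": {"max_metric_size": "1000"},
}

# Compiled defaults are not listed in this document, so a placeholder is used.
defaults = dict.fromkeys(
    ["page_cache_4k", "min_trace_severity", "compaction_interval",
     "disable_asserts", "max_metric_size"],
    "<default>",
)

effective = resolve_knobs(
    "az-1/storage/gp3", {"disable_asserts": "false"}, database, defaults
)
for name, value in effective.items():
    print(name, value)
```

Applying the layers in reverse priority order keeps the function short: there is no need to compare priorities explicitly, because the last write to each knob is the one that survives.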

### Clients

Clients can write to the configuration database using transactions.
Configuration database transactions are distinguished from regular
transactions by setting the `USE_CONFIG_DATABASE` database option.

In configuration transactions, the client uses the tuple layer to interact with
the configuration database. Keys are tuples of size two, where the first item
is the configuration class being written, and the second item is the knob name.
The value should be specified as a string. It will be converted to the
appropriate type based on the declared type of the knob being set.

Below is a sample Python script to write to the configuration database.

```python
import fdb

fdb.api_version(720)

# The Python 3 bindings require keys and values to be bytes, so the special
# description key and all knob values are written as byte strings.

@fdb.transactional
def set_knob(tr, knob_name, knob_value, config_class, description):
    # A text description of the change, stored with the commit.
    tr[b'\xff\xff/description'] = description
    # The key is the tuple (configuration class, knob name). None is used for
    # the global configuration class.
    tr[fdb.tuple.pack((config_class, knob_name,))] = knob_value

# This function performs two knob changes transactionally.
@fdb.transactional
def set_multiple_knobs(tr):
    tr[b'\xff\xff/description'] = b'description'
    tr[fdb.tuple.pack((None, 'min_trace_severity',))] = b'10'
    tr[fdb.tuple.pack(('az-1', 'min_trace_severity',))] = b'20'

db = fdb.open()
db.options.set_use_config_database()

set_knob(db, 'min_trace_severity', b'10', None, b'description')
set_knob(db, 'min_trace_severity', b'20', 'az-1', b'description')
```

### Disable the Configuration Database

The configuration database includes both client and server changes and is
enabled by default. Thus, to disable the configuration database, changes must
be made to both.

#### Server

The configuration database can be disabled by specifying the ``fdbserver``
command line option ``--no-config-db``. Note that this option must be specified
for *every* ``fdbserver`` process.

#### Client

The only client-side change introduced by the configuration database affects
the change coordinators command. The change coordinators command is not
considered successful until the configuration database is readable on the new
coordinators. As a result, the command will hang if run against a database with
dynamic knobs disabled. To disable the client-side configuration database
liveness check, specify the ``--no-config-db`` flag when changing coordinators.
For example:

```
fdbcli> coordinators auto --no-config-db
```

## Status

The current state of the configuration database is output as part of
`status json`. The configuration path for each process can be determined from
the ``command_line`` key associated with each process.

Sample from ``status json``:

```
"configuration_database" : {
    "commits" : [
        {
            "description" : "set some knobs",
            "timestamp" : 1659570000,
            "version" : 1
        },
        {
            "description" : "make some other changes",
            "timestamp" : 1659570000,
            "version" : 2
        }
    ],
    "last_compacted_version" : 0,
    "most_recent_version" : 2,
    "mutations" : [
        {
            "config_class" : "<global>",
            "knob_name" : "min_trace_severity",
            "knob_value" : "int:5",
            "type" : "set",
            "version" : 1
        },
        {
            "config_class" : "<global>",
            "knob_name" : "compaction_interval",
            "knob_value" : "double:30.000000",
            "type" : "set",
            "version" : 1
        },
        {
            "config_class" : "az-1",
            "knob_name" : "compaction_interval",
            "knob_value" : "double:60.000000",
            "type" : "set",
            "version" : 1
        },
        {
            "config_class" : "<global>",
            "knob_name" : "compaction_interval",
            "type" : "clear",
            "version" : 2
        },
        {
            "config_class" : "<global>",
            "knob_name" : "update_node_timeout",
            "knob_value" : "double:4.000000",
            "type" : "set",
            "version" : 2
        }
    ],
    "snapshot" : {
        "<global>" : {
            "min_trace_severity" : "int:5",
            "update_node_timeout" : "double:4.000000"
        },
        "az-1" : {
            "compaction_interval" : "double:60.000000"
        }
    }
}
```

After compaction, ``status json`` would show:

```
"configuration_database" : {
    "commits" : [
    ],
    "last_compacted_version" : 2,
    "most_recent_version" : 2,
    "mutations" : [
    ],
    "snapshot" : {
        "<global>" : {
            "min_trace_severity" : "int:5",
            "update_node_timeout" : "double:4.000000"
        },
        "az-1" : {
            "compaction_interval" : "double:60.000000"
        }
    }
}
```
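To see how the `mutations` list relates to the `snapshot` object, the sample can be replayed by hand. The sketch below is illustrative only: `apply_mutations` and `parse_knob_value` are hypothetical helpers, not part of FoundationDB, and only the `int:` and `double:` value prefixes that appear in the sample are handled. Replaying the five sample mutations in version order on an empty database yields the snapshot shown in both outputs.

```python
def apply_mutations(snapshot, mutations):
    """Apply set/clear mutations, in version order, on top of a snapshot."""
    result = {cls: dict(knobs) for cls, knobs in snapshot.items()}
    # sorted() is stable, so mutations within one version keep their order.
    for m in sorted(mutations, key=lambda m: m["version"]):
        knobs = result.setdefault(m["config_class"], {})
        if m["type"] == "set":
            knobs[m["knob_name"]] = m["knob_value"]
        elif m["type"] == "clear":
            knobs.pop(m["knob_name"], None)
    # Drop configuration classes that no longer hold any knob.
    return {cls: knobs for cls, knobs in result.items() if knobs}


def parse_knob_value(text):
    """Decode the "type:value" strings seen in the sample output."""
    kind, _, raw = text.partition(":")
    if kind == "int":
        return int(raw)
    if kind == "double":
        return float(raw)
    return text  # prefixes not shown in the sample are left untouched


# The mutations from the first sample above.
mutations = [
    {"config_class": "<global>", "knob_name": "min_trace_severity",
     "knob_value": "int:5", "type": "set", "version": 1},
    {"config_class": "<global>", "knob_name": "compaction_interval",
     "knob_value": "double:30.000000", "type": "set", "version": 1},
    {"config_class": "az-1", "knob_name": "compaction_interval",
     "knob_value": "double:60.000000", "type": "set", "version": 1},
    {"config_class": "<global>", "knob_name": "compaction_interval",
     "type": "clear", "version": 2},
    {"config_class": "<global>", "knob_name": "update_node_timeout",
     "knob_value": "double:4.000000", "type": "set", "version": 2},
]

rebuilt = apply_mutations({}, mutations)
print(rebuilt)
```

The global `compaction_interval` is set at version 1 and cleared at version 2, which is why it is absent from the snapshot while the `az-1` override remains.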

## Detailed Implementation

The configuration database is implemented as a replicated state machine living
on the coordinators. This allows configuration database transactions to
continue to function in the event of a catastrophic loss of the transaction
subsystem.

To commit a transaction, clients run the two-phase Paxos protocol. First, the
client asks for a live version from a quorum of coordinators. When a
coordinator receives a request for its live version, it increments its local
live version by one and returns it to the client. Then, the client submits its
writes at the live version it received in the previous step. A coordinator will
accept the commit only if it is still on the same live version. If a majority
of coordinators accept the commit, it is considered committed.
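The two phases described above can be simulated in a few lines. This is a toy model under stated assumptions, not the real implementation: the `Coordinator` class, `request_live_version`, and `submit` are hypothetical names, all coordinators are reachable and in lockstep, and durability, failures, and retries are ignored. It shows why at most one of two interleaved writers can succeed.

```python
from collections import Counter


class Coordinator:
    """Toy stand-in for a coordinator; not the real ConfigNode."""

    def __init__(self):
        self.live_version = 0
        self.log = []  # accepted (version, writes) pairs

    def get_live_version(self):
        # Phase one: bump the local live version and hand it out.
        self.live_version += 1
        return self.live_version

    def commit(self, version, writes):
        # Phase two: accept only if nobody else has bumped the version since.
        if version != self.live_version:
            return False
        self.log.append((version, writes))
        return True


def majority(n):
    return n // 2 + 1


def request_live_version(coordinators):
    """Return the live version a majority agrees on, or None."""
    replies = Counter(c.get_live_version() for c in coordinators)
    version, count = replies.most_common(1)[0]
    return version if count >= majority(len(coordinators)) else None


def submit(coordinators, version, writes):
    """Return True if a majority of coordinators accept the commit."""
    accepted = sum(c.commit(version, writes) for c in coordinators)
    return accepted >= majority(len(coordinators))


coordinators = [Coordinator() for _ in range(3)]
version_a = request_live_version(coordinators)  # client A, phase one
version_b = request_live_version(coordinators)  # client B interleaves
a_committed = submit(coordinators, version_a, {"knob": "1"})
b_committed = submit(coordinators, version_b, {"knob": "2"})
print(version_a, a_committed, version_b, b_committed)
```

Client B's phase one invalidates the live version client A obtained, so A's commit is rejected everywhere and A has to start over, matching the "concurrent writes" caveat earlier in this document.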

### Coordinator

Each coordinator runs a ``ConfigNode`` which serves as a replica storing one
full copy of the configuration database. Coordinators never communicate with
other coordinators while processing configuration database transactions.
Instead, the client runs the transaction and determines when it has quorum
agreement.

Coordinators serve the following ``ConfigTransactionInterface`` to allow
clients to read from and write to the configuration database.

#### ``ConfigTransactionInterface``

| Request | Request fields | Reply fields | Explanation |
|---------|----------------|--------------|-------------|
| GetGeneration | (coordinatorsHash) | (generation) or (coordinators_changed error) | Get a new read version. This read version is used for all future requests in the transaction |
| Get | (configuration class, knob name, coordinatorsHash, generation) | (knob value or empty) or (coordinators_changed error) or (transaction_too_old error) | Returns the current value stored at the specified configuration class and knob name, or empty if no value exists |
| GetConfigClasses | (coordinatorsHash, generation) | (configuration classes) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all configuration classes stored in the configuration database |
| GetKnobs | (configuration class, coordinatorsHash, generation) | (knob names) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all knob names stored for the provided configuration class |
| Commit | (mutation list, coordinatorsHash, generation) | ack or (coordinators_changed error) or (commit_unknown_result error) or (not_committed error) | Commit mutations set by the transaction |

Coordinators also serve the following ``ConfigFollowerInterface`` to provide
access to (and modification of) their current state. Most interaction through
this interface is done by the cluster controller through its
``IConfigConsumer`` implementation living on the ``ConfigBroadcaster``.

#### ``ConfigFollowerInterface``

| Request | Request fields | Reply fields | Explanation |
|---------|----------------|--------------|-------------|
| GetChanges | (lastSeenVersion, mostRecentVersion) | (mutation list, version) or (version_already_compacted error) or (process_behind error) | Request changes since the last seen version, receive a new most recent version, as well as recent mutations |
| GetSnapshotAndChanges | (mostRecentVersion) | (snapshot, snapshotVersion, changes) | Request the full configuration database, in the form of a base snapshot and changes to apply on top of the snapshot |
| Compact | (version) | ack | Compact mutations up to the provided version |
| Rollforward | (rollbackTo, lastKnownCommitted, target, changes, specialZeroQuorum) | ack or (version_already_compacted error) or (transaction_too_old error) | Rollback/rollforward mutations on a node to catch it up with the majority |
| GetCommittedVersion | () | (registered, lastCompacted, lastLive, lastCommitted) | Request version information from a ``ConfigNode`` |
| Lock | (coordinatorsHash) | ack | Lock a ``ConfigNode`` to prevent it from serving requests during a coordinator change |

### Cluster Controller

The cluster controller runs a singleton ``ConfigBroadcaster`` which is
responsible for periodically polling the ``ConfigNode``s for updates, then
broadcasting these updates to workers through the ``ConfigBroadcastInterface``.
When workers join the cluster, they register themselves and their
``ConfigBroadcastInterface`` with the broadcaster. The broadcaster then pushes
new updates to registered workers.

The ``ConfigBroadcastInterface`` is also used by ``ConfigNode``s to register
with the ``ConfigBroadcaster``. ``ConfigNode``s need to register with the
broadcaster because the broadcaster decides when a ``ConfigNode`` may begin
serving requests, based on global information about the status of the other
``ConfigNode``s. For example, if a system with three ``ConfigNode``s suffers a
fault where one ``ConfigNode`` loses data, the faulty ``ConfigNode`` should
not be allowed to begin serving requests again until it has been rolled forward
and is up to date with the latest state of the configuration database.

#### ``ConfigBroadcastInterface``

| Request | Request fields | Reply fields | Explanation |
|---------|----------------|--------------|-------------|
| Snapshot | (snapshot, version, restartDelay) | ack | A snapshot of the configuration database sent by the broadcaster to workers |
| Changes | (changes, mostRecentVersion, restartDelay) | ack | A list of changes up to and including mostRecentVersion, sent by the broadcaster to workers |
| Registered | () | (registered, lastSeenVersion) | Sent by the broadcaster to new ``ConfigNode``s to determine their registration status |
| Ready | (snapshot, snapshotVersion, liveVersion, coordinatorsHash) | ack | Sent by the broadcaster to new ``ConfigNode``s to allow them to start serving requests |

### Worker

Each worker runs a ``LocalConfiguration`` instance which receives and applies
knob updates from the ``ConfigBroadcaster``. The local configuration maintains
a durable ``KeyValueStoreMemory`` containing the following:

* The latest known configuration version
* The most recently used configuration path
* All knob overrides corresponding to the configuration path at the latest known version

Once a worker starts, it will:

* Apply manually set knobs
* Read its local configuration file
  * If the stored configuration path does not match the configuration path
    specified on the command line, delete the local configuration file
  * Otherwise, apply knob updates from the local configuration file. Manually
    specified knobs will not be overridden
* Register with the broadcaster to receive new updates for its configuration
  classes
  * Persist these updates when received and restart if necessary
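The startup steps can be sketched as a single function. This is a simplified illustration, not the ``LocalConfiguration`` implementation: `start_worker` and the dictionary layout of the stored state are hypothetical, and resetting the version to `0` after a path change is an assumption of the sketch. Registration with the broadcaster and persistence of later updates are not modeled.

```python
def start_worker(config_path, manual_knobs, stored):
    """Return (effective_knobs, local_store) for a starting worker.

    `stored` models the durable local configuration as a dict with the keys
    "path", "version" and "overrides", or None if nothing is stored yet.
    """
    if stored is not None and stored["path"] != config_path:
        # The configuration path changed: discard the local configuration.
        stored = None
    if stored is None:
        # Assumption: a fresh store starts at version 0 with no overrides.
        stored = {"path": config_path, "version": 0, "overrides": {}}
    # Applying manual knobs last has the same effect as applying them first
    # and never overriding them afterwards.
    knobs = dict(stored["overrides"])
    knobs.update(manual_knobs)
    return knobs, stored


stored = {
    "path": "az-1/storage/gp3",
    "version": 7,
    "overrides": {"max_metric_size": "1000", "disable_asserts": "true"},
}
manual = {"disable_asserts": "false"}

# Same path as last time: stored overrides are reused, manual knobs win.
same_knobs, same_store = start_worker("az-1/storage/gp3", manual, stored)

# Different path: the stored overrides are dropped.
moved_knobs, moved_store = start_worker("az-2/storage/gp3", manual, stored)

print(same_knobs, same_store["version"])
print(moved_knobs, moved_store["version"])
```

The version kept in the returned store is what a worker would report when registering, so that the broadcaster knows which updates it still needs.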

### Knob Atomicity

All knobs are classified as either atomic or non-atomic. Atomic knobs require a
process restart when changed, while non-atomic knobs do not.

### Compaction

``ConfigNode``s store individual mutations in order to be able to update other,
out-of-date ``ConfigNode``s without needing to send a full snapshot. Each
configuration database commit also contains additional metadata such as a
timestamp and a text description of the changes being made. To keep the size of
the configuration database manageable, a compaction process runs periodically
(defaulting to every five minutes) which compacts individual mutations into a
simplified snapshot of key-value pairs. Compaction is controlled by the
``ConfigBroadcaster``, using information it periodically requests from
``ConfigNode``s. Compaction will only compact up to the minimum known version
across *all* ``ConfigNode``s. This means that if one ``ConfigNode`` is
permanently partitioned from the ``ConfigBroadcaster`` or from clients, no
compaction will ever take place.
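The "minimum across all nodes" rule is small enough to show directly. The sketch below is illustrative: `compaction_target` and the node names are hypothetical, and the broadcaster is assumed to remember the last version each node reported.

```python
def compaction_target(last_known_versions):
    """Highest version that is safe to compact: the minimum over all nodes."""
    return min(last_known_versions.values())


# All nodes reachable: compaction can proceed up to the slowest node.
healthy = compaction_target({"node-1": 12, "node-2": 12, "node-3": 11})

# node-3 is partitioned and stuck at version 3. The other nodes keep
# advancing, but the compaction target does not move.
stalled_before = compaction_target({"node-1": 12, "node-2": 12, "node-3": 3})
stalled_after = compaction_target({"node-1": 40, "node-2": 40, "node-3": 3})

print(healthy, stalled_before, stalled_after)
```

Because the minimum is taken over every node rather than over a quorum, a single unreachable node pins the target at its last reported version.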

### Rollback / Rollforward

It is necessary to be able to roll ``ConfigNode``s backward and forward with
respect to their committed versions due to the nature of quorum logic and
unreliable networks.

Consider a case where a client commit gets persisted durably on one out of
three ``ConfigNode``s (assume commit messages to the other two nodes are lost).
Since the value is not committed on a majority of ``ConfigNode``s, it cannot be
considered committed. But it is also incorrect to have the value persist on one
out of three nodes as future commits are made. In this case, the most common
result is that the ``ConfigNode`` will be rolled back when the next commit from
a different client is made, and then rolled forward to contain the data from
the commit. ``PaxosConfigConsumer`` contains logic to recognize ``ConfigNode``
minorities and update them to match the quorum.
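One way to picture minority detection is to compare the committed version each node reports against the version held by a majority. The following is a hypothetical sketch and does not describe the actual ``PaxosConfigConsumer`` logic: `quorum_version` and `repairs_needed` are invented names, and comparing plain version numbers is a simplification.

```python
from collections import Counter


def quorum_version(committed_versions):
    """Return the committed version shared by a majority of nodes, or None."""
    counts = Counter(committed_versions.values())
    version, count = counts.most_common(1)[0]
    if count >= len(committed_versions) // 2 + 1:
        return version
    return None


def repairs_needed(committed_versions):
    """Map each minority node to the direction it must move first."""
    target = quorum_version(committed_versions)
    if target is None:
        return {}
    return {
        node: "rollback" if version > target else "rollforward"
        for node, version in committed_versions.items()
        if version != target
    }


# node-1 durably persisted a commit that never reached the other two nodes.
ahead = repairs_needed({"node-1": 6, "node-2": 5, "node-3": 5})

# node-3 missed a commit that the other two nodes accepted.
behind = repairs_needed({"node-1": 6, "node-2": 6, "node-3": 5})

print(ahead, behind)
```

In the first case the stray commit on `node-1` has to be undone before the node can receive the mutations the quorum actually agreed on, which is the rollback-then-rollforward sequence described above.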

### Changing Coordinators

Since the configuration database lives on the coordinators and the
[coordinators can be changed](https://apple.github.io/foundationdb/configuration.html#configuration-changing-coordination-servers),
it is necessary to copy the configuration database from the old to the new
coordinators during such an event. A coordinator change performs the following
steps with regard to the configuration database:

1. Write ``\xff/coordinatorsKey`` with the new coordinators string. The key
   ``\xff/previousCoordinators`` contains the current (old) set of
   coordinators.
2. Lock the old ``ConfigNode``s so they can no longer serve client requests.
3. Start a recovery, causing a new cluster controller (and therefore
   ``ConfigBroadcaster``) to be selected.
4. Read ``\xff/previousCoordinators`` on the ``ConfigBroadcaster`` and, if
   present, read an up-to-date snapshot of the configuration database on the
   old coordinators.
5. Determine if each registering ``ConfigNode`` needs an up-to-date snapshot of
   the configuration database sent to it, based on its reported version and the
   snapshot version of the database received from the old coordinators.
   * Some new coordinators which were also coordinators in the previous
     configuration may not need a snapshot.
6. Send ready requests to new ``ConfigNode``s, including an up-to-date snapshot
   if necessary. This allows the new coordinators to begin serving
   configuration database requests from clients.
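Step 5 can be illustrated with a small decision function. This is a sketch under assumptions, not the broadcaster's actual code: `plan_ready_requests` is a hypothetical name, a node that reports no version is modeled as `None`, and "needs a snapshot" is taken to mean "reports a version older than the snapshot version".

```python
def plan_ready_requests(snapshot_version, reported_versions):
    """Decide, per registering node, whether its ready request needs a snapshot.

    `reported_versions` maps a node name to the version it reports when
    registering, or None if it has no configuration data at all.
    """
    return {
        node: version is None or version < snapshot_version
        for node, version in reported_versions.items()
    }


# new-1 and new-2 are fresh coordinators; kept-1 was also a coordinator in the
# previous configuration and already holds the latest data.
plan = plan_ready_requests(9, {"new-1": None, "new-2": 0, "kept-1": 9})
print(plan)
```

Skipping the snapshot for nodes that are already current avoids resending the whole database to coordinators that were carried over from the previous configuration.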

## Testing

The ``ConfigDatabaseUnitTests`` class unit tests a number of different
dimensions of the configuration database.

The ``ConfigIncrement`` workload tests contention between clients attempting to
write to the configuration database, paired with machine failure and
coordinator changes.