21 KiB
Dynamic Knobs
This document is largely adapted from original design documents by Markus Pilman and Trevor Clinkenbeard.
Background
FoundationDB parameters control the behavior of the database, including whether
certain features are available and the value of internal constants. Parameters
will be referred to as knobs for the remainder of this document. Currently,
these knobs are configured through arguments passed to fdbserver
processes,
often controlled by fdbmonitor
. This has a number of problems:
- Updating knobs involves updating
foundationdb.conf
files on each host in a cluster. This has a lot of overhead and typically requires external tooling for large scale changes. - All knob changes require a process restart.
- We can't easily track the history of knob changes.
Overview
The dynamic knobs project creates a strictly serializable quorum-based
configuration database stored on the coordinators. Each fdbserver
process
specifies a configuration path and applies knob overrides from the
configuration database for its specified classes.
Caveats
The configuration database explicitly does not support the following:
- A high load. The update rate, while not specified, should be relatively low.
- A large amount of data. The database is meant to be relatively small (under one megabyte). Data is not sharded and every coordinator stores a complete copy.
- Concurrent writes. At most one write can succeed at a time, and clients must retry their failed writes.
Design
Configuration Path
Each fdbserver
process can now include a --config_path
argument specifying
its configuration path. A configuration path is a hierarchical list of
configuration classes specifying which knob overrides the fdbserver
process
should apply from the configuration database. For example:
$ fdbserver --config_path classA/classB/classC ...
Knob overrides follow descending priority:
- Manually specified command line knobs.
- Individual configuration class overrides.
- Subdirectories override parent directories. For example, if the
configuration path is
az-1/storage/gp3
, thegp3
configuration takes priority over thestorage
configuration, which takes priority over theaz-1
configuration.
- Global configuration knobs.
- Default knob values.
Example
For example, imagine an fdbserver
process run as follows:
$ fdbserver --datadir /mnt/fdb/storage/4500 --logdir /var/log/foundationdb --public_address auto:4500 --config_path az-1/storage/gp3 --knob_disable_asserts false
And the configuration database contains:
ConfigClass | KnobName | KnobValue |
---|---|---|
az-2 | page_cache_4k | 8e9 |
storage | min_trace_severity | 20 |
az-1 | compaction_interval | 280 |
storage | compaction_interval | 350 |
az-1 | disable_asserts | true |
<global> | max_metric_size | 5000 |
gp3 | max_metric_size | 1000 |
The final configuration for the process will be:
KnobName | KnobValue | Explanation |
---|---|---|
page_cache_4k | <default> | The configuration database knob override for az-2 is ignored, so the compiled default is used |
min_trace_severity | 20 | Because the storage configuration class is part of the process’s configuration path, the corresponding knob override is applied from the configuration database |
compaction_interval | 350 | The storage knob override takes precedence over the az-1 knob override |
disable_asserts | false | This knob is manually overridden, so all other overrides are ignored |
max_metric_size | 1000 | Knob overrides for specific configuration classes take precedence over global knob overrides, so the global override is ignored |
Clients
Clients can write to the configuration database using transactions.
Configuration database transactions are differentiated from regular
transactions through specification of the USE_CONFIG_DATABASE
database
option.
In configuration transactions, the client uses the tuple layer to interact with the configuration database. Keys are tuples of size two, where the first item is the configuration class being written, and the second item is the knob name. The value should be specified as a string. It will be converted to the appropriate type based on the declared type of the knob being set.
Below is a sample Python script to write to the configuration database.
import fdb
fdb.api_version(720)
@fdb.transactional
def set_knob(tr, knob_name, knob_value, config_class, description):
tr['\xff\xff/description'] = description
tr[fdb.tuple.pack((config_class, knob_name,))] = knob_value
# This function performs two knob changes transactionally.
@fdb.transactional
def set_multiple_knobs(tr):
tr['\xff\xff/description'] = 'description'
tr[fdb.tuple.pack((None, 'min_trace_severity',))] = '10'
tr[fdb.tuple.pack(('az-1', 'min_trace_severity',))] = '20'
db = fdb.open()
db.options.set_use_config_database()
set_knob(db, 'min_trace_severity', '10', None, 'description')
set_knob(db, 'min_trace_severity', '20', 'az-1', 'description')
Disable the Configuration Database
The configuration database includes both client and server changes and is enabled by default. Thus, to disable the configuration database, changes must be made to both.
Server
The configuration database can be disabled by specifying the fdbserver
command line option --no-config-db
. Note that this option must be specified
for every fdbserver
process.
Client
The only client change from the configuration database is as part of the change
coordinators command. The change coordinators command is not considered
successful until the configuration database is readable on the new
coordinators. This will cause the change coordinators command to hang if run
against a database with dynamic knobs disabled. To disable the client side
configuration database liveness check, specify the --no-config-db
flag when
changing coordinators. For example:
fdbcli> coordinators auto --no-config-db
Status
The current state of the configuration database is output as part of status json
. The configuration path for each process can be determined from the
command_line
key associated with each process.
Sample from status json
:
"configuration_database" : {
"commits" : [
{
"description" : "set some knobs",
"timestamp" : 1659570000,
"version" : 1
},
{
"description" : "make some other changes",
"timestamp" : 1659570000,
"version" : 2
}
],
"last_compacted_version" : 0,
"most_recent_version" : 2,
"mutations" : [
{
"config_class" : "<global>",
"knob_name" : "min_trace_severity",
"knob_value" : "int:5",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"knob_value" : "double:30.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "az-1",
"knob_name" : "compaction_interval",
"knob_value" : "double:60.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"type" : "clear",
"version" : 2
},
{
"config_class" : "<global>",
"knob_name" : "update_node_timeout",
"knob_value" : "double:4.000000",
"type" : "set",
"version" : 2
}
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
After compaction, status json
would show:
"configuration_database" : {
"commits" : [
],
"last_compacted_version" : 2,
"most_recent_version" : 2,
"mutations" : [
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
Detailed Implementation
The configuration database is implemented as a replicated state machine living on the coordinators. This allows configuration database transactions to continue to function in the event of a catastrophic loss of the transaction subsystem.
To commit a transaction, clients run the two phase Paxos protocol. First, the client asks for a live version from a quorum of coordinators. When a coordinator receives a request for its live version, it increments its local live version by one and returns it to the client. Then, the client submits its writes at the live version it received in the previous step. A coordinator will accept the commit if it is still on the same live version. If a majority of coordinators accept the commit, it is considered committed.
Coordinator
Each coordinator runs a ConfigNode
which serves as a replica storing one
full copy of the configuration database. Coordinators never communicate with
other coordinators while processing configuration database transactions.
Instead, the client runs the transaction and determines when it has quorum
agreement.
Coordinators serve the following ConfigTransactionInterface
to allow
clients to read from and write to the configuration database.
ConfigTransactionInterface
Request | Request fields | Reply fields | Explanation |
---|---|---|---|
GetGeneration | (coordinatorsHash) | (generation) or (coordinators_changed error) | Get a new read version. This read version is used for all future requests in the transaction |
Get | (configuration class, knob name, coordinatorsHash, generation) | (knob value or empty) or (coordinators_changed error) or (transaction_too_old error) | Returns the current value stored at the specified configuration class and knob name, or empty if no value exists |
GetConfigClasses | (coordinatorsHash, generation) | (configuration classes) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all configuration classes stored in the configuration database |
GetKnobs | (configuration class, coordinatorsHash, generation) | (knob names) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all knob names stored for the provided configuration class |
Commit | (mutation list, coordinatorsHash, generation) | ack or (coordinators_changed error) or (commit_unknown_result error) or (not_committed error) | Commit mutations set by the transaction |
Coordinators also serve the following ConfigFollowerInterface
to provide
access to (and modification of) their current state. Most interaction through
this interface is done by the cluster controller through its
IConfigConsumer
implementation living on the ConfigBroadcaster
.
ConfigFollowerInterface
Request | Request fields | Reply fields | Explanation |
---|---|---|---|
GetChanges | (lastSeenVersion, mostRecentVersion) | (mutation list, version) or (version_already_compacted error) or (process_behind error) | Request changes since the last seen version, receive a new most recent version, as well as recent mutations |
GetSnapshotAndChanges | (mostRecentVersion) | (snapshot, snapshotVersion, changes) | Request the full configuration database, in the form of a base snapshot and changes to apply on top of the snapshot |
Compact | (version) | ack | Compact mutations up to the provided version |
Rollforward | (rollbackTo, lastKnownCommitted, target, changes, specialZeroQuorum) | ack or (version_already_compacted error) or (transaction_too_old error) | Rollback/rollforward mutations on a node to catch it up with the majority |
GetCommittedVersion | () | (registered, lastCompacted, lastLive, lastCommitted) | Request version information from a ConfigNode |
Lock | (coordinatorsHash) | ack | Lock a ConfigNode to prevent it from serving requests during a coordinator change |
Cluster Controller
The cluster controller runs a singleton ConfigBroadcaster
which is
responsible for periodically polling the ConfigNode
s for updates, then
broadcasting these updates to workers through the ConfigBroadcastInterface
.
When workers join the cluster, they register themselves and their
ConfigBroadcastInterface
with the broadcaster. The broadcaster then pushes
new updates to registered workers.
The ConfigBroadcastInterface
is also used by ConfigNode
s to register
with the ConfigBroadcaster
. ConfigNode
s need to register with the
broadcaster because the broadcaster decides when the ConfigNode
may begin
serving requests, based on global information about status of other
ConfigNode
s. For example, if a system with three ConfigNode
s suffers a
fault where one ConfigNode
loses data, the faulty ConfigNode
should
not be allowed to begin serving requests again until it has been rolled forward
and is up to date with the latest state of the configuration database.
ConfigBroadcastInterface
Request | Request fields | Reply fields | Explanation |
---|---|---|---|
Snapshot | (snapshot, version, restartDelay) | ack | A snapshot of the configuration database sent by the broadcaster to workers |
Changes | (changes, mostRecentVersion, restartDelay) | ack | A list of changes up to and including mostRecentVersion, sent by the broadcaster to workers |
Registered | () | (registered, lastSeenVersion) | Sent by the broadcaster to new ConfigNode s to determine their registration status |
Ready | (snapshot, snapshotVersion, liveVersion, coordinatorsHash) | ack | Sent by the broadcaster to new ConfigNode s to allow them to start serving requests |
Worker
Each worker runs a LocalConfiguration
instance which receives and applies
knob updates from the ConfigBroadcaster
. The local configuration maintains
a durable KeyValueStoreMemory
containing the following:
- The latest known configuration version
- The most recently used configuration path
- All knob overrides corresponding to the configuration path at the latest known version
Once a worker starts, it will:
- Apply manually set knobs
- Read its local configuration file
- If the stored configuration path does not match the configuration path specified on the command line, delete the local configuration file
- Otherwise, apply knob updates from the local configuration file. Manually specified knobs will not be overridden
- Register with the broadcaster to receive new updates for its configuration
classes
- Persist these updates when received and restart if necessary
Knob Atomicity
All knobs are classified as either atomic or non-atomic. Atomic knobs require a process restart when changed, while non-atomic knobs do not.
Compaction
ConfigNode
s store individual mutations in order to be able to update other,
out of date ConfigNode
s without needing to send a full snapshot. Each
configuration database commit also contains additional metadata such as a
timestamp and a text description of the changes being made. To keep the size of
the configuration database manageable, a compaction process runs periodically
(defaulting to every five minutes) which compacts individual mutations into a
simplified snapshot of key-value pairs. Compaction is controlled by the
ConfigBroadcaster
, using information it peridiodically requests from
ConfigNode
s. Compaction will only compact up to the minimum known version
across all ConfigNode
s. This means that if one ConfigNode
is
permanently partitioned from the ConfigBroadcaster
or from clients, no
compaction will ever take place.
Rollback / Rollforward
It is necessary to be able to roll ConfigNode
s backward and forward with
respect to their committed versions due to the nature of quorum logic and
unreliable networks.
Consider a case where a client commit gets persisted durably on one out of
three ConfigNode
s (assume commit messages to the other two nodes are lost).
Since the value is not committed on a majority of ConfigNode
s, it cannot be
considered committed. But it is also incorrect to have the value persist on one
out of three nodes as future commits are made. In this case, the most common
result is that the ConfigNode
will be rolled back when the next commit from
a different client is made, and then rolled forward to contain the data from
the commit. PaxosConfigConsumer
contains logic to recognize ConfigNode
minorities and update them to match the quorum.
Changing Coordinators
Since the configuration database lives on the coordinators and the coordinators can be changed, it is necessary to copy the configuration database from the old to the new coordinators during such an event. A coordinator change performs the following steps in regards to the configuration database:
- Write
\xff/coordinatorsKey
with the new coordinators string. The key\xff/previousCoordinators
contains the current (old) set of coordinators. - Lock the old
ConfigNode
s so they can no longer serve client requests. - Start a recovery, causing a new cluster controller (and therefore
ConfigBroadcaster
) to be selected. - Read
\xff/previousCoordinators
on theConfigBroadcaster
and, if present, read an up-to-date snapshot of the configuration database on the old coordinators. - Determine if each registering
ConfigNode
needs an up-to-date snapshot of the configuration database sent to it, based on its reported version and the snapshot version of the database received from the old coordinators.- Some new coordinators which were also coordinators in the previous configuration may not need a snapshot.
- Send ready requests to new
ConfigNode
s, including an up-to-date snapshot if necessary. This allows the new coordinators to begin serving configuration database requests from clients.
Testing
The ConfigDatabaseUnitTests
class unit test a number of different
configuration database dimensions.
The ConfigIncrement
workload tests contention between clients attempting to
write to the configuration database, paired with machine failure and
coordinator changes.