Finish TLog Spill-by-reference design doc, and incorporate feedback.

Commit 1547cbcef3 (parent 0469750087).
## Background

(This assumes a basic familiarity with [FoundationDB's architecture](https://www.youtu.be/EMwhsGsxfPU).)
Transaction logs are a distributed Write-Ahead-Log for FoundationDB. They
receive commits from proxies, and are responsible for durably storing those
commits, and making them available to storage servers for reading.

Clients send *mutations*, the list of their sets, clears, atomic operations,
etc., to proxies. Proxies collect mutations into a *batch*, which is the list
of all changes that need to be applied to the database to bring it from version
`N-1` to `N`. Proxies then walk through their in-memory mapping of shard
boundaries to associate one or more *tags*, a small integer uniquely
identifying a destination storage server, with each mutation. They then send a
*commit*, the full list of `(tags, mutation)` for each mutation in a batch, to
the transaction logs.

The transaction log has two responsibilities: it must persist the commits to
disk and notify the proxy when a commit is durably stored, and it must make the
commit available for consumption by the storage server. Each storage server
*peeks* its own tag, which requests all mutations from the transaction log with
the given tag at a given version or above. After a storage server durably
applies the mutations to disk, it *pops* the transaction logs with the same tag
and its new durable version, notifying the transaction logs that they may
discard mutations with the given tag and a lesser version.

To persist commits, a transaction log appends commits to a growable on-disk
ring buffer, called a *disk queue*, in version order. Commit data is *pushed*
onto the disk queue, and when all mutations in the oldest commit persisted are
no longer needed, the disk queue is *popped* to trim its tail.

To make commits available to storage servers efficiently, a transaction log
maintains a copy of the commit in-memory, along with one queue per tag that
sequentially indexes the location of each mutation in each commit with the
specific tag. This way, responding to a peek from a storage server only
requires sequentially walking through the queue, and copying each mutation
referenced into the response buffer.
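As an illustration, the push/peek/pop flow over per-tag in-memory queues can be sketched as follows (the types and names here are illustrative, not the actual FDB code, which references a shared buffer rather than copying mutations):

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tag = int;
using Version = int64_t;

struct IndexedMutation {
    Version version;
    std::string mutation;  // the real TLog indexes a location, not a copy
};

struct InMemoryIndex {
    std::map<Tag, std::deque<IndexedMutation>> queues;

    // Push: walk the commit's (tag, mutation) pairs, appending to each tag's queue.
    void push(Version v, const std::vector<std::pair<Tag, std::string>>& commit) {
        for (const auto& [tag, m] : commit)
            queues[tag].push_back({v, m});
    }

    // Peek: sequentially walk one tag's queue, copying out mutations at or
    // above the requested version.
    std::vector<std::string> peek(Tag tag, Version begin) const {
        std::vector<std::string> out;
        auto it = queues.find(tag);
        if (it == queues.end()) return out;
        for (const auto& im : it->second)
            if (im.version >= begin) out.push_back(im.mutation);
        return out;
    }

    // Pop: discard entries for `tag` with a version at or below the storage
    // server's new durable version.
    void pop(Tag tag, Version durable) {
        auto& q = queues[tag];
        while (!q.empty() && q.front().version <= durable) q.pop_front();
    }
};
```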
Transaction logs internally handle commits by performing two operations
concurrently. First, they walk through each mutation in the commit, and push
the mutation onto an in-memory queue of mutations destined for that tag.
Second, they include the data in the next batch of pages to durably persist to
disk. These in-memory queues are popped from when the corresponding storage
server has persisted the data to its own disk. The disk queue only exists to
allow the in-memory queues to be rebuilt if the transaction log crashes, is
never read from except while a transaction log recovers after a crash, and is
popped when the oldest version it contains is no longer needed in memory.

TLogs will need to hold the last 5-7 seconds of mutations. In normal
operation, the default 1.5GB of memory is enough such that the last 5-7 seconds
to exceed `TLOG_SPILL_THRESHOLD` bytes, the transaction log offloads the
oldest data to disk. This writing of data to disk to reduce TLog memory
pressure is referred to as *spilling*.
```
**************************************************************
* Transaction Log                                            *
*                                                            *
*                                                            *
*  +------------------+  pushes  +------------+              *
*  | Incoming Commits |--------->| Disk Queue |  +------+    *
*  +------------------+          +------------+  |SQLite|    *
*     |                               ^          +------+    *
*     |                               |              ^       *
*     |                          pops |              |       *
*     +------+-------+------+         |       writes |       *
*     |      |       |      |         |              |       *
*     v      v       v      |     +----------+       |       *
* in-memory +---+  +---+  +---+   |Spill Loop|       |       *
  ...
*                                                            *
**************************************************************
```
## Overview
Previously, spilling would work by writing the data to a SQLite B-tree. The
key would be `(tag, version)`, and the value would be all the mutations
destined for the given tag at the given version. Peek requests have a start
version, which is the latest version that the storage server knows about, and
the TLog responds by range-reading the B-tree from that start version. Pop
requests allow the TLog to forget all mutations for a tag up to a specific
version, and the TLog thus issues a range clear from `(tag, 0)` to
`(tag, pop_version)`. After spilling, the durably written data in the disk
queue would be trimmed to only include from the spilled version on, as any
required data is now entirely, durably held in the B-tree. As the entire value
is copied into the B-tree, this method of spilling will be referred to as
*spill-by-value* in the rest of this document.
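For illustration, the spill-by-value keyspace operations described above can be modeled with an ordered map standing in for the SQLite B-tree (a sketch with invented names, not the real storage layer):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tag = int;
using Version = int64_t;
// (tag, version) -> all mutation bytes for that tag at that version.
using BTree = std::map<std::pair<Tag, Version>, std::string>;

// Spilling copies the entire value into the B-tree.
void spill(BTree& bt, Tag tag, Version v, const std::string& mutations) {
    bt[{tag, v}] = mutations;
}

// Peek: range read from (tag, start) to (tag, +inf).
std::vector<std::string> peek(const BTree& bt, Tag tag, Version start) {
    std::vector<std::string> out;
    for (auto it = bt.lower_bound({tag, start});
         it != bt.end() && it->first.first == tag; ++it)
        out.push_back(it->second);
    return out;
}

// Pop: range clear from (tag, 0) to (tag, popVersion), discarding only
// lesser versions (exclusive end).
void pop(BTree& bt, Tag tag, Version popVersion) {
    bt.erase(bt.lower_bound({tag, 0}), bt.lower_bound({tag, popVersion}));
}
```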
Unfortunately, it turned out that spilling in this fashion greatly impacts
TLog performance. A write bandwidth saturation test was run against a
cluster, with a modification to the transaction logs to have them act as if
there was one storage server that was permanently failed; it never sent pop
requests to allow the TLog to remove data from memory. After 15min, the write
bandwidth had reduced to 30% of its baseline. After 30min, that became 10%.
After 60min, that became 5%. Writing entire values gives an immediate 3x
additional write amplification, and the actual write amplification increases
as the B-tree gets deeper. (This is an intentional illustration of the worst
case, due to the workload being a saturating write load.)
With the recent multi-DC/multi-region work, a failure of a remote data center
would cause transaction logs to need to buffer all commits, as every commit is

to rapidly degrade. It is unacceptable for a remote datacenter failure to so
drastically affect the primary datacenter's performance in the case of a
failure, so a more performant way of spilling data is required.
Whereas spill-by-value copied the entire mutation into the B-tree and removed
it from the disk queue, spill-by-reference leaves the mutations in the disk
queue and writes a pointer to them into the B-tree. Performance experiments
versions to be written together.

```
************************************************************************
* DiskQueue                                                            *
*                                                                      *
*  ------- Index in B-tree -------    ---- Index in memory ----        *
* /                               \  /                         \       *
* +-----------------------------------+-----------------------------+  *
* | Spilled Data                      | Most Recent Data            |  *
  ...
```
Spill-by-reference works by taking a larger range of versions, and building a
single key-value pair per tag that describes where in the disk queue every
relevant commit for that tag is located. Concretely, this takes the form
`(tag, last_version) -> [(version, start, end, mutation_bytes), ...]`, where:
* `tag` is the small integer representing the storage server this mutation batch is destined for.
* `last_version` is the last/maximum version contained in the value's batch.
* `mutation_bytes` is the number of bytes in the commit that are relevant for this tag.
Each iteration through spilling then writes only one key-value pair per
spilled tag into the B-tree. This turns the number of writes into the B-tree
from `O(tags * versions)` to `O(tags)`.
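A minimal sketch of one spill iteration under this scheme, using illustrative types (the real code operates on FDB's own containers and `IDiskQueue::location` values):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Tag = int;
using Version = int64_t;

// One entry per commit relevant to a tag.
struct BatchEntry {
    Version version;
    int64_t start;          // disk queue location of the commit
    int64_t end;
    uint32_t mutationBytes; // bytes in the commit relevant to this tag
};

// B-tree keyed by (tag, last_version of the batch).
using SpilledIndex = std::map<std::pair<Tag, Version>, std::vector<BatchEntry>>;

// One B-tree write per tag per spill iteration: O(tags), regardless of how
// many versions were spilled.
void spillIteration(SpilledIndex& btree,
                    const std::map<Tag, std::vector<BatchEntry>>& perTag) {
    for (const auto& [tag, entries] : perTag) {
        if (entries.empty()) continue;
        // Entries are accumulated in version order, so the last entry holds
        // the batch's maximum version, which becomes part of the key.
        Version lastVersion = entries.back().version;
        btree[{tag, lastVersion}] = entries;
    }
}
```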
Note that each tuple in the list represents a commit, and not a mutation.
This means that peeking spilled commits will involve reading all mutations
that were a part of the commit, and then filtering them to only the ones that
have the tag of interest. Alternatively, one could have each tuple represent
a mutation within a commit, to prevent over-reading when peeking. There exist
pathological workloads for each strategy. The purpose of this work is most
importantly to support spilling of log router tags. These exist on every
mutation, so that mutations get copied to other datacenters. This is the exact

to record mutation(s) versus the entire commit, but performance testing hasn't
surfaced this as important enough to include in the initial version of this
work.
Peeking spilled data now works by issuing a range read to the B-tree from
`(tag, peek_begin)` to `(tag, infinity)`. This is why the key contains the
last version of the batch, rather than the beginning, so that a range read
from the peek request's version will always return all relevant batches. For
each batched tuple, if the version is greater than our peek request's version,
then we read the commit containing that mutation from disk, extract the
relevant mutations, and append them to our response. There is a target size
of the response, 150KB by default. As we iterate through the tuples, we sum
`mutation_bytes`, which already informs us how many bytes of relevant
mutations we'll get from a given commit. This allows us to make sure we won't
waste disk IOs on reads that will end up being discarded as unnecessary.
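The byte-budgeting described above can be sketched as follows (an illustrative model; `planPeekReads` and its types are invented for this example, and the 150KB default mirrors the target response size):

```cpp
#include <cstdint>
#include <vector>

struct BatchEntry {
    int64_t version;
    int64_t start;
    uint32_t length;
    uint32_t mutationBytes;
};

// Returns the entries whose commits should actually be read from disk:
// we stop before the accumulated relevant bytes exceed the target, so no
// disk IO is issued for a commit whose mutations would be discarded anyway.
std::vector<BatchEntry> planPeekReads(const std::vector<BatchEntry>& entries,
                                      int64_t peekVersion,
                                      uint32_t targetBytes = 150 * 1024) {
    std::vector<BatchEntry> toRead;
    uint32_t accumulated = 0;
    for (const auto& e : entries) {
        if (e.version < peekVersion) continue;  // already seen by the peeker
        // Always allow at least one commit through, even if oversized.
        if (accumulated > 0 && accumulated + e.mutationBytes > targetBytes)
            break;
        accumulated += e.mutationBytes;
        toRead.push_back(e);
    }
    return toRead;
}
```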
Popping spilled data works similarly to before, but now requires recovering
information from disk. Previously, we would maintain a map from version to
location in the disk queue for every version we hadn't yet spilled. Once
spilling has copied the value into the B-tree, knowing where the commit was in
the disk queue is useless to us, and is removed. In spill-by-reference, that
information is still needed to know how to map "pop until version 7" to "pop
until byte 87" in the disk queue. Unfortunately, keeping this information in
memory would result in TLogs slowly consuming more and more
memory[^versionmap-memory] as more data is spilled. Instead, we issue a range
read of the B-tree from `(tag, pop_version)` to `(tag, infinity)` and look at
the first commit we find with a version greater than our own. We then use its
starting disk queue location as the limit of what we could pop the disk queue
until for this tag.
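A sketch of that lookup, with an ordered map standing in for the B-tree (illustrative types; the real index value is a serialized `vector<SpilledData>`):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

using Tag = int;
using Version = int64_t;

struct BatchEntry {
    Version version;
    int64_t start;  // disk queue location where this commit begins
};

using SpilledIndex = std::map<std::pair<Tag, Version>, std::vector<BatchEntry>>;

// Map "pop until popVersion" to a disk queue location limit for this tag.
std::optional<int64_t> popLocationLimit(const SpilledIndex& idx, Tag tag,
                                        Version popVersion) {
    // Keys are (tag, last_version), so the first key at or after
    // (tag, popVersion) holds the first batch that may still be needed.
    auto it = idx.lower_bound({tag, popVersion});
    if (it == idx.end() || it->first.first != tag) return std::nullopt;
    for (const auto& e : it->second)
        if (e.version > popVersion) return e.start;  // first commit to keep
    return std::nullopt;  // simplification: next batch would hold it
}
```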
[^versionmap-memory]: Pessimistic assumptions would suggest that a TLog spilling 1TB of data would require ~50GB of memory to hold this map, which isn't acceptable.
The rough outline of concrete changes proposed looks like:

0. Modify popping in new TLogServer
0. Spill txsTag specially
### Configuring and Upgrading

Modifying how transaction logs spill data is a change to the on-disk files of
transaction logs. The work for enabling safe upgrades and rollbacks of
persistent state changes to transaction logs was split off into a separate
design document: "Forward Compatibility for Transaction Logs".

That document describes a `log_version` configuration setting that controls
the availability of new transaction log features. A similar configuration
setting was created, `log_spill`, such that at `log_version >= 3`, one may run
`fdbcli> configure log_spill:=2` to enable spill-by-reference. Only FDB 6.1
or newer will be able to recover transaction log files that were using
spill-by-reference. FDB 6.2 will use spill-by-reference by default.

| FDB Version | Default | Configurable |
|-------------|---------|--------------|
| 6.0         | No      | No           |
| 6.1         | No      | Yes          |
| 6.2         | Yes     | Yes          |

If running FDB 6.1, the full command to enable spill-by-reference is
`fdbcli> configure log_version:=3 log_spill:=2`.

The TLog implementing spill-by-value was moved to `OldTLogServer_6_0.actor.cpp`
and namespaced similarly. `tLogFnForOptions` takes a `TLogOptions`, which is
the version and spill type, and returns the correct TLog implementation
according to those settings. We maintain a map of
`(TLogVersion, StoreType, TLogSpillType)` to TLog instance, so that only one
SharedTLog exists per configuration variant.
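A sketch of keeping one SharedTLog per configuration variant (the enum values and the `sharedTLogFor` helper are illustrative, not the actual `tLogFnForOptions` signature):

```cpp
#include <map>
#include <memory>
#include <tuple>

enum class TLogVersion { V2 = 2, V3 = 3 };
enum class StoreType { SSD, Memory };
enum class TLogSpillType { VALUE = 1, REFERENCE = 2 };

struct SharedTLog {
    TLogVersion version;
    TLogSpillType spillType;
};

using TLogKey = std::tuple<TLogVersion, StoreType, TLogSpillType>;

// Memoize on the full configuration tuple, so repeated requests for the
// same variant share one SharedTLog instance.
std::shared_ptr<SharedTLog> sharedTLogFor(
        std::map<TLogKey, std::shared_ptr<SharedTLog>>& instances,
        TLogVersion v, StoreType s, TLogSpillType t) {
    auto& slot = instances[{v, s, t}];
    if (!slot) slot = std::make_shared<SharedTLog>(SharedTLog{v, t});
    return slot;
}
```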
### Generations

As a background, each time FoundationDB goes through a recovery, it will
recruit a new generation of transaction logs. This new generation of
transaction logs will often be recruited on the same worker that hosted the
previous generation's transaction log. The old generation of transaction logs
will only shut down once all the data that they have has been fully popped.
This means that there can be multiple instances of a transaction log in the
same process.

Naively, this would create resource issues. Each instance would think that it
is allowed its own 1.5GB buffer of in-memory mutations. Instead, internally to
the TLog implementation, the transaction log is split into two parts. A
`SharedTLog` is all the data that should be shared across multiple generations.
A TLog is all the data that is private to one generation. Most notably, the
1.5GB mutation buffer and the on-disk files are owned by the `SharedTLog`. The
index for the data added to that buffer is maintained within each TLog. In the
code, a SharedTLog is `struct TLogData`, and a TLog is `struct LogData`.
(I didn't choose these names.)

This background is required, because one needs to keep in mind that we might be
committing in one TLog instance, a different one might be spilling, and yet
another might be the one popping data.

```
*********************************************************
* SharedTLog                                            *
*                                                       *
*  +--------+--------+--------+--------+--------+       *
*  | TLog 1 | TLog 2 | TLog 3 | TLog 4 | TLog 5 |       *
*  +--------+--------+--------+--------+--------+       *
*    ^ popping          ^spilling        ^committing    *
*********************************************************
```

Conceptually, this is because each TLog owns a separate part of the same Disk
Queue file. The earliest TLog instance needs to be the one that controls when
the earliest part of the file can be discarded. We spill in version order, and
thus whatever TLog is responsible for the earliest unspilled version needs to
be the one doing the spilling. We always commit the newest version, so the
newest TLog must be the one writing to the disk queue and inserting new data
into the buffer of mutations.
### Spilling

`updatePersistentData()` is the core of the spilling loop. It takes a new
persistent data version, writes the in-memory index for all commits less than
that version to disk, and then removes them from memory. By contract, once
spilling commits an updated persistentDataVersion to the B-tree, those bytes
will not need to be recovered into memory after a crash, nor will the
in-memory bytes be needed to serve a peek response.

Our new method of spilling iterates through each tag, and builds up a
`vector<SpilledData>` for each tag, where `SpilledData` is:

``` CPP
struct SpilledData {
	Version version;
	IDiskQueue::location start;
	uint32_t length;
	uint32_t mutationBytes;
};
```

And then this vector is serialized, and written to the B-tree as
`(logId, tag, max(SpilledData.version))` = `serialized(vector<SpilledData>)`.

As we iterate through each commit, we record the number of mutation bytes in
this commit that have our tag of interest. This is so that later, peeking can
read exactly the number of commits that it needs from disk.

Although the focus of this project is on the topic of spilling, the code
implementing it saw the least amount of total change.
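The per-tag byte accounting can be sketched like this (illustrative shapes; a real commit carries `(tags, mutation)` pairs in FDB's own serialized format rather than strings):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tag = int;

// While walking a commit's mutations during spilling, tally the bytes
// relevant to each tag; this is what fills SpilledData.mutationBytes so
// that peeking can later read exactly as many commits as it needs.
std::map<Tag, uint32_t> mutationBytesPerTag(
        const std::vector<std::pair<std::vector<Tag>, std::string>>& commit) {
    std::map<Tag, uint32_t> bytes;
    for (const auto& [tags, mutation] : commit)
        for (Tag t : tags)  // a mutation may carry several tags (e.g. log router tags)
            bytes[t] += static_cast<uint32_t>(mutation.size());
    return bytes;
}
```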
### Peeking

A `TLogPeekRequest` contains a `Tag` and a `Version`, and is a request for all
commits with the specified tag with a commit version greater than or equal to
the given version. The goal is to return a 150KB block of mutations.

When servicing a peek request, we will read up to 150KB of mutations from the
in-memory index. If the peek version is lower than the version that we've
spilled to disk, then we consult the on-disk index for up to 150KB of
mutations. (If we tried to read from disk first, and then read from memory, we
would then be racing with the spilling loop moving data from memory to disk.)
```
**************************************************************************
*                                                                        *
*  +---------+  Tag  +---------+  Tag  +--------+                        *
  ...
*                                                                        *
**************************************************************************
```
Spill-by-value and the memory storage engine only ever read from the DiskQueue
when recovering, and read the entire file linearly. Therefore, `IDiskQueue`
had no API for random reads to the DiskQueue. That ability is now required
for peeking, and thus, `IDiskQueue`'s API has been enhanced correspondingly:

directly, the responsibility for verifying data integrity falls upon the
DiskQueue. `CheckHashes::YES` will cause the DiskQueue to use the checksum in
each DiskQueue page to verify data integrity. If an externally maintained
checksum exists to verify the returned data, then `CheckHashes::NO` can be
used to elide the checksumming. A page failing its checksum will cause the
transaction log to die with an `io_error()`.
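A sketch of the checksum decision (the hash function, page layout, and names here are illustrative; FDB's DiskQueue uses its own page format and checksum):

```cpp
#include <cstdint>
#include <string>

enum class CheckHashes { NO, YES };

// Simple FNV-1a hash standing in for the page checksum.
uint32_t fnv1a(const std::string& data) {
    uint32_t h = 2166136261u;
    for (unsigned char c : data) { h ^= c; h *= 16777619u; }
    return h;
}

struct Page {
    uint32_t checksum;
    std::string payload;
};

// Returns true if the page may be used; in the real system a failure here
// surfaces as an io_error() and kills the transaction log.
bool readPage(const Page& p, CheckHashes ch) {
    if (ch == CheckHashes::NO) return true;  // caller verifies its own checksum
    return fnv1a(p.payload) == p.checksum;
}
```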
What is read from disk is a `TLogQueueEntry`:

``` CPP
struct TLogQueueEntryRef {
	UID id;
	Version version;
	Version knownCommittedVersion;
	StringRef messages;
};
```

Which provides the commit version and the logId of the TLog generation that
produced this commit, in addition to all of the mutations for that version.
(`knownCommittedVersion` is only used during FDB's recovery process.)
### Popping

transaction log to notify it that it is allowed to discard data for `tag` up
through `version`. Once all the tags have been popped from the oldest commit
in the DiskQueue, the tail of the DiskQueue can be discarded to reclaim space.
If our popped version is in the range of what has been spilled, then we need to
consult our on-disk index to see what is the next location in the disk queue
that has data which is useful to us. This act would race with the spilling
loop changing what data is spilled, and thus disk queue popping
(`popDiskQueue()`) was made to run serially after spilling completes.

Also due to spilling and popping largely overlapping in state, the disk queue
popping loop does not immediately react to a pop request from a storage server
changing the popped version for a tag. Spilling saves the popped version for
each tag when the spill loop runs, and if that version changed, then
`popDiskQueue()` refreshes its knowledge of the minimum location in the disk
queue required for that tag. We can pop the disk queue to the minimum of all
minimum tag locations, or to the minimum location needed for an in-memory
mutation if there is no spilled data.
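The pop-limit computation can be sketched as follows (illustrative; the real code tracks `IDiskQueue::location` values per tag):

```cpp
#include <cstdint>
#include <map>
#include <optional>

using Tag = int;

// The disk queue can only be popped to the minimum of every tag's minimum
// required location; with no spilled data, the limit is the oldest location
// still referenced by an in-memory mutation.
std::optional<int64_t> diskQueuePopLocation(
        const std::map<Tag, int64_t>& minLocationPerTag,
        std::optional<int64_t> oldestInMemoryLocation) {
    std::optional<int64_t> popTo = oldestInMemoryLocation;
    for (const auto& [tag, loc] : minLocationPerTag)
        if (!popTo || loc < *popTo) popTo = loc;
    return popTo;  // nullopt: nothing constrains the pop yet
}
```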
As a post-implementation note, this ended up being a "here be dragons"
experience, with a surprising number of edge cases in races between
spilling/popping, various situations of having/not having/having inaccurate
data for tags, or that tags can stop being pushed to when storage servers are
removed but their corresponding `TagData` is never removed.
### Transaction State Store

can only happen at scale:

* Verify that peek requests get limited
* See if tlog commits can get starved by excessive peeking
# TLog Spill-By-Reference Operational Guide
|
# TLog Spill-By-Reference Operational Guide
|
||||||
|
|
||||||
## Notable Behavior Changes
|
## Notable Behavior Changes
|
||||||
|
@ -391,7 +545,7 @@ The expected implication of this are:
|
||||||
### Popping

Popping will transition from being an in-memory-only operation to one that
can involve reads from disk if the popped tag has spilled data.

Due to a strange quirk, TLogs will allocate up to 2GB of memory as a read cache
for SQLite's B-tree. The expected maximum size of the B-tree has drastically
Increasing it could increase throughput in spilling regimes.<br>
Decreasing it will decrease how sawtooth-like TLog memory usage is.<br>

`UPDATE_STORAGE_BYTE_LIMIT`
: How many bytes of mutations should be spilled at once in a spill-by-value TLog.<br>
This knob is pre-existing, and has only been "changed" to only apply to spill-by-value.<br>
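To illustrate what a per-iteration byte budget of this kind bounds, here is a hypothetical sketch of one spill-loop iteration. `Mutation` and `spillOnce` are invented names for this sketch, and the real spill path writes into SQLite rather than simply discarding the mutation.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

// Illustrative mutation: just a payload whose size counts against the budget.
struct Mutation {
	std::string bytes;
};

// Spill mutations oldest-first until the per-iteration byte budget (in the
// spirit of the UPDATE_STORAGE_BYTE_LIMIT knob) is exhausted. Returns how
// many mutations were spilled; the remainder waits for the next iteration.
size_t spillOnce(std::deque<Mutation>& inMemory, int64_t byteLimit) {
	int64_t spilledBytes = 0;
	size_t count = 0;
	while (!inMemory.empty() && spilledBytes < byteLimit) {
		spilledBytes += static_cast<int64_t>(inMemory.front().bytes.size());
		// A real spill-by-value TLog would write the mutation into the
		// B-tree here before releasing the in-memory copy.
		inMemory.pop_front();
		++count;
	}
	return count;
}
```

Capping the work per iteration is what trades throughput (larger budget) against how sawtooth-like the TLog's memory usage becomes (smaller budget), as described for the sizing knob above.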
With the new changes, we must ensure that sufficient information has been exposed such that:

1. If something goes wrong in production, we can understand what and why from trace logs.
2. We can understand if the TLog is performing suboptimally, and if so, which knob we should change and by how much.

The following metrics were added to `TLogMetrics`:

### Spilling

### Peeking

`PeekMemoryRequestsStalled`

`PeekMemoryReserved`
: The amount of memory currently reserved for serving peek requests.

### Popping

`QueuePoppedVersion`
: The oldest version that's still useful.

`MinPoppedTagLocality`
: The locality of the tag that's preventing the DiskQueue from being further popped.

`MinPoppedTagId`
: The id of the tag that's preventing the DiskQueue from being further popped.

## Monitoring and Alerting
Of which I'm aware of:

* Any current alerts on "Disk Queue files more than [constant size] GB" will need to be removed.
* Any alerting or monitoring of `log*.sqlite` as an indication of spilling will no longer be effective.
* A graph of `BytesInput - BytesPopped` will give an idea of the number of "active" bytes in the DiskQueue file.
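The suggested graph is just the difference of two counters sampled together. A minimal sketch, assuming snapshots of the `BytesInput` and `BytesPopped` fields (the `TLogCounters` struct and function name here are hypothetical scaffolding for a dashboard, not FDB code):

```cpp
#include <cassert>
#include <cstdint>

// Snapshot of the two TLogMetrics counters referenced above.
struct TLogCounters {
	int64_t bytesInput;  // Total bytes committed to this TLog.
	int64_t bytesPopped; // Total bytes discarded from the disk queue.
};

// "Active" bytes still held in the DiskQueue file: what has been written
// minus what has been popped.
int64_t activeDiskQueueBytes(const TLogCounters& c) {
	return c.bytesInput - c.bytesPopped;
}
```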
<!-- Force long-style table of contents -->
<script>window.markdeepOptions={}; window.markdeepOptions.tocStyle="long";</script>
<!-- When printed, top level section headers should force page breaks -->