# TLog Spill-By-Reference Design

## Background

(This assumes a basic familiarity with [FoundationDB's architecture](https://www.youtu.be/EMwhsGsxfPU).)

Transaction logs are a distributed Write-Ahead-Log for FoundationDB. They
receive commits from proxies, durably store those commits, and make them
available to storage servers for reading.

Clients send *mutations*, the list of their sets, clears, atomic operations,
etc., to proxies. Proxies collect mutations into a *batch*, which is the list
of all changes that need to be applied to the database to bring it from version
`N-1` to `N`. Proxies then walk through their in-memory mapping of shard
boundaries to associate one or more *tags*, a small integer uniquely
identifying a destination storage server, with each mutation. They then send a
*commit*, the full list of `(tags, mutation)` for each mutation in a batch, to
the transaction logs.

The transaction log has two responsibilities: it must persist the commits to
disk and notify the proxy when a commit is durably stored, and it must make the
commit available for consumption by the storage server. Each storage server
*peeks* its own tag, which requests all mutations from the transaction log with
the given tag at a given version or above. After a storage server durably
applies the mutations to disk, it *pops* the transaction logs with the same tag
and its new durable version, notifying the transaction logs that they may
discard mutations with the given tag and a lesser version.

To persist commits, a transaction log appends commits to a growable on-disk
ring buffer, called a *disk queue*, in version order. Commit data is *pushed*
onto the disk queue, and when all mutations in the oldest persisted commit are
no longer needed, the disk queue is *popped* to trim its tail.

To make commits available to storage servers efficiently, a transaction log
maintains a copy of the commit in memory, and maintains one queue per tag that
indexes the location of each mutation in each commit with the specific tag,
sequentially. This way, responding to a peek from a storage server only
requires sequentially walking through the queue and copying each referenced
mutation into the response buffer.

Transaction logs internally handle commits by performing two operations
concurrently. First, they walk through each mutation in the commit, and push
the mutation onto an in-memory queue of mutations destined for that tag.
Second, they include the data in the next batch of pages to durably persist to
disk. These in-memory queues are popped when the corresponding storage server
has persisted the data to its own disk. The disk queue only exists to allow
the in-memory queues to be rebuilt if the transaction log crashes; it is never
read from except while a transaction log recovers after a crash, and it is
popped when the oldest version it contains is no longer needed in memory.

TLogs need to hold the last 5-7 seconds of mutations. In normal operation, the
default 1.5GB of memory is enough that the last 5-7 seconds of commits should
almost always fit in memory. However, in the presence of failures, the
transaction log can be required to buffer significantly more data.
Most notably, when a storage server fails, its tag isn't popped until data
distribution is able to re-replicate all of the shards that storage server was
responsible for to other storage servers. Until that happens, mutations
destined for the failed storage server will accumulate on the TLog, in case it
comes back and is able to rejoin the cluster.

When this accumulation causes the memory required to hold all the unpopped data
to exceed `TLOG_SPILL_THRESHOLD` bytes, the transaction log offloads the oldest
data to disk. This writing of data to disk to reduce TLog memory pressure is
referred to as *spilling*.

**************************************************************************
* Transaction Log                                                        *
*                                                                        *
*  +------------------+   pushes    +------------+      +--------+       *
*  | Incoming Commits |------------>| Disk Queue |      | SQLite |       *
*  +------------------+             +------------+      +--------+       *
*    |      |      |                      ^                 ^            *
*    |      |      |                      | pops            | writes     *
*    v      v      v                      |                 |            *
*  +---+  +---+  +---+                    |           +------------+     *
*  | 1 |  | 2 |  | 3 |  in-memory         |           | Spill Loop |     *
*  |...|  |...|  |...|  queues            |           +------------+     *
*  |   |  |   |  |   |  per-tag                              ^           *
*  +---+  +---+  +---+                                       |           *
*    |      |      |                                         |           *
*    +------+------+-----------------------------------------+           *
*                queues spilled on overflow                              *
**************************************************************************

## Overview

Previously, spilling worked by writing the data to a SQLite B-tree. The key
would be `(tag, version)`, and the value would be all the mutations destined
for the given tag at the given version. Peek requests have a start version,
the latest version that the storage server knows about, and the TLog responds
by range-reading the B-tree from the start version. Pop requests allow the
TLog to forget all mutations for a tag until a specific version, and the TLog
thus issues a range clear from `(tag, 0)` to `(tag, pop_version)`. After
spilling, the durably written data in the disk queue would be trimmed to only
include from the spilled version on, as any required data is now entirely,
durably held in the B-tree. As the entire value is copied into the B-tree,
this method of spilling will be referred to as *spill-by-value* in the rest of
this document.

Unfortunately, it turned out that spilling in this fashion greatly impacts TLog
performance. A write bandwidth saturation test was run against a cluster, with
a modification to the transaction logs to have them act as if there was one
storage server that was permanently failed; it never sent pop requests to allow
the TLog to remove data from memory. After 15min, the write bandwidth had
reduced to 30% of its baseline. After 30min, that became 10%. After 60min,
that became 5%. Writing entire values gives an immediate 3x additional write
amplification, and the actual write amplification increases as the B-tree gets
deeper. (This is an intentional illustration of the worst case, due to the
workload being a saturating write load.)

With the recent multi-DC/multi-region work, a failure of a remote data center
would cause transaction logs to need to buffer all commits, as every commit is
tagged as destined for the remote datacenter. This would rapidly push
transaction logs into a spilling regime, and thus write bandwidth would begin
to rapidly degrade. It is unacceptable for a remote datacenter failure to so
drastically affect the primary datacenter's performance, so a more performant
way of spilling data is required.
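To make the spill-by-value scheme described above concrete, here is a minimal,
illustrative model of its keying and access pattern, with a `std::map` standing
in for the SQLite B-tree. The names (`btree`, `spillByValue`, `peek`, `pop`)
are invented for illustration and are not the real TLog APIs.

``` CPP
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins; not the real FDB types.
using Tag = uint16_t;
using Version = int64_t;
using BTreeKey = std::pair<Tag, Version>;

// (tag, version) -> all serialized mutations for that tag at that version.
std::map<BTreeKey, std::string> btree;

// Spilling copies the entire value into the B-tree, once per (tag, version).
void spillByValue(Tag tag, Version v, std::string mutations) {
    btree[{tag, v}] = std::move(mutations);
}

// Peek: range-read the B-tree from (tag, startVersion) upward.
std::vector<std::string> peek(Tag tag, Version startVersion) {
    std::vector<std::string> result;
    for (auto it = btree.lower_bound({tag, startVersion});
         it != btree.end() && it->first.first == tag; ++it)
        result.push_back(it->second);
    return result;
}

// Pop: range-clear from (tag, 0) to (tag, popVersion).
void pop(Tag tag, Version popVersion) {
    btree.erase(btree.lower_bound({tag, 0}), btree.lower_bound({tag, popVersion}));
}
```

The write amplification discussed above follows directly from `spillByValue`
copying every mutation byte into the B-tree, one write per `(tag, version)`.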
Whereas spill-by-value copies the entire mutation into the B-tree and removes
it from the disk queue, spill-by-reference leaves the mutations in the disk
queue and writes a pointer to them into the B-tree. Performance experiments
revealed that the TLog's performance while spilling was dictated more by the
number of writes done to the SQLite B-tree than by the size of those writes.
Thus, the fact that spill-by-reference can batch its writes to the B-tree far
more effectively matters more than the fact that it writes less data in
aggregate. Spill-by-reference significantly reduces the volume of data written
to the B-tree, and the less data that we write, the more versions we can batch
together into a single write.

************************************************************************
* DiskQueue                                                            *
*                                                                      *
*   --------- Index in B-tree --------- ------ Index in memory ------  *
*   /                                 \ /                           \  *
*  +-----------------------------------+-----------------------------+ *
*  |           Spilled Data            |      Most Recent Data       | *
*  +-----------------------------------+-----------------------------+ *
*   lowest version                                    highest version  *
*                                                                      *
************************************************************************

Spill-by-reference works by taking a larger range of versions, and building a
single key-value pair per tag that describes where in the disk queue every
relevant commit for that tag is located. Concretely, this takes the form
`(tag, last_version) -> [(version, start, end, mutation_bytes), ...]`, where:

 * `tag` is the small integer representing the storage server this mutation batch is destined for.
 * `last_version` is the last/maximum version contained in the value's batch.
 * `version` is the version of the commit that this index entry points to.
 * `start` is an index into the disk queue of where to find the beginning of the commit.
 * `end` is an index into the disk queue of where the end of the commit is.
 * `mutation_bytes` is the number of bytes in the commit that are relevant for this tag.

Each iteration through spilling then writes to the B-tree only once per spilled
tag. This turns the number of writes to the B-tree from `O(tags * versions)`
into `O(tags)`.

Note that each tuple in the list represents a commit, and not a mutation. This
means that peeking spilled commits will involve reading all mutations that were
a part of the commit, and then filtering them to only the ones that have the
tag of interest. Alternatively, one could have each tuple represent a mutation
within a commit, to prevent over-reading when peeking. There exist
pathological workloads for each strategy. The purpose of this work is most
importantly to support spilling of log router tags. These exist on every
mutation, so that each mutation gets copied to other datacenters. This is the
exact pathological workload for recording each mutation individually, because
it only increases the number of IO operations used to read the same amount of
data. For a wider set of workloads, there's room to establish a heuristic as
to when to record mutation(s) versus the entire commit, but performance testing
hasn't surfaced this as important enough to include in the initial version of
this work.

Peeking spilled data now works by issuing a range read to the B-tree from
`(tag, peek_begin)` to `(tag, infinity)`. This is why the key contains the
last version of the batch, rather than the beginning, so that a range read from
the peek request's version will always return all relevant batches.
For each
batched tuple, if the version is greater than or equal to our peek request's
version, then we read the commit containing that mutation from disk, extract
the relevant mutations, and append them to our response. There is a target
size of the response, 150KB by default. As we iterate through the tuples, we
sum `mutation_bytes`, which already informs us how many bytes of relevant
mutations we'll get from a given commit. This allows us to make sure we won't
waste disk IOs on reads that will end up being discarded as unnecessary.

Popping spilled data works similarly to before, but now requires recovering
information from disk. Previously, we would maintain a map from version to
location in the disk queue for every version we hadn't yet spilled. Once
spilling has copied the value into the B-tree, knowing where the commit was in
the disk queue is useless to us, and that information is removed. In
spill-by-reference, that information is still needed to know how to map "pop
until version 7" to "pop until byte 87" in the disk queue. Unfortunately,
keeping this information in memory would result in TLogs slowly consuming more
and more memory[^versionmap-memory] as more data is spilled. Instead, we issue
a range read of the B-tree from `(tag, pop_version)` to `(tag, infinity)` and
look at the first commit we find with a version greater than our own. We then
use its starting disk queue location as the limit up to which we could pop the
disk queue for this tag.

[^versionmap-memory]: Pessimistic assumptions would suggest that a TLog spilling 1TB of data would require ~50GB of memory to hold this map, which isn't acceptable.

## Detailed Implementation

The rough outline of concrete changes proposed looks like:

1. Allow a new TLog and old TLog to co-exist and be configurable, upgradeable, and recoverable
2. Modify spilling in the new TLogServer
3. Modify peeking in the new TLogServer
4. Modify popping in the new TLogServer
5. Spill txsTag specially

### Configuring and Upgrading

Modifying how transaction logs spill data is a change to the on-disk files of
transaction logs. The work for enabling safe upgrades and rollbacks of
persistent state changes to transaction logs was split off into a separate
design document: "Forward Compatibility for Transaction Logs".

That document describes a `log_version` configuration setting that controls the
availability of new transaction log features. A similar configuration setting,
`log_spill`, was created: at `log_version >= 3`, one may
`fdbcli> configure log_spill:=2` to enable spill-by-reference. Only FDB 6.1 or
newer will be able to recover transaction log files that were using
spill-by-reference. FDB 6.2 will use spill-by-reference by default.

| FDB Version | Default | Configurable |
|-------------|---------|--------------|
| 6.0         | No      | No           |
| 6.1         | No      | Yes          |
| 6.2         | Yes     | Yes          |

If running FDB 6.1, the full command to enable spill-by-reference is
`fdbcli> configure log_version:=3 log_spill:=2`.

The TLog implementing spill-by-value was moved to `OldTLogServer_6_0.actor.cpp`
and namespaced similarly. `tLogFnForOptions` takes a `TLogOptions`, which is
the version and spill type, and returns the correct TLog implementation
according to those settings. We maintain a map of
`(TLogVersion, StoreType, TLogSpillType)` to TLog instance, so that only one
SharedTLog exists per configuration variant.
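The per-variant selection can be pictured roughly as below. This is only a
sketch under stated assumptions: the enum values, the `SharedTLog` stand-in,
and `getSharedTLog` are invented names, and the real code dispatches through
`tLogFnForOptions` to the appropriate TLog actor implementation.

``` CPP
#include <map>
#include <memory>
#include <stdexcept>
#include <tuple>

// Illustrative enums and factory; not the real FDB definitions.
enum class TLogVersion { V2, V3 };
enum class StoreType { SSD, Memory };
enum class TLogSpillType { VALUE, REFERENCE };

struct SharedTLog {};  // stand-in for the state shared across generations

using TLogConfigKey = std::tuple<TLogVersion, StoreType, TLogSpillType>;

// One SharedTLog per configuration variant seen on this worker.
std::map<TLogConfigKey, std::shared_ptr<SharedTLog>> sharedTLogs;

std::shared_ptr<SharedTLog> getSharedTLog(TLogVersion v, StoreType s, TLogSpillType spill) {
    // Spill-by-reference only exists for log_version >= 3.
    if (spill == TLogSpillType::REFERENCE && v < TLogVersion::V3)
        throw std::invalid_argument("log_spill:=2 requires log_version >= 3");
    auto& slot = sharedTLogs[{v, s, spill}];
    if (!slot)
        slot = std::make_shared<SharedTLog>();  // in reality, start the matching TLog actor
    return slot;
}
```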
### Generations

As background, each time FoundationDB goes through a recovery, it will recruit
a new generation of transaction logs. This new generation of transaction logs
will often be recruited on the same worker that hosted the previous
generation's transaction log. The old generation of transaction logs will only
shut down once all the data that they have has been fully popped. This means
that there can be multiple instances of a transaction log in the same process.

Naively, this would create resource issues. Each instance would think that it
is allowed its own 1.5GB buffer of in-memory mutations. Instead, internally to
the TLog implementation, the transaction log is split into two parts. A
`SharedTLog` is all the data that should be shared across multiple generations.
A TLog is all the data that is private to one generation. Most notably, the
1.5GB mutation buffer and the on-disk files are owned by the `SharedTLog`. The
index for the data added to that buffer is maintained within each TLog. In the
code, a SharedTLog is `struct TLogData`, and a TLog is `struct LogData`.
(I didn't choose these names.)

This background is required, because one needs to keep in mind that we might be
committing in one TLog instance, a different one might be spilling, and yet
another might be the one popping data.

************************************************************
* SharedTLog                                               *
*                                                          *
*    +--------+--------+--------+--------+--------+        *
*    | TLog 1 | TLog 2 | TLog 3 | TLog 4 | TLog 5 |        *
*    +--------+--------+--------+--------+--------+        *
*      ^ popping        ^spilling         ^committing      *
************************************************************

Conceptually, this is because each TLog owns a separate part of the same Disk
Queue file. The earliest TLog instance needs to be the one that controls when
the earliest part of the file can be discarded. We spill in version order, and
thus whatever TLog is responsible for the earliest unspilled version needs to
be the one doing the spilling. We always commit the newest version, so the
newest TLog must be the one writing to the disk queue and inserting new data
into the buffer of mutations.


### Spilling

`updatePersistentData()` is the core of the spilling loop. It takes a new
persistent data version, writes the in-memory index for all commits less than
that version to disk, and then removes them from memory. By contract, once
spilling commits an updated persistentDataVersion to the B-tree, those bytes
will not need to be recovered into memory after a crash, nor will the
in-memory bytes be needed to serve a peek response.

Our new method of spilling iterates through each tag, and builds up a
`vector<SpilledData>` for each tag, where `SpilledData` is:

``` CPP
struct SpilledData {
    Version version;
    IDiskQueue::location start;
    uint32_t length;
    uint32_t mutationBytes;
};
```

This vector is then serialized, and written to the B-tree as
`(logId, tag, max(SpilledData.version))` = `serialized(vector<SpilledData>)`.

As we iterate through each commit, we record the number of mutation bytes in
that commit that have our tag of interest. This is so that later, peeking can
read exactly the number of commits that it needs from disk.

Although the focus of this project is spilling, the code implementing spilling
itself saw the least amount of total change.
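The resulting write pattern can be sketched as follows. `CommitRecord`,
`writeToBTree`, and `spillByReference` are invented stand-ins for illustration;
the real `updatePersistentData()` performs the equivalent writes against the
SQLite B-tree as part of the spill transaction.

``` CPP
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Tag = uint16_t;
using Version = int64_t;
using DiskQueueLocation = int64_t;  // stand-in for IDiskQueue::location

struct SpilledData {
    Version version;
    DiskQueueLocation start;
    uint32_t length;
    uint32_t mutationBytes;
};

// One commit already sitting in the DiskQueue, with a per-tag byte count.
struct CommitRecord {
    Version version;
    DiskQueueLocation start;
    uint32_t length;
    std::map<Tag, uint32_t> bytesPerTag;
};

// Spill everything at or below newPersistentDataVersion: one key written per tag.
void spillByReference(const std::vector<CommitRecord>& commits,
                      Version newPersistentDataVersion,
                      const std::function<void(Tag, Version, const std::vector<SpilledData>&)>& writeToBTree) {
    std::map<Tag, std::vector<SpilledData>> perTag;
    for (const auto& c : commits) {
        if (c.version > newPersistentDataVersion) break;  // commits are in version order
        for (const auto& [tag, bytes] : c.bytesPerTag)
            perTag[tag].push_back({c.version, c.start, c.length, bytes});
    }
    // O(tags) writes instead of O(tags * versions): key by (tag, max version in batch).
    for (const auto& [tag, spilled] : perTag)
        writeToBTree(tag, spilled.back().version, spilled);
}
```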
### Peeking

A `TLogPeekRequest` contains a `Tag` and a `Version`, and is a request for all
commits with the specified tag with a commit version greater than or equal to
the given version. The goal is to return a 150KB block of mutations.

When servicing a peek request, we will read up to 150KB of mutations from the
in-memory index. If the peek version is lower than the version that we've
spilled to disk, then we consult the on-disk index for up to 150KB of
mutations. (If we tried to read from disk first, and then read from memory, we
would be racing with the spilling loop moving data from memory to disk.)

****************************************************************************
*                                                                          *
*   +---------+   Tag   +---------+       Tag         +--------+           *
*   | Peek    |-------->| Spilled | ...--------------->| Memory |          *
*   | Request | Version | Index   |       Version     | Index  |           *
*   +---------+         +---------+                   +--------+           *
*                            |                            |                *
*          +-----------------+-----------------+          |                *
*          | Start=100       | Start=500       | Start=900| Ptr=0xF00      *
*          | Length=50       | Length=70       | Length=30| Length=30      *
*          v                 v                 v          v                *
*  +---------------------------------------------------+----------------+  *
*  |                    Disk Queue                     | Also In Memory |  *
*  +---------------------------------------------------+----------------+  *
*                                                                          *
****************************************************************************

Spill-by-value and the memory storage engine only ever read from the DiskQueue
when recovering, and they read the entire file linearly. Therefore,
`IDiskQueue` had no API for random reads to the DiskQueue. That ability is now
required for peeking, and thus `IDiskQueue`'s API has been enhanced
correspondingly:

``` CPP
enum class CheckHashes { NO, YES };

class IDiskQueue {
    // ...
    Future<Standalone<StringRef>> read(location start, location end, CheckHashes ch);
    // ...
};
```

Internally, the DiskQueue adds page headers every 4K, which are stripped out
from the returned data. Therefore, the length of the result will intentionally
not be the same as `end-start`. For this reason, the API is `(start, end)` and
not `(start, length)`.

Spilled data, when using spill-by-value, was resistant to bitrot via data being
checksummed internally within SQLite's B-tree. Now that reads can be done
directly, the responsibility for verifying data integrity falls upon the
DiskQueue. `CheckHashes::YES` will cause the DiskQueue to use the checksum in
each DiskQueue page to verify data integrity. If an externally maintained
checksum exists to verify the returned data, then `CheckHashes::NO` can be used
to elide the checksumming. A page failing its checksum will cause the
transaction log to die with an `io_error()`.

What is read from disk is a `TLogQueueEntry`:

``` CPP
struct TLogQueueEntryRef {
    UID id;
    Version version;
    Version knownCommittedVersion;
    StringRef messages;
};
```

This provides the commit version and the logId of the TLog generation that
produced this commit, in addition to all of the mutations for that version.
(`knownCommittedVersion` is only used during FDB's recovery process.)
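Serving a peek that lands in the spilled region can be modeled as below.
`readSpilledIndex`, `readCommitFromDiskQueue`, and `extractMutationsForTag` are
hypothetical helpers standing in for the B-tree range read, `IDiskQueue::read()`,
and message filtering respectively; the 150KB target and the use of
`mutationBytes` to avoid wasted reads follow the description above.

``` CPP
#include <cstdint>
#include <string>
#include <vector>

using Tag = uint16_t;
using Version = int64_t;
using DiskQueueLocation = int64_t;

struct SpilledData {
    Version version;
    DiskQueueLocation start;
    DiskQueueLocation end;
    uint32_t mutationBytes;  // bytes in this commit relevant to the tag
};

// Hypothetical helpers (declarations only), standing in for the real machinery.
std::vector<SpilledData> readSpilledIndex(Tag tag, Version begin);
std::string readCommitFromDiskQueue(DiskQueueLocation start, DiskQueueLocation end);
std::string extractMutationsForTag(const std::string& commit, Tag tag);

constexpr uint32_t kPeekTargetBytes = 150 * 1024;  // ~150KB reply target

std::string peekSpilled(Tag tag, Version peekBegin) {
    std::string reply;
    uint32_t replyBytes = 0;
    for (const SpilledData& entry : readSpilledIndex(tag, peekBegin)) {
        if (entry.version < peekBegin)
            continue;  // a batch keyed by last_version can contain older commits
        if (replyBytes + entry.mutationBytes > kPeekTargetBytes)
            break;     // mutationBytes lets us stop before issuing a wasted disk read
        std::string commit = readCommitFromDiskQueue(entry.start, entry.end);
        reply += extractMutationsForTag(commit, tag);
        replyBytes += entry.mutationBytes;
    }
    return reply;
}
```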
### Popping

As storage servers persist data, they send `pop(tag, version)` requests to the
transaction log to notify it that it is allowed to discard data for `tag` up
through `version`. Once all the tags have been popped from the oldest commit
in the DiskQueue, the tail of the DiskQueue can be discarded to reclaim space.

If our popped version is in the range of what has been spilled, then we need to
consult our on-disk index to see what is the next location in the disk queue
that holds data which is useful to us. This act would race with the spilling
loop changing what data is spilled, and thus disk queue popping
(`popDiskQueue()`) was made to run serially after spilling completes.

Also, because spilling and popping largely overlap in the state they touch, the
disk queue popping loop does not immediately react to a pop request from a
storage server changing the popped version for a tag. Spilling saves the
popped version for each tag when the spill loop runs, and if that version
changed, then `popDiskQueue()` refreshes its knowledge of the minimum location
in the disk queue that is required for that tag. We can pop the disk queue to
the minimum of all minimum tag locations, or to the minimum location needed for
an in-memory mutation if there is no spilled data.

As a post-implementation note, this ended up being a "here be dragons"
experience, with a surprising number of edge cases in the races between
spilling and popping, in the various situations of having data, not having
data, or having inaccurate data for a tag, and in the fact that tags can stop
being pushed to when storage servers are removed but their corresponding
`TagData` is never removed.

### Transaction State Store

For FDB to perform a recovery, there is information that it needs to know about
the database, such as the configuration, worker exclusions, backup status, etc.
These values are stored into the database in the `\xff` system keyspace.
However, during a recovery, FDB can't read this data from the storage servers,
because recovery hasn't completed, so it doesn't know who the storage servers
are yet. Thus, a copy of this data is held in memory on every proxy in the
*transaction state store*, and durably persisted as a part of commits on the
transaction logs. Being durably stored on the transaction logs means the list
of transaction logs can be fetched from the coordinators, and then used to load
the rest of the information about the database.

The in-memory storage engine writes an equal amount of mutation and snapshot
data to a queue, and when a full snapshot of the data has been written, deletes
the preceding snapshot and begins writing a new one. When backing an in-memory
storage engine with the transaction logs, the `LogSystemDiskQueueAdapter`
implements writing to a queue as committing mutations to the transaction logs
with a special tag of `txsTag`, and deleting the preceding snapshot as popping
the transaction logs for the tag of `txsTag` until the version where the last
full snapshot began.

This means that unlike every other commit that is tagged and stored on the
transaction logs, `txsTag` signifies data that is:

1. Committed to infrequently
2. Only peeked on recovery
3. Popped infrequently, and a large portion of the data is popped at once
4. A small total volume of data

The most problematic of these is the infrequent popping. Unpopped data will be
spilled after some time, and if `txsTag` data is spilled and not popped, it
will prevent the DiskQueue from being popped as well. This will cause the
DiskQueue to grow continuously. The infrequent commits and small data volume
mean that the benefits of spill-by-reference over spill-by-value don't apply
for this tag.

Thus, even when configured to spill-by-reference, `txsTag` is spilled by value.
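A minimal sketch of the resulting special case, with invented names
(`TLogSpillType`, `TagData::isTxsTag`, `spillTypeFor`) rather than the real
TLog internals:

``` CPP
// Illustrative decision only; names are stand-ins, not the real FDB code.
enum class TLogSpillType { VALUE, REFERENCE };

struct TagData {
    bool isTxsTag = false;  // transaction state store tag
    // ...
};

// Even when the TLog is configured for spill-by-reference, txsTag data is
// copied into the B-tree (spill-by-value) so that its rarely-popped data
// never pins the tail of the DiskQueue.
TLogSpillType spillTypeFor(const TagData& tag, TLogSpillType configured) {
    if (tag.isTxsTag)
        return TLogSpillType::VALUE;
    return configured;
}
```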
### Disk Queue Recovery

If a transaction log dies and restarts, all commits that were in memory at the
time of the crash must be loaded back into memory. Recovery is blocked on this
process, as there might have been a commit to the transaction state store
immediately before crashing, and that data needs to be fully readable during a
recovery.

In spill-by-value, the DiskQueue only ever contained commits that were also
held in memory, and thus recovery would need to read up to 1.5GB of data. With
spill-by-reference, the DiskQueue could theoretically contain terabytes of
data. To keep recovery times bounded, FDB must still only read the commits
that need to be loaded back into memory.

This is done by persisting the location in the DiskQueue of the last spilled
commit to the SQLite B-tree, in the same transaction as the spilling of that
commit. This provides an always-accurate pointer to where the data that needs
to be loaded into memory begins. The pointer is to the beginning of the last
spilled commit rather than the end, to make sure that the pointer is always
contained within the DiskQueue. This provides extra sanity checking on the
validity of the DiskQueue's contents at recovery, at the cost of potentially
reading 10MB more than what would be required.

## Testing

Correctness bugs in spilling would manifest as data corruption, which is well
covered by simulation. The only special testing code added was to enable
changing `log_spill` in `ConfigureTest`. This covers switching between
spilling methods in the presence of faults.

An `ASSERT` was added to simulation that verifies that the commits read from
the DiskQueue on recovery are only the commits which have not been spilled.

The rest of the testing is to take a physical cluster and try the extremes that
can only happen at scale:

* Verify that recovery times are not impacted when a large amount of data is spilled.
* Verify that long-running tests hit a steady state of memory usage (and thus there are likely no leaks).
* Plot how quickly (MB/s) a remote datacenter can catch up in the old vs. new spilling strategy.
* See what happens when there's 1 tlog and more than 100 storage servers.
  * Verify that peek requests get limited.
  * See if tlog commits can get starved by excessive peeking.

# TLog Spill-By-Reference Operational Guide

## Notable Behavior Changes

TL;DR: Spilling involves fewer IOPS and is faster. Peeking involves more IOPS
and is slower. Popping involves >0 IOPS.

### Spilling

The most notable effect of the spilling changes is that the Disk Queue files
will now grow to potentially terabytes in size.

1. Spilling will occur in larger batches, which will result in a more
   sawtooth-like `BytesInput - BytesDurable` value. I'm not aware that this
   will have any meaningful impact.

   * Disk queue files will grow when spilling is happening.
   * Alerting based on DQ file size is no longer appropriate.

As a curious aside, throughput decreases as spilled volume increases, which
quite possibly worked as accidental backpressure. As a feature, this no longer
exists, which means write-heavy workloads can drown storage servers faster than
before.

### Peeking

Peeking has seen tremendous changes. It involves more IO operations and more
memory usage.

The expected implications of this are:

1. A peek of spilled data will involve a burst of IO operations.

   Theoretically, this burst can drown out queued write operations to disk,
   thus slowing down TLog commits.
   This hasn't been observed in testing.

   Low-IOPS devices, such as HDDs or network-attached storage, would struggle
   more here than locally attached SSDs.

2. Generating a peek response of 150KB could require reading 100MB of data,
   and allocating buffers to hold that 100MB.

   OOMs were observed in early testing. Code has been added to specifically
   limit how much memory can be allocated for serving a single peek request
   and for all concurrent peek requests, with knobs to allow tuning this per
   deployment configuration.

### Popping

Popping will transition from being an in-memory-only operation to one that can
involve reads from disk if the popped tag has spilled data.

Due to a strange quirk, TLogs will allocate up to 2GB of memory as a read cache
for SQLite's B-tree. The expected maximum size of the B-tree has drastically
reduced, so these reads should almost never actually hit disk. The number of
writes to disk will stay the same, so performance should stay unchanged.

### Disk Queues

This work should have a minimal impact on recovery times, which is why recovery
hasn't been significantly mentioned in this document. However, there are a few
minor impacts on recovery times:

1. A larger disk queue file means more file to zero out in the case of
   recovery.

   This should be negligible when fallocate `ZERO_RANGE` is available, because
   then it's only a metadata operation.

2. A larger file means more bisection iterations to find the first page.

   If we say Disk Queue files are typically ~4GB now, and people are unlikely
   to have more than 4TB drives, then this means in the worst case, another 8
   sequential IOs will need to be done when first recovering a disk queue file
   to find the most recent page with a binary search.

   If this turns out to be an issue, it's trivial to address. There's no
   reason to do only a binary search when drives support parallel requests. A
   32-way search could reasonably be done, and would make a 4TB Disk Queue
   file faster to recover than a 4GB one currently.

3. Disk queue files can now shrink.

   The particular logic currently used is:

   If one file is significantly larger than the other file, then it will be
   truncated to the size of the other file. This resolves situations where a
   particular storage server or remote DC being down causes one DiskQueue file
   to be grown to a massive size, and then the data is rapidly popped.

   Otherwise, if the files are of reasonably similar size, then we'll take
   `pushLocation - popLocation` as the number of "active" bytes, and then
   shrink the file by `TLOG_DISK_QUEUE_SHRINK_BYTES` bytes if the file is
   larger than `active + TLOG_DISK_QUEUE_EXTENSION_BYTES + TLOG_DISK_QUEUE_SHRINK_BYTES`.

!!! note
    While writing this, I've realized it's probably a good idea to limit that
    the disk queue can't shrink under 4GB of size, to prevent size thrashing on
    bursty workloads.

## Knobs

`REFERENCE_SPILL_UPDATE_STORAGE_BYTE_LIMIT`
:   How many bytes of mutations should be spilled at once in a spill-by-reference TLog.
    Increasing it could increase throughput in spilling regimes.
    Decreasing it will decrease how sawtooth-like TLog memory usage is.

`UPDATE_STORAGE_BYTE_LIMIT`
:   How many bytes of mutations should be spilled at once in a spill-by-value TLog.
    This knob is pre-existing, and has only been "changed" to only apply to spill-by-value.

`TLOG_SPILL_REFERENCE_MAX_BATCHES_PER_PEEK`
:   How many batches of the spilled-data index should be read from disk to serve one peek request.
    Increasing it will potentially increase the throughput of peek requests.
    Decreasing it will decrease the number of read IOs done per peek request.

`TLOG_SPILL_REFERENCE_MAX_BYTES_PER_BATCH`
:   How many bytes a batch of spilled-data indexes can be.
    Increasing it will increase TLog throughput while spilling.
    Decreasing it will decrease the latency and increase the throughput of peek requests.

`TLOG_SPILL_REFERENCE_MAX_PEEK_MEMORY_BYTES`
:   How many bytes of memory can be allocated to hold the results of reads from disk to respond to peek requests.
    Increasing it will increase the number of parallel peek requests a TLog can handle at once.
    Decreasing it will reduce TLog memory usage.
    If increased, `--max_memory` should be increased by the same amount.
    (A sketch of this memory budget appears after this list.)

`TLOG_DISK_QUEUE_EXTENSION_BYTES`
:   When a DiskQueue needs to extend a file, by how many bytes should it extend the file.
    Increasing it will reduce metadata operations done to the drive, and will likely reduce tail commit latency.
    Decreasing it will reduce allocated but unused space in the DiskQueue files.
    Note that this was previously hardcoded to 20MB, and is only being promoted to a knob.

`TLOG_DISK_QUEUE_SHRINK_BYTES`
:   If a DiskQueue file has extra space left when switching to the other file, by how many bytes should it be shrunk.
    Increasing this will cause disk space to be returned to the OS faster.
    Decreasing this will decrease TLog tail latency due to filesystem metadata updates.
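As mentioned in the peeking section, disk reads done on behalf of peeks are
limited by a memory budget (`TLOG_SPILL_REFERENCE_MAX_PEEK_MEMORY_BYTES`), and
stalled requests surface in the `PeekMemoryRequestsStalled` and
`PeekMemoryReserved` metrics described in the next section. The following is a
minimal sketch of such a byte budget, using threads and an invented
`PeekMemoryBudget` class; the real implementation is built on flow futures
rather than blocking threads.

``` CPP
#include <condition_variable>
#include <cstdint>
#include <mutex>

class PeekMemoryBudget {
public:
    explicit PeekMemoryBudget(int64_t limitBytes) : limit_(limitBytes) {}

    // Block (stall the peek) until `bytes` can be reserved under the limit.
    void acquire(int64_t bytes) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return reserved_ + bytes <= limit_; });
        reserved_ += bytes;  // analogous to what PeekMemoryReserved reports
    }

    // Return the reservation once the peek reply has been serialized and sent.
    void release(int64_t bytes) {
        {
            std::lock_guard<std::mutex> lk(m_);
            reserved_ -= bytes;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int64_t reserved_ = 0;
    const int64_t limit_;  // e.g. TLOG_SPILL_REFERENCE_MAX_PEEK_MEMORY_BYTES
};
```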
## Observability

With the new changes, we must ensure that sufficient information has been
exposed such that:

1. If something goes wrong in production, we can understand what and why from trace logs.
2. We can understand if the TLog is performing suboptimally, and if so, which knob we should change and by how much.

The following metrics were added to `TLogMetrics`:

### Spilling

### Peeking

`PeekMemoryRequestsStalled`
:   The number of peek requests that are blocked on acquiring memory for reads.

`PeekMemoryReserved`
:   The amount of memory currently reserved for serving peek requests.

### Popping

`QueuePoppedVersion`
:   The oldest version that's still useful.

`MinPoppedTagLocality`
:   The locality of the tag that's preventing the DiskQueue from being further popped.

`MinPoppedTagId`
:   The id of the tag that's preventing the DiskQueue from being further popped.

## Monitoring and Alerting

The questions to answer here are:

1. What new graphs should exist?
2. What old graphs might exist that would no longer be meaningful?
3. What alerts might exist that need to be changed?
4. What alerts should be created?

Of these, the changes I'm aware of are:

* Any current alerts on "Disk Queue files more than [constant size] GB" will need to be removed.
* Any alerting or monitoring of `log*.sqlite` as an indication of spilling will no longer be effective.
* A graph of `BytesInput - BytesPopped` will give an idea of the number of "active" bytes in the DiskQueue file.