foundationdb/design/transaction-state-store.md

9.3 KiB

Transaction State Store (txnStateStore)

This document describes the transaction state store (often is referred as txnStateStore in the code) in FDB. The transaction state store keeps important metadata about the database to bootstrap the database, to guide the transaction system to persist writes (i.e., help assign storage tags to mutations at commit proxies), and to manage data (i.e., shard) movement metadata. This is a critical piece of information that have to be consistent across many processes and to be persistent for recovery.

Acknowledgment: A lot of contents are taken from Evan's FDB brownbag talk.

What information is stored in transaction state store?

The information includes: shard mapping (key range to storage server mapping, i.e., keyServers), storage server tags (serverTags), tagLocalityList, storage server tag history, database locked flag, metadata version, mustContainSystemMutations, coordinators, storage server interface (serverList), database configurations, TSS mappings and quarantines, backup apply mutation ranges and log ranges, etc.

The information of transaction state store is kept in the system key space, i.e., using the \xff prefix. Note all data in the system key space are saved on storage servers. The txnStateStore is only a part of the \xff key space, and is additionally kept in the memory of commit proxies as well as disks of the log system (i.e., TLogs). Changes to the txnStateStore are special mutations to the \xff key space, and are called inconsistently in the code base as "metadata mutations" in commit proxies and "state mutations" in Resolvers.

Why do we need transaction state store?

When bootstraping an FDB cluster, the cluster controller (CC) role recruits a new transaction system and initializes them. In particular, the transaction state store is first read by the CC from previous generation's log system, and then broadcast to all commit proxies of the new transaction system. After initializing txnStateStore, these commit proxies know how to assign mutations with storage server tags: txnStateStore contains the shard map from key range to storage servers; commit proxies use the shard map to find and attach the destination storage tags for each mutation.

How is transaction state store replicated?

The txnStateStore is replicated in all commit proxies' memories. It is very important that txnStateStore data are consistent, otherwise, a shard change issued by one commit proxy could result in a situation where different proxies think they should send a mutation to different storage servers, thus causing data corruptions.

FDB solves this problem by state machine replication: all commit proxies start with the same txnStateStore data (from CC broadcast), and apply the same sequence of mutations. Because commits happen at all proxies, it is difficult to maintain the same order as well as minimize the communication among them. Fortunately, every transaction has to send a conflict resolution request to all Resolvers and they process transactions in strict order of commit versions. Leveraging this mechanism, each commit proxy sends all metadata (i.e., system key) mutations to all Resolvers. Resolvers keep these mutations in memory and forward to other commit proxies in separate resolution response. Each commit proxy receive resolution response, along with metadata mutations happend at other proxies before its commit version, and apply all these metadata mutations in the commit order. Finally, this proxy only writes metadata mutations in its own transaction batch to TLogs, i.e., do not write other proxies' metadata mutations to TLogs to avoid repeated writes.

It's worth calling out that everything in the txnStateStore is stored at some storage servers and a client (e.g., fdbcli) can read from these storage servers. During the commit process, commit proxies parse all mutations in a batch of transactions, and apply changes (i.e., metadata mutations) to its in-memory copy of txnStateStore. Later, the same changes are applied at storage servers for persistence. Additionally, the process to store txnStateStore at log system is described below.

Notably applyMetadataMutations() is the function that commit proxies use to make changes to txnStateStore. The key ranges stored in txnStateStore include [\xff, \xff\x02) and [\xff\x03, \xff\xff), but not everything in these ranges. There is no data in the range of [\xff\x02, \xff\x03) belong to txnStateStore, e.g., \xff\x02 prefix is used for backup data and is NOT metadata mutations.

How is transaction state store persisted at log system?

When a commit proxy writes metadata mutations to the log system, the proxy assigns a "txs" tag to the mutation. Depending on FDB versions, the "txs" tag can be one special tag txsTag{ tagLocalitySpecial, 1 } for TLogVersion::V3 (FDB 6.1) or a randomized "txs" tag for TLogVersion::V4 (FDB 6.2 and later) and larger. The idea of randomized "txs" tag is to spread metadata mutations to all TLogs for faster parallel recovery of txnStateStore.

At TLogs, all mutation data are indexed by tags. "txs" tag data is special, since it is only peeked by the cluster controller (CC) during the transaction system recovery. See TLog Spilling doc for more detailed discussion on the topic of spilling "txs" data. In short, txsTag is spilled by value. "txs" tag data is indexed and stored in both primary TLogs and satellite TLogs. Note satellite TLogs only index log router tags and "txs" tags.

How is transaction state store implemented?

txnStateStore is kept in memory at commit proxies using KeyValueStoreMemory, which uses LogSystemDiskQueueAdapter to be durable with the log system. As a result, reading from txnStateStore never blocks, which means the futures returned by read calls should always be ready. Writes to txnStateStore are first buffered by the LogSystemDiskQueueAdapter in memory. After a commit proxy pushes transaction data to the log system and the data becomes durable, the proxy clears the buffered data in LogSystemDiskQueueAdapter.

Note KeyValueStoreMemory periodically checkpoints its data to disks to speed up its recovery. For txnStateStore, this checkpoint allows versions before it to be popped, i.e., "txs" tag data on the log system can be reclaimed so that log queue does not grow unbounded. Thus, the txsPoppedVersion is the version right before a checkpoint, and during recovery, txnStateStore is recovered by reading all data from this version.