Add transaction state store documentation
With code pointers.
parent 9ab14d8577
commit f6dc54ebbe
@ -0,0 +1,99 @@
# Transaction State Store (txnStateStore)

This document describes the transaction state store (often referred to as `txnStateStore` in the code) in FDB. The transaction state store keeps important metadata about the database: it is used to bootstrap the database, to guide the transaction system in persisting writes (i.e., to help commit proxies assign storage tags to mutations), and to manage data (i.e., shard) movement. This metadata is critical: it has to be consistent across many processes and persistent for recovery.

Acknowledgment: Much of the content is taken from [Evan's FDB brownbag talk](https://drive.google.com/file/d/15UvKiNc-jSFfDGygNmLQP_d4b14X3DAS/).
## What information is stored in transaction state store?

The information includes: the shard mapping (key range to storage server mapping, i.e.,
`keyServers`), storage server tags (`serverTags`), `tagLocalityList`, storage server tag
history, the database locked flag, the metadata version, `mustContainSystemMutations`,
coordinators, storage server interfaces (`serverList`), database configurations, TSS
mappings and quarantines, backup apply mutation ranges and log ranges, etc.

The transaction state store is kept in the system key space, i.e., under the
`\xff` prefix. Note that all data in the system key space are saved on storage servers. The
`txnStateStore` is only a part of the `\xff` key space, and is additionally kept in the
memory of commit proxies as well as on the disks of the log system (i.e., TLogs). Changes to
the `txnStateStore` are special mutations to the `\xff` key space, and are named
inconsistently in the code base: "metadata mutations" in commit proxies and
"state mutations" in Resolvers.
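
As a rough illustration of which writes count as changes to the `txnStateStore`, the sketch below classifies a key by its prefix. It is a simplified Python model, not FDB code: the function name `is_metadata_key` and the example keys are hypothetical, range clears are ignored, and the exclusion of the `\xff\x02` backup prefix follows the note in the replication section of this document.

```python
SYSTEM_PREFIX = b"\xff"            # system key space
NON_METADATA_PREFIX = b"\xff\x02"  # backup data, excluded from metadata mutations


def is_metadata_key(key: bytes) -> bool:
    """Return True if a write to `key` is a metadata mutation in this model."""
    return key.startswith(SYSTEM_PREFIX) and not key.startswith(NON_METADATA_PREFIX)


# Example keys are illustrative only.
for key in (b"\xff/keyServers/apple", b"\xff\x02/backup", b"user-key"):
    print(key, is_metadata_key(key))
```

Only the first example key is classified as a metadata mutation; the backup key and the user key are not.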
## Why do we need transaction state store?

When bootstrapping an FDB cluster, the new master (i.e., the sequencer) role recruits a
new transaction system and initializes it. In particular, the transaction state store
is first read by the master from the previous generation's log system, and then broadcast to
all commit proxies of the new transaction system. After initializing `txnStateStore`, these
commit proxies know how to assign storage server tags to mutations: `txnStateStore`
contains the shard map from key ranges to storage servers; commit proxies use the shard
map to find and attach the destination storage tags for each mutation.
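
The shard map lookup described above can be sketched as a sorted list of range boundaries searched by binary search. This is a minimal Python model under stated assumptions, not the `keyInfo` implementation: `ShardMap`, `tag_mutations`, and the string tags `"ss0"`, `"ss1"`, `"ss2"` are hypothetical names chosen for illustration.

```python
from bisect import bisect_right


class ShardMap:
    """Maps key ranges to storage tags (illustrative model only)."""

    def __init__(self, boundaries):
        # boundaries: sorted list of (begin_key, tags). Each shard covers
        # [begin_key, next begin_key); the first begin_key must be b"".
        self._begins = [begin for begin, _ in boundaries]
        self._tags = [tags for _, tags in boundaries]

    def tags_for_key(self, key):
        # Find the last shard whose begin key is <= key.
        index = bisect_right(self._begins, key) - 1
        return self._tags[index]


def tag_mutations(shard_map, mutations):
    """Attach destination storage tags to each (key, value) mutation."""
    return [(key, value, shard_map.tags_for_key(key)) for key, value in mutations]


shard_map = ShardMap([(b"", ["ss0", "ss1"]), (b"m", ["ss1", "ss2"])])
tagged = tag_mutations(shard_map, [(b"apple", b"1"), (b"melon", b"2")])
for key, value, tags in tagged:
    print(key, value, tags)
```

In this model a shard move is just a change to the boundary list, which is why every proxy must see the same sequence of changes to it.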
## How is transaction state store replicated?

The `txnStateStore` is replicated in the memory of all commit proxies. It is very important
that `txnStateStore` data are consistent; otherwise, a shard change issued by one commit
proxy could result in different proxies sending a mutation to different storage servers,
thus causing data corruption.

FDB solves this problem with state machine replication: all commit proxies start with the
same `txnStateStore` data (from the master broadcast), and apply the same sequence of mutations.
Because commits happen at all proxies, it is difficult to maintain the same order while
minimizing the communication among them. Fortunately, every transaction has to send a
conflict resolution request to all Resolvers, and they process transactions in strict order
of commit versions. Leveraging this mechanism, each commit proxy sends all metadata
(i.e., system key) mutations to all Resolvers. Resolvers keep these mutations in memory
and forward them to other commit proxies in separate resolution responses. Each commit proxy
receives the resolution response, along with the metadata mutations that happened at other
proxies before its commit version, and applies all these metadata mutations in commit order.
Finally, the proxy writes only the metadata mutations in its own transaction batch to TLogs,
i.e., it does not write other proxies' metadata mutations, to avoid repeated writes.
Notably, the `\xff\x02` prefix is used for backup data, and mutations to it are *NOT* metadata mutations.
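
The replication scheme above can be sketched with a toy model in which one Resolver relays metadata mutations between two commit proxies. This is a hedged Python illustration, not FDB code: the classes `Resolver` and `CommitProxy`, their methods, and the example key and values are hypothetical. It assumes a single Resolver, requests arriving in commit-version order, and a Resolver that never trims its log.

```python
class Resolver:
    """Keeps metadata mutations in commit-version order (toy model)."""

    def __init__(self):
        self._log = []        # (version, proxy_id, mutations)
        self._last_seen = {}  # proxy_id -> last version that proxy resolved

    def resolve(self, proxy_id, version, mutations):
        # Requests are assumed to arrive in strictly increasing version order.
        assert not self._log or version > self._log[-1][0]
        since = self._last_seen.get(proxy_id, 0)
        # Mutations from other proxies that this proxy has not seen yet.
        missed = [(v, m) for v, p, m in self._log if v > since and p != proxy_id]
        self._log.append((version, proxy_id, mutations))
        self._last_seen[proxy_id] = version
        return missed


class CommitProxy:
    def __init__(self, proxy_id, initial_state):
        self.proxy_id = proxy_id
        self.state = dict(initial_state)  # copy of the recovery broadcast
        self.to_tlog = []                 # only this proxy's own mutations

    def commit(self, resolver, version, mutations):
        missed = resolver.resolve(self.proxy_id, version, mutations)
        for _, other_mutations in missed:  # earlier versions first
            self._apply(other_mutations)
        self._apply(mutations)
        self.to_tlog.append((version, mutations))

    def _apply(self, mutations):
        for key, value in mutations:
            self.state[key] = value


initial = {b"\xff/keyServers/a": b"ss0"}
resolver = Resolver()
p1, p2 = CommitProxy("p1", initial), CommitProxy("p2", initial)
p1.commit(resolver, 1, [(b"\xff/keyServers/a", b"ss0")])
p2.commit(resolver, 2, [(b"\xff/keyServers/a", b"ss1")])
p1.commit(resolver, 3, [])  # p1 learns about version 2 on its next commit
print(p1.state == p2.state)
```

Both proxies end with the same state because each applies the same mutations in the same version order, while `to_tlog` on each proxy holds only its own batches.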
## How is transaction state store persisted?

When a commit proxy writes metadata mutations to the log system, the proxy assigns a
"txs" tag to the mutation. Depending on the FDB version, the "txs" tag is either the single
special tag `txsTag{ tagLocalitySpecial, 1 }` for `TLogVersion::V3` (FDB 6.1), or a randomized
"txs" tag for `TLogVersion::V4` (FDB 6.2) and later. The idea of the randomized
"txs" tag is to spread metadata mutations across all TLogs for faster parallel recovery of
`txnStateStore`.

At TLogs, all mutation data are indexed by tags. "txs" tag data is special, since it is
only peeked by the master during transaction system recovery.
See the [TLog Spilling doc](tlog-spilling.md.html) for a more detailed discussion of
spilling "txs" data. In short, `txsTag` is spilled by value.
"txs" tag data is indexed and stored in both primary TLogs and satellite TLogs.
Note that satellite TLogs only index log router tags and "txs" tags.
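
The version-dependent choice of "txs" tag described at the start of this section can be sketched as follows. This is an illustrative Python model only: `Tag`, `pick_txs_tag`, the string localities `"special"` and `"txs"`, and the `num_txs_tags` parameter are hypothetical stand-ins, and the TLog version is modeled as a plain integer.

```python
import random
from collections import namedtuple

Tag = namedtuple("Tag", ["locality", "id"])

# Models txsTag{ tagLocalitySpecial, 1 }; the locality is a placeholder string.
TXS_TAG_V3 = Tag("special", 1)


def pick_txs_tag(tlog_version, num_txs_tags, rng=random):
    """Choose the tag for a metadata mutation (toy model)."""
    if tlog_version <= 3:
        # One fixed tag: all metadata mutations share a single tag.
        return TXS_TAG_V3
    # V4 and later: a random tag id spreads metadata mutations across TLogs.
    return Tag("txs", rng.randrange(num_txs_tags))


print(pick_txs_tag(3, 4))
print(pick_txs_tag(4, 4, random.Random(0)))
```

With a single fixed tag, recovery reads all "txs" data through one tag; spreading tags lets the master peek several TLogs in parallel, which is the motivation stated above.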
## How is transaction state store implemented?

`txnStateStore` is kept in memory at commit proxies using `KeyValueStoreMemory`, which
uses `LogSystemDiskQueueAdapter` to be made durable through the log system. As a result,
reading from `txnStateStore` never blocks, which means the futures returned by read calls
should always be ready. Writes to `txnStateStore` are first buffered in memory by the
`LogSystemDiskQueueAdapter`. After a commit proxy pushes transaction data to the log system
and the data becomes durable, the proxy clears the buffered data in `LogSystemDiskQueueAdapter`.
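
The read and write behavior described above can be sketched with a small in-memory store that buffers writes until a log push becomes durable. This is a hedged Python model, not the `KeyValueStoreMemory` or `LogSystemDiskQueueAdapter` API: the class `BufferedStateStore`, its method names, and the example key are hypothetical.

```python
class BufferedStateStore:
    """In-memory store whose writes are buffered until pushed (toy model)."""

    def __init__(self):
        self._data = {}     # reads are served from memory and never block
        self._pending = []  # writes not yet durable in the log system

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        self._pending.append((key, value))

    def take_pending(self):
        """Return a copy of the buffered writes to send with the next push."""
        return list(self._pending)

    def on_push_durable(self, pushed_count):
        """Drop the writes that the log system has made durable."""
        del self._pending[:pushed_count]


store = BufferedStateStore()
store.set(b"\xff/example", b"1")   # hypothetical key
batch = store.take_pending()       # sent along with the TLog push
store.on_push_durable(len(batch))  # called once the push is durable
print(store.get(b"\xff/example"), store.take_pending())
```

The value stays readable from memory after the buffer is cleared; only the record of what still needs to reach the log system is dropped.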
* Master reads `txnStateStore` from the old log system: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/masterserver.actor.cpp#L928-L931
* Master broadcasts `txnStateStore` to commit proxies: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/masterserver.actor.cpp#L940-L968
* Commit proxies receive the `txnStateStore` broadcast and build the `keyInfo` map: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L1886-L1927
  * Look up the `keyInfo` map for `GetKeyServerLocationsRequest`: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L1464
  * Look up the `keyInfo` map to assign storage tags to mutations: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L926 and https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L965-L1010
* Commit proxies recover the database lock flag and metadata version: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L1942-L1944
* Commit proxies add metadata mutations to the Resolver request: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L137-L140
* Resolvers keep these mutations in memory: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/Resolver.actor.cpp#L220-L230
* Resolvers copy metadata mutations to the resolution reply message: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/Resolver.actor.cpp#L244-L249
* Commit proxies apply all metadata mutations (including those from other proxies) in commit order: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L740-L770
* Commit proxies only write metadata mutations in their own transaction batch to TLogs: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L772-L774 adds mutations to `storeCommits`. Later in `postResolution()`, https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L1162-L1176, only the last one in `storeCommits` is sent to TLogs.
* Commit proxies clear the buffered data in `LogSystemDiskQueueAdapter` after the TLog push: https://github.com/apple/foundationdb/blob/6281e647784e74dccb3a6cb88efb9d8b9cccd376/fdbserver/CommitProxyServer.actor.cpp#L1283-L1287