Merge remote-tracking branch 'origin/master' into bugfixes/macos-literal-string
commit
5b9d0b9f85
|
@ -1,6 +1,6 @@
|
|||
##############################
|
||||
###################################################
|
||||
FDB HA Write Path: How a mutation travels in FDB HA
|
||||
##############################
|
||||
###################################################
|
||||
|
||||
| Author: Meng Xu
|
||||
| Reviewer: Alex Miller, Jingyu Zhou, Lukas Joswiak, Trevor Clinkenbeard
|
||||
|
@ -23,7 +23,7 @@ To simplify the description, we assume the HA cluster has the following configur
|
|||
We describe the background knowledge -- Sharding and Tag structure -- before we discuss how a mutation travels in an FDB HA cluster.
|
||||
|
||||
Sharding: Which shard goes to which servers?
|
||||
=================
|
||||
============================================
|
||||
|
||||
A shard is a contiguous key range. FDB divides the entire keyspace into thousands of shards. A mutation’s key determines which shard it belongs to.
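
The key-to-shard lookup described above can be sketched as an ordered-map search. This is a minimal illustration, not FDB's actual data structure: the names *ShardMap* and *shardForKey* are hypothetical, and the sketch assumes each shard is identified by its inclusive begin key and that an entry for the empty key always exists.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Hypothetical shard map: each entry maps a shard's inclusive begin key to a
// shard id. A shard extends up to (but excludes) the next entry's begin key.
using ShardMap = std::map<std::string, int>;

// Returns the id of the shard whose key range contains `key`.
// Assumption: the map has an entry for the empty key "", so every key is covered.
int shardForKey(const ShardMap& shards, const std::string& key) {
    assert(!shards.empty() && shards.begin()->first.empty());
    // upper_bound finds the first shard that begins strictly after `key`;
    // the entry just before it is the shard that contains `key`.
    auto it = shards.upper_bound(key);
    return std::prev(it)->second;
}
```

With boundaries at "", "g", and "p", a key such as "apple" falls in the first shard, and a key equal to a boundary ("g") belongs to the shard that begins there.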
|
||||
|
||||
|
@ -35,7 +35,7 @@ Shard-to-tLog mapping is decided by shard-to-SS mapping and tLog’s replication
|
|||
|
||||
|
||||
Tag structure
|
||||
=================
|
||||
=============
|
||||
|
||||
Tag is an overloaded term in FDB. Early in FDB's history, a tag was a number used in the SS-to-tag mapping. As FDB evolved, tags came to be used by different components for different purposes:
|
||||
|
||||
|
@ -58,7 +58,7 @@ To simplify our discussion in the document, we use “tag.id” to represent a t
|
|||
|
||||
|
||||
How does a mutation travel in FDB?
|
||||
=================
|
||||
==================================
|
||||
|
||||
To simplify the description, we ignore the batching mechanisms that each component in the data path uses to improve the system’s performance.
|
||||
|
||||
|
@ -67,12 +67,12 @@ Figure 1 illustrates how a mutation is routed inside FDB. The solid lines are as
|
|||
.. image:: /images/FDB_ha_write_path.png
|
||||
|
||||
At Client
|
||||
-----------------
|
||||
---------
|
||||
|
||||
When an application creates a transaction and writes mutations, its FDB client sends the set of mutations to a proxy, say proxy 0. Now let’s focus on one of the normal mutations, say m1, whose key is in the normal keyspace.
|
||||
|
||||
At Proxy
|
||||
-----------------
|
||||
--------
|
||||
|
||||
**Sequencing.** *The proxy first asks the master for the commit version of this transaction batch*. The master acts as a sequencer for FDB transactions: it determines the commit order by assigning a new commit version and reporting the last assigned commit version as the previous commit version. The transaction log system uses the [previous commit version, commit version] pair to determine its commit order, i.e., it makes this transaction durable only after the transaction with the previous commit version is durable.
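
The chaining of [previous commit version, commit version] pairs can be sketched as follows. This is an illustration under stated assumptions, not the FDB implementation: *Sequencer*, *DurableChain*, *CommitVersionPair*, and the fixed version gap of 1 are all hypothetical names and choices.

```cpp
#include <cassert>
#include <cstdint>

using Version = int64_t;

// The [previous commit version, commit version] pair handed to a proxy.
struct CommitVersionPair {
    Version prevVersion;
    Version version;
};

// Hypothetical stand-in for the master's sequencing role.
class Sequencer {
public:
    explicit Sequencer(Version start) : lastAssigned_(start) {}

    // Assigns a new commit version and reports the last assigned version as
    // the previous one. The default gap of 1 is illustrative only.
    CommitVersionPair nextCommitVersion(Version gap = 1) {
        assert(gap > 0);
        CommitVersionPair p{lastAssigned_, lastAssigned_ + gap};
        lastAssigned_ = p.version;
        return p;
    }

private:
    Version lastAssigned_;
};

// Hypothetical stand-in for the log system's ordering rule.
class DurableChain {
public:
    explicit DurableChain(Version start) : durable_(start) {}

    // A batch becomes durable only if its predecessor is already durable.
    bool tryMakeDurable(const CommitVersionPair& p) {
        if (p.prevVersion != durable_) return false; // predecessor not durable yet
        durable_ = p.version;
        return true;
    }

    Version durableVersion() const { return durable_; }

private:
    Version durable_;
};
```

If two batches receive the pairs [100, 101] and [101, 102], the second cannot become durable until the first has, regardless of the order in which they reach the log system.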
|
||||
|
||||
|
@ -102,7 +102,7 @@ Proxy groups mutations with the same tag as messages. Proxy then synchronously p
|
|||
|
||||
|
||||
At primary tLogs and satellite tLogs
|
||||
-----------------
|
||||
------------------------------------
|
||||
|
||||
Once a tLog receives mutations pushed by proxies, it builds an index for each tag’s mutations. Primary tLogs index both log router tags and the primary DC's SS tags. Satellite tLogs index only log router tags.
|
||||
|
||||
|
@ -116,7 +116,7 @@ tLogs also maintain two properties:
|
|||
|
||||
|
||||
At primary SS
|
||||
-----------------
|
||||
-------------
|
||||
|
||||
**Primary tLog of a SS.** Since a SS’s tag is identically mapped to one tLog, that tLog has all mutations for the SS and is the primary tLog for the SS. When the SS peeks data from tLogs, it prefers to peek from its primary tLog. If the primary tLog crashes, the SS contacts the rest of the tLogs, asks for mutations with its tag, and merges them together. This complex merge operation is abstracted in the TagPartitionedLogSystem interface.
|
||||
|
||||
|
@ -128,7 +128,7 @@ Now let’s look at how the mutation m1 is routed to the remote DC.
|
|||
|
||||
|
||||
At log router
|
||||
-----------------
|
||||
-------------
|
||||
|
||||
Log routers consume from either satellite tLogs or primary tLogs, as controlled by the knob LOG_ROUTER_PEEK_FROM_SATELLITES_PREFERRED. By default, the knob is configured so that log routers use satellite tLogs. This relationship is similar to that between primary SSes and primary tLogs.
|
||||
|
||||
|
@ -138,7 +138,7 @@ Log router buffers its mutations in memory and waits for the remote tLogs to pee
|
|||
|
||||
|
||||
At remote tLogs
|
||||
-----------------
|
||||
---------------
|
||||
|
||||
Remote tLogs are consumers of log routers. Each remote tLog keeps pulling mutations, which have the remote tLog’s tag, from log routers. Because log router tags are randomly chosen for mutations, a remote tLog’s mutations can spread across all log routers. So each remote tLog must contact all log routers for its data and merge these mutations in increasing order of versions on the remote tLog.
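
The merge step described above amounts to a k-way merge over per-log-router streams. The sketch below is illustrative only: *VersionedMutation* and *mergeByVersion* are hypothetical names, and it assumes each log router's stream is already sorted by version.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record: a mutation payload stamped with its commit version.
struct VersionedMutation {
    int64_t version;
    std::string payload;
};

// Merges one stream per log router into a single stream in increasing version
// order. Assumption: every input stream is already sorted by version.
std::vector<VersionedMutation> mergeByVersion(
    const std::vector<std::vector<VersionedMutation>>& streams) {
    // Heap entry: (version, stream index, offset within that stream).
    using Entry = std::tuple<int64_t, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Seed the heap with the head of every non-empty stream.
    for (std::size_t s = 0; s < streams.size(); ++s) {
        if (!streams[s].empty()) heap.emplace(streams[s][0].version, s, std::size_t{0});
    }

    std::vector<VersionedMutation> merged;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        std::size_t s = std::get<1>(top);
        std::size_t i = std::get<2>(top);
        merged.push_back(streams[s][i]);
        // Advance within the stream the smallest element came from.
        if (i + 1 < streams[s].size()) heap.emplace(streams[s][i + 1].version, s, i + 1);
    }
    return merged;
}
```

A min-heap keeps the cost at O(n log k) for n mutations across k log routers, which matters because every remote tLog has to contact every log router.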
|
||||
|
||||
|
@ -147,34 +147,34 @@ Once a remote tLog collects and merge mutations from all log routers, it makes t
|
|||
Now the mutation m1 has arrived at the remote tLog, which is similar to its arrival at the primary tLog.
|
||||
|
||||
|
||||
At remote SSes.
|
||||
-----------------
|
||||
At remote SSes
|
||||
--------------
|
||||
|
||||
Similar to how primary SSes pull mutations from primary tLogs, each remote SS keeps pulling mutations, which have its tag, from remote tLogs. Once a remote SS makes mutations up to a version V1 durable, the SS pops its tag to the version V1 from all remote tLogs.
|
||||
|
||||
|
||||
Implementation
|
||||
=================
|
||||
==============
|
||||
|
||||
* proxy assigns tags to a mutation:
|
||||
|
||||
https://github.com/xumengpanda/foundationdb/blob/063700e4d60cd44c1f32413761e3fe7571fab9c0/fdbserver/MasterProxyServer.actor.cpp#L824
|
||||
https://github.com/apple/foundationdb/blob/7eabdf784a21bca102f84e7eaf14bafc54605dff/fdbserver/MasterProxyServer.actor.cpp#L1410
|
||||
|
||||
|
||||
Mutation Serialization (WiP)
|
||||
=================
|
||||
============================
|
||||
|
||||
This section will go into detail on how mutations are serialized as preparation for ingestion into the TagPartitionedLogSystem. This has also been covered at:
|
||||
|
||||
https://drive.google.com/file/d/1OaP5bqH2kst1VxD6RWj8h2cdr9rhhBHy/view.
|
||||
https://drive.google.com/file/d/1OaP5bqH2kst1VxD6RWj8h2cdr9rhhBHy/view
|
||||
|
||||
The proxy handles splitting transactions into their individual mutations. These mutations are then serialized and synchronously sent to multiple transaction logs.
|
||||
|
||||
The process starts in *commitBatch*. Eventually, *assignMutationsToStorageServers* is called to assign mutations to storage servers and serialize them. This function loops over each mutation in each transaction, determining the set of tags for the mutation (which storage servers it will be sent to), and then calling *LogPushData::writeTypedMessage* on the mutation.
|
||||
The process starts in *commitBatch*. Eventually, *assignMutationsToStorageServers* is called to assign mutations to storage servers and serialize them. This function loops over each mutation in each transaction, determining the set of tags for the mutation (which storage servers it will be sent to), and then calling *LogPushData.writeTypedMessage* on the mutation.
|
||||
|
||||
The *LogPushData* class is used to hold serialized mutations on a per-transaction-log basis. Its *messagesWriter* field holds one *BinaryWriter* per transaction log.
|
||||
|
||||
*LogPushData::writeTypedMessage* is the function that serializes each mutation and writes it to the correct binary stream to be sent to the corresponding transaction log. Each serialized mutation contains additional metadata about the message, with the format:
|
||||
*LogPushData.writeTypedMessage* is the function that serializes each mutation and writes it to the correct binary stream to be sent to the corresponding transaction log. Each serialized mutation contains additional metadata about the message, with the format:
|
||||
|
||||
.. image:: /images/serialized_mutation_metadata_format.png
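
The idea of keeping one output buffer per transaction log can be sketched with a toy class. Everything here is an assumption made for illustration: *ToyLogPushData* is not the real *LogPushData*, the tag-to-log rule (tag modulo the number of logs) is made up, and the byte layout (a 4-byte little-endian length followed by the payload) is not FDB's actual serialized format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Toy stand-in that holds one byte buffer per transaction log.
class ToyLogPushData {
public:
    explicit ToyLogPushData(std::size_t numLogs) : buffers_(numLogs) { assert(numLogs > 0); }

    // Appends one serialized mutation to the buffer of every log that at
    // least one of its tags maps to; each log receives the message once.
    void writeMessage(const std::vector<uint32_t>& tags, const std::string& mutation) {
        std::set<std::size_t> targets;
        for (uint32_t tag : tags) targets.insert(tag % buffers_.size()); // hypothetical mapping
        for (std::size_t log : targets) {
            std::string& buf = buffers_[log];
            appendU32(buf, static_cast<uint32_t>(mutation.size())); // illustrative length prefix
            buf += mutation;
        }
    }

    const std::string& bufferFor(std::size_t log) const { return buffers_.at(log); }

private:
    // Writes `v` as 4 little-endian bytes.
    static void appendU32(std::string& buf, uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8) {
            buf.push_back(static_cast<char>((v >> shift) & 0xFF));
        }
    }

    std::vector<std::string> buffers_;
};
```

Deduplicating the target logs before writing means a mutation whose tags map to the same log is serialized into that log's buffer only once.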
|
||||
|
||||
|
|
|
@ -30,6 +30,8 @@ These documents explain the engineering design of FoundationDB, with detailed in
|
|||
|
||||
* :doc:`read-write-path` describes how the FDB read and write paths work.
|
||||
|
||||
* :doc:`ha-write-path` describes how the FDB write path works in an HA setting.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:titlesonly:
|
||||
|
@ -48,3 +50,4 @@ These documents explain the engineering design of FoundationDB, with detailed in
|
|||
testing
|
||||
kv-architecture
|
||||
read-write-path
|
||||
ha-write-path
|
||||
|
|
|
@ -42,10 +42,10 @@ typedef uint32_t QueueID;
|
|||
|
||||
// Pager Events
|
||||
enum class PagerEvents { CacheLookup = 0, CacheHit, CacheMiss, PageWrite, MAXEVENTS };
|
||||
static const std::string PagerEventsCodes[] = { "Lookup", "Hit", "Miss", "Write" };
|
||||
static const std::string PagerEventsStrings[] = { "Lookup", "Hit", "Miss", "Write", "Unknown" };
|
||||
// Reasons for page level events.
|
||||
enum class PagerEventReasons { PointRead = 0, RangeRead, RangePrefetch, Commit, LazyClear, MetaData, MAXEVENTREASONS };
|
||||
static const std::string PagerEventReasonsCodes[] = { "Get", "GetR", "GetRPF", "Commit", "LazyClr", "Meta" };
|
||||
static const std::string PagerEventReasonsStrings[] = { "Get", "GetR", "GetRPF", "Commit", "LazyClr", "Meta", "Unknown" };
|
||||
|
||||
static const int nonBtreeLevel = 0;
|
||||
static const std::pair<PagerEvents, PagerEventReasons> possibleEventReasonPairs[] = {
|
||||
|
@ -167,6 +167,7 @@ public:
|
|||
virtual Future<Reference<const ArenaPage>> getPhysicalPage(PagerEventReasons reason,
|
||||
unsigned int level,
|
||||
LogicalPageID pageID,
|
||||
int priority,
|
||||
bool cacheable,
|
||||
bool nohit) = 0;
|
||||
virtual bool tryEvictPage(LogicalPageID id) = 0;
|
||||
|
@ -238,8 +239,9 @@ public:
|
|||
virtual Future<Reference<ArenaPage>> readPage(PagerEventReasons reason,
|
||||
unsigned int level,
|
||||
LogicalPageID pageID,
|
||||
bool cacheable = true,
|
||||
bool noHit = false) = 0;
|
||||
int priority,
|
||||
bool cacheable,
|
||||
bool noHit) = 0;
|
||||
virtual Future<Reference<ArenaPage>> readExtent(LogicalPageID pageID) = 0;
|
||||
virtual void releaseExtentReadLock() = 0;
|
||||
|
||||
|
|