############
Architecture
############

FoundationDB makes your architecture flexible and easy to operate. Your applications can send their data directly to FoundationDB or to a :doc:`layer<layer-concept>`, a user-written module that can provide a new data model, compatibility with existing systems, or even serve as an entire framework. In both cases, all data is stored in a single place via an ordered, transactional key-value API.

The following diagram details the logical architecture.

|image0|

Detailed FoundationDB Architecture
----------------------------------

The FoundationDB architecture uses a decoupled design, in which processes are assigned heterogeneous roles (e.g., Coordinators, Storage Servers, Master). The database is scaled by horizontally expanding the number of processes for each role:

Coordinators
~~~~~~~~~~~~

All clients and servers connect to a FoundationDB cluster with a cluster file, which contains the IP:PORT of each coordinator. Both clients and servers use the coordinators to connect with the cluster controller. Servers attempt to become the cluster controller if one does not exist, and register with the cluster controller once one has been elected. Clients use the cluster controller to keep an up-to-date list of GRV proxies and commit proxies.

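For reference, a cluster file is a single line of text in the form ``description:ID@IP:PORT,IP:PORT,...``. A minimal example with three coordinators might look like this (the addresses and identifiers are illustrative)::

   mycluster:Qf3lK9aZ@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500
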
Cluster Controller
~~~~~~~~~~~~~~~~~~

The cluster controller is a singleton elected by a majority of coordinators. It is the entry point for all processes in the cluster. It is responsible for determining when a process has failed, telling processes which roles they should assume, and passing system information between all of the processes.

Master
~~~~~~

The master is responsible for coordinating the transition of the write sub-system from one generation to the next. The write sub-system includes the master, GRV proxies, commit proxies, resolvers, and transaction logs. These roles are treated as a unit, and if any of them fails, a replacement for all of them is recruited. The master provides the commit versions for batches of mutations to the commit proxies.

Historically, Ratekeeper and Data Distributor were coupled with the Master on the same process. Since version 6.2, each has become a singleton in the cluster, with a lifetime no longer tied to the Master.

|image1|

GRV Proxies
~~~~~~~~~~~

The GRV proxies are responsible for providing read versions, communicating with Ratekeeper to control the rate at which read versions are given out. To provide a read version, a GRV proxy asks the master for the largest committed version at this point in time, while simultaneously checking that the transaction logs have not been stopped. Ratekeeper will artificially slow down the rate at which the GRV proxy provides read versions.

Commit Proxies
~~~~~~~~~~~~~~

The commit proxies are responsible for committing transactions, reporting committed versions to the master, and tracking the storage servers responsible for each range of keys.

Commits are accomplished by:

- Getting a commit version from the master.
- Using the resolvers to determine whether the transaction conflicts with previously committed transactions.
- Making the transaction durable on the transaction logs.

The key space starting with the ``\xff`` byte is reserved for system metadata. All mutations committed into this key space are distributed to all of the commit proxies through the resolvers. This metadata includes a mapping between key ranges and the storage servers which have the data for each range of keys. The commit proxies provide this information to clients on demand. Clients cache this mapping; if they ask a storage server for a key it does not have, they clear their cache and fetch a more up-to-date list of servers from the commit proxies.

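The cache-invalidation protocol is simple enough to sketch. The following Python fragment is illustrative only; the class and method names are hypothetical and this is not the real client's internals:

.. code-block:: python

   # Illustrative sketch of client-side shard-location caching
   # (hypothetical names; not the real FoundationDB client internals).

   class WrongShardError(Exception):
       """Raised by a storage server asked for a key it no longer owns."""

   class LocationCache:
       def __init__(self, commit_proxy):
           self.commit_proxy = commit_proxy
           self.by_range = {}   # (begin, end) -> list of storage servers

       def lookup(self, key):
           for (begin, end), servers in self.by_range.items():
               if begin <= key < end:
                   return (begin, end), servers
           return None, None

       def read(self, key, version):
           while True:
               rng, servers = self.lookup(key)
               if servers is None:
                   # Cache miss: fetch the mapping from a commit proxy.
                   rng, servers = self.commit_proxy.get_key_location(key)
                   self.by_range[rng] = servers
               try:
                   return servers[0].get(key, version)
               except WrongShardError:
                   del self.by_range[rng]   # stale entry; refetch and retry
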
Transaction Logs
~~~~~~~~~~~~~~~~

The transaction logs make mutations durable to disk for fast commit latencies. The logs receive commits from the commit proxy in version order, and only respond to the commit proxy once the data has been written and fsync'ed to an append-only mutation log on disk. Before the data is even written to disk, we forward it to the storage servers responsible for that mutation. Once the storage servers have made the mutation durable, they pop it from the log. This generally happens roughly 6 seconds after the mutation was originally committed to the log. We only read from the log's disk when the process has been rebooted. If a storage server has failed, mutations bound for that storage server will build up on the logs. Once data distribution makes a different storage server responsible for all of the missing storage server's data, we will discard the log data bound for the failed server.

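The push/pop cycle described above can be sketched as follows. This is a simplified illustration under assumed interfaces (``disk``, ``storage_servers``), not fdbserver code:

.. code-block:: python

   # Simplified sketch of a transaction log's push/pop cycle
   # (hypothetical interfaces; not fdbserver code).

   class TransactionLog:
       def __init__(self, disk, storage_servers):
           self.disk = disk                      # append-only mutation log
           self.storage_servers = storage_servers
           self.unpopped = []                    # (version, mutations) pairs

       def push(self, version, mutations):
           # Forward to the responsible storage servers before the disk
           # write completes, so they can apply mutations with low delay.
           for m in mutations:
               self.storage_servers.for_key(m.key).send(version, m)
           self.disk.append((version, mutations))
           self.disk.fsync()                     # only now ack the commit
           self.unpopped.append((version, mutations))
           return version                        # ack to the commit proxy

       def pop(self, up_to_version):
           # Called once storage servers have made mutations durable,
           # typically about 6 seconds after the original commit.
           self.unpopped = [(v, ms) for v, ms in self.unpopped
                            if v > up_to_version]
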
Resolvers
~~~~~~~~~

The resolvers are responsible for determining conflicts between transactions. A transaction conflicts if it reads a key that has been written between the transaction's read version and commit version. The resolver does this by holding the last 5 seconds of committed writes in memory and comparing a new transaction's reads against this set of commits.

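As a concrete illustration, here is a minimal and deliberately naive sketch of this optimistic conflict check; the real resolver uses an efficient version-annotated range data structure rather than a flat list:

.. code-block:: python

   # Minimal sketch of resolver-style conflict detection (illustrative;
   # not the actual resolver implementation).

   class Resolver:
       def __init__(self):
           # Committed write ranges from roughly the last 5 seconds,
           # stored as (commit_version, begin_key, end_key).
           self.recent_writes = []

       def resolve(self, read_version, commit_version,
                   read_ranges, write_ranges):
           # Conflict: a key this transaction read was overwritten after
           # its read version. Transactions arrive in commit-version
           # order, so every recorded write is below commit_version.
           for version, w_begin, w_end in self.recent_writes:
               if read_version < version:
                   for r_begin, r_end in read_ranges:
                       if r_begin < w_end and w_begin < r_end:  # overlap
                           return False   # abort with a conflict
           # No conflict: record this transaction's writes for later checks.
           for w_begin, w_end in write_ranges:
               self.recent_writes.append((commit_version, w_begin, w_end))
           return True
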
Storage Servers
~~~~~~~~~~~~~~~

The vast majority of processes in a cluster are storage servers. Storage servers are assigned ranges of keys and are responsible for storing all of the data for those ranges. They keep 5 seconds of mutations in memory and an on-disk copy of the data as of 5 seconds ago. Clients must read at a version within the last 5 seconds, or they will get a ``transaction_too_old`` error. The SSD storage engine stores the data in a B-tree based on SQLite. The memory storage engine stores the data in memory with an append-only log that is only read from disk if the process is rebooted. In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new *Redwood* engine.

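The 5-second MVCC window is easy to observe from a client. The sketch below, using the Python bindings, holds a single transaction open past the window; the key is arbitrary and the exact window length is a cluster default, so treat this as illustrative:

.. code-block:: python

   import time
   import fdb

   fdb.api_version(630)
   db = fdb.open()          # uses the default cluster file

   tr = db.create_transaction()
   print(tr.get_read_version().wait())   # the version reads below use
   print(tr.get(b'some-key').wait())     # read at that version

   time.sleep(6)                         # sleep past the ~5-second window
   try:
       print(tr.get(b'some-key').wait()) # version no longer readable
   except fdb.FDBError as e:
       print(e.code)                     # 1007: transaction_too_old
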
Data Distributor
~~~~~~~~~~~~~~~~

The data distributor manages the lifetime of storage servers, decides which storage server is responsible for which data range, and ensures data is evenly distributed across all storage servers (SS). The data distributor is a singleton in the cluster, recruited and monitored by the cluster controller. See the `internal documentation <https://github.com/apple/foundationdb/blob/master/design/data-distributor-internals.md>`__.

Ratekeeper
~~~~~~~~~~

Ratekeeper monitors system load and slows down the client transaction rate when the cluster is close to saturation by lowering the rate at which the GRV proxies provide read versions. Ratekeeper is a singleton in the cluster, recruited and monitored by the cluster controller.

Clients
~~~~~~~

A client links with specific language bindings (i.e., client libraries) in order to communicate with a FoundationDB cluster. The language bindings support loading multiple versions of the C library, allowing a client to communicate with clusters running older versions of FoundationDB. Currently, C, Go, Python, Java, and Ruby bindings are officially supported.

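As a brief illustration using the official Python bindings (the key and value here are arbitrary examples):

.. code-block:: python

   import fdb

   # The bindings require selecting an API version before first use;
   # 630 corresponds to the FoundationDB 6.3 API.
   fdb.api_version(630)

   # Open the database described by the default cluster file.
   db = fdb.open()

   # @fdb.transactional wraps the function in an automatic retry loop;
   # `tr` is the transaction the decorator creates and commits.
   @fdb.transactional
   def add_greeting(tr):
       tr[b'hello'] = b'world'

   add_greeting(db)
   print(db[b'hello'])  # b'world'; database reads run as one-off transactions
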
Transaction Processing
----------------------

A database transaction in FoundationDB starts with a client contacting one of the GRV proxies to obtain a read version, which is guaranteed to be larger than any commit version that the client may know about (even through side channels outside the FoundationDB cluster). This is needed so that a client will see the results of previous commits that have happened.

Then the client may issue multiple reads to storage servers and obtain values at that specific read version. Client writes are kept in local memory without contacting the cluster. By default, reading a key that was written in the same transaction will return the newly written value.

At commit time, the client sends the transaction data (all reads and writes) to one of the commit proxies and waits for a commit or abort response from the commit proxy. If the transaction conflicts with another one and cannot commit, the client may choose to retry the transaction from the beginning. If the transaction commits, the commit proxy also returns the commit version back to the client and to the master so that GRV proxies can learn the latest committed version. Note that this commit version is larger than the read version and is chosen by the master.

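This read, write, commit, retry lifecycle is what the language bindings' retry loops implement. A manual version in the Python bindings looks roughly like this (the key is an arbitrary example):

.. code-block:: python

   import fdb

   fdb.api_version(630)
   db = fdb.open()

   # A hand-written retry loop; @fdb.transactional generates the same shape.
   tr = db.create_transaction()
   while True:
       try:
           value = tr[b'my-key']         # reads happen at tr's read version
           tr[b'my-key'] = b'new-value'  # writes are buffered client-side
           tr.commit().wait()            # send reads+writes to a commit proxy
           break
       except fdb.FDBError as e:
           # on_error resets the transaction for another attempt; it
           # re-raises if the error is not retryable.
           tr.on_error(e).wait()
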
The FoundationDB architecture separates the scaling of client reads and writes (i.e., transaction commits). Because clients issue reads directly to sharded storage servers, reads scale linearly with the number of storage servers. Similarly, writes are scaled by adding more Commit Proxy, Resolver, and Log Server processes to the transaction system.

Determine Read Version
~~~~~~~~~~~~~~~~~~~~~~

When a client requests a read version from a GRV proxy, the GRV proxy asks the master for the latest committed version and checks that a set of transaction logs satisfying the replication policy is live. The GRV proxy then returns the maximum committed version as the read version to the client.

|image2|

The GRV proxy contacts the master for the latest committed version because the master is the central place that keeps track of the largest committed version across all commit proxies.

The reason for checking that a set of transaction logs satisfying the replication policy is live is to ensure that the GRV proxy has not been replaced by a newer generation of GRV proxies. This matters because the GRV proxy is a stateless role recruited in each generation. If a recovery has happened and the old GRV proxy is still live, that old GRV proxy could still give out read versions. As a result, a *read-only* transaction might see stale results (a read-write transaction would be aborted). By checking that a set of transaction logs satisfying the replication policy is live, the GRV proxy makes sure no recovery has happened, and thus the *read-only* transaction sees the latest data.

Note that the client cannot simply ask the master for read versions, because this approach would put more work on the master, and the master role cannot be scaled. Even though giving out read versions is not very expensive, it would still require the master to get a transaction budget from Ratekeeper, batch requests, and potentially maintain thousands of network connections to clients.

|image3|

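Putting the two checks together, the GRV path can be sketched as follows (hypothetical interfaces; not fdbserver code):

.. code-block:: python

   # Illustrative sketch of a GRV proxy serving a read version
   # (hypothetical interfaces; not fdbserver code).

   def get_read_version(master, tlogs, replication_policy):
       # Ask the master for the largest version committed by any commit proxy.
       committed_version = master.get_latest_committed_version()
       # Confirm a policy-satisfying set of this generation's transaction
       # logs is still live; otherwise a recovery may have replaced this
       # proxy's generation and it must not hand out stale read versions.
       if not tlogs.is_quorum_live(replication_policy):
           raise RuntimeError("proxy replaced by a newer generation")
       return committed_version
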
Transaction Commit
~~~~~~~~~~~~~~~~~~

A client transaction commits in the following steps (a pseudocode sketch follows the list):

1. A client sends a transaction to a commit proxy.
2. The commit proxy asks the master for a commit version.
3. The master sends back a commit version that is higher than any commit
   version seen before.
4. The commit proxy sends the read and write conflict ranges to the
   resolver(s) with the commit version included.
5. The resolver responds back with whether the transaction has any
   conflicts with previous transactions by sorting transactions
   according to their commit versions and computing if such a serial
   execution order is conflict-free.

   - If there are conflicts, the commit proxy responds back to the client
     with a ``not_committed`` error.
   - If there are no conflicts, the commit proxy sends the mutations and
     commit version of this transaction to the transaction logs.

6. Once the mutations are durable on the logs, the commit proxy responds
   success to the user.

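The same flow in sketch form (hypothetical interfaces; not fdbserver code):

.. code-block:: python

   # Illustrative sketch of the commit path above (hypothetical
   # interfaces; not fdbserver code).

   def commit(master, resolvers, tlogs, txn):
       # Steps 2-3: obtain a commit version higher than any seen before.
       commit_version = master.get_commit_version()
       # Steps 4-5: each resolver checks its share of the conflict ranges.
       ok = all(r.resolve(txn.read_version, commit_version,
                          txn.read_ranges_for(r), txn.write_ranges_for(r))
                for r in resolvers)
       if not ok:
           return "not_committed"          # client may retry from the start
       # Step 6: make the mutations durable before acknowledging.
       tlogs.push(commit_version, txn.mutations).wait()
       return commit_version
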
Note that the commit proxy sends each resolver only its respective key ranges; if any one of the resolvers detects a conflict, then the transaction is not committed. This has the flaw that if only one of the resolvers detects a conflict, the other resolvers will still think the transaction has succeeded and may fail future transactions with overlapping write conflict ranges, even though those future transactions could commit. In practice, a well-designed workload will have only a very small percentage of conflicts, so this amplification will not affect performance. Additionally, each transaction has a five-second window: after five seconds, resolvers remove the conflict ranges of old transactions, which also limits the chance of this type of false conflict.

|image4|

|image5|

Background Work
~~~~~~~~~~~~~~~

Several kinds of background work happen alongside transaction processing:

- **Ratekeeper** collects statistics from GRV proxies, commit proxies,
  transaction logs, and storage servers and computes the target
  transaction rate for the cluster.

- **Data distribution** monitors all storage servers and performs load
  balancing operations to evenly distribute data among all storage
  servers.

- **Storage servers** pull mutations from transaction logs and write them
  into the storage engine to persist them on disk.

- **Commit proxies** periodically send empty commits to transaction logs
  to keep commit versions increasing, in case there are no
  client-generated transactions.

|image6|

Transaction System Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The transaction system implements the write pipeline of the FoundationDB cluster, and its performance is critical to transaction commit latency. A typical recovery takes about a few hundred milliseconds, but longer recovery times (usually a few seconds) can happen. Whenever there is a failure in the transaction system, a recovery process is performed to restore the transaction system to a new configuration, i.e., a clean state. Specifically, the Master process monitors the health of the GRV Proxies, Commit Proxies, Resolvers, and Transaction Logs. If any one of the monitored processes fails, the Master process terminates. The Cluster Controller will detect this event and then recruit a new Master, which coordinates the recovery and recruits a new transaction system instance. In this way, transaction processing is divided into a number of epochs, where each epoch represents a generation of the transaction system with its unique Master process.

For each epoch, the Master initiates recovery in several steps. First, the Master reads the previous transaction system states from the Coordinators and locks the coordinated states to prevent another Master process from recovering at the same time. Then the Master recovers the previous transaction system states, including all Log Servers' information, stops these Log Servers from accepting transactions, and recruits a new set of GRV Proxies, Commit Proxies, Resolvers, and Transaction Logs. After the previous Log Servers are stopped and the new transaction system is recruited, the Master writes the coordinated states with the current transaction system information. Finally, the Master accepts new transaction commits. See details in this `documentation <https://github.com/apple/foundationdb/blob/master/design/recovery-internals.md>`__.

Because GRV Proxies, Commit Proxies, and Resolvers are stateless, their recoveries require no extra work. In contrast, Transaction Logs save the logs of committed transactions, and we need to ensure that all previously committed transactions are durable and retrievable by storage servers. That is, for any transaction for which the Commit Proxies may have sent back a commit response, its logs are persisted on multiple Log Servers (e.g., three servers if the replication degree is 3).

Finally, a recovery will *fast forward* time by 90 seconds, which aborts any in-progress client transactions with a ``transaction_too_old`` error. On retry, these client transactions will find the new generation of the transaction system and commit.

A note on the ``commit_result_unknown`` error: if a recovery happens while a transaction is committing (i.e., a commit proxy has sent mutations to the transaction logs), the client receives a ``commit_result_unknown`` error and then retries the transaction. It is completely permissible for FDB to commit both the first attempt and the retry, because ``commit_result_unknown`` means the transaction may or may not have committed. This is why it is strongly recommended that transactions be idempotent, so that they handle ``commit_result_unknown`` correctly.

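To illustrate the difference (with a hypothetical key layout), the first transaction below is safe to run twice after a ``commit_result_unknown``, while the second double-counts:

.. code-block:: python

   import struct
   import fdb

   fdb.api_version(630)
   db = fdb.open()

   # Idempotent: setting a key to a fixed value yields the same state
   # whether the transaction commits once or twice.
   @fdb.transactional
   def record_payment(tr, payment_id, amount):
       tr[b'payment/' + payment_id] = struct.pack('<q', amount)

   # NOT idempotent: if the first attempt actually committed and the
   # retry commits again, the counter is incremented twice.
   @fdb.transactional
   def bump_counter(tr):
       tr.add(b'counter', struct.pack('<q', 1))  # atomic add
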
Resources
---------

`Forum Post <https://forums.foundationdb.org/t/technical-overview-of-the-database/135/26>`__

`Existing Architecture Documentation <https://github.com/apple/foundationdb/blob/master/documentation/sphinx/source/kv-architecture.rst>`__

`Summit Presentation <https://www.youtube.com/watch?list=PLbzoR-pLrL6q7uYN-94-p_-Q3hyAmpI7o&v=EMwhsGsxfPU&feature=emb_logo>`__

`Data Distribution Documentation <https://github.com/apple/foundationdb/blob/master/design/data-distributor-internals.md>`__

`Recovery Documentation <https://github.com/apple/foundationdb/blob/master/design/recovery-internals.md>`__

.. |image0| image:: images/Architecture.png
.. |image1| image:: images/architecture-1.jpeg
.. |image2| image:: images/architecture-2.jpeg
.. |image3| image:: images/architecture-3.jpeg
.. |image4| image:: images/architecture-4.jpeg
.. |image5| image:: images/architecture-5.jpeg
.. |image6| image:: images/architecture-6.jpeg