92 lines
5.4 KiB
Markdown
92 lines
5.4 KiB
Markdown
|
# Flow Transport
|
|||
|
|
|||
|
This section describes the design and implementation of the flow transport wire protocol (as of release 6.3).
|
|||
|
|
|||
|
## ConnectPacket
|
|||
|
|
|||
|
The first bytes sent over a tcp connection in flow are the `ConnectPacket`.
|
|||
|
This is a variable length message (though fixed length at a given protocol
|
|||
|
version) designed with forward and backward compatibility in mind. The expected length of the `ConnectPacket` is encoded as the first 4 bytes (unsigned, little-endian). Upon receiving an incoming connection, a peer reads the `ProtocolVersion` (the next 8 bytes unsigned, little-endian. The most significant 4 bits encode flags and should be zeroed before interpreting numerically.) from the `ConnectPacket`.
|
|||
|
|
|||
|
## Protocol compatibility
|
|||
|
|
|||
|
Based on the incoming connection's `ProtocolVersion`, this connection is either
|
|||
|
"compatible" or "incompatible". If this connection is incompatible, then we
|
|||
|
will not actually look at any bytes sent after the `ConnectPacket`, but we will
|
|||
|
keep the connection open so that the peer does not keep trying to open new
|
|||
|
connections.
|
|||
|
|
|||
|
If this connection is compatible, then we know that our peer is using the same wire protocol as we are and we can proceed.
|
|||
|
|
|||
|
## Framing and checksumming protocol
|
|||
|
|
|||
|
As of release 6.3, the structure of subsequent messages is as follows:
|
|||
|
|
|||
|
* For TLS connections:
|
|||
|
1. packet length (4 bytes unsigned little-endian)
|
|||
|
2. token (16 opaque bytes that identify the recipient of this message)
|
|||
|
3. message contents (packet length - 16 bytes to be interpreted by the recipient)
|
|||
|
* For non-TLS connections, there's additionally a crc32 checksum for message integrity:
|
|||
|
1. packet length (4 bytes unsigned little-endian)
|
|||
|
2. 4 byte crc32 checksum of token + message
|
|||
|
3. token
|
|||
|
4. message
|
|||
|
|
|||
|
## Well-known endpoints
|
|||
|
|
|||
|
Endpoints are a pair of a 16 byte token that identifies the recipient and a
|
|||
|
network address to send a message to. Endpoints are usually obtained over the
|
|||
|
network - for example a request conventionally includes the endpoint the
|
|||
|
reply should be sent to (like a self-addressed stamped envelope). So if you
|
|||
|
can send a message and get endpoints in reply you can start sending messages
|
|||
|
those endpoints. But how do you send that first message?
|
|||
|
|
|||
|
That's where the concept of a "well-known" endpoint comes in. Some endpoints
|
|||
|
(for example the endpoints coordinators are listening on) use "well-known"
|
|||
|
tokens that are agreed upon ahead of time. Technically the value of these
|
|||
|
tokens could be changed as part of an incompatible protocol version bump, but
|
|||
|
in practice this hasn't happened and shouldn't ever need to happen.
|
|||
|
|
|||
|
## Flatbuffers
|
|||
|
|
|||
|
Prior to release-6.2 the structure of messages (e.g. how many fields a
|
|||
|
message has) was implicitly part of the protocol version, and so adding a
|
|||
|
field to any message required a protocol version bump. Since release-6.2
|
|||
|
messages are encoded as flatbuffers messages, and you can technically add
|
|||
|
fields without a protocol version bump. This is a powerful and dangerous tool
|
|||
|
that needs to be used with caution. If you add a field without a protocol version bump, then you can no longer be certain that this field will always be present (e.g. if you get a message from an old peer it might not include that field.)
|
|||
|
We don't have a good way to test two or more fdbserver binaries in
|
|||
|
simulation, so we discourage adding fields or otherwise making any protocol
|
|||
|
changes without a protocol version bump.
|
|||
|
|
|||
|
Bumping the protocol version is costly for clients though, since now they need a whole new libfdb_c.so to be able to talk to the cluster _at all_.
|
|||
|
|
|||
|
## Stable Endpoints
|
|||
|
|
|||
|
Stable endpoints are a proposal to allow protocol compatibility to be checked
|
|||
|
per endpoint rather than per connection. The proposal is to commit to the
|
|||
|
current (release-6.3) framing protocol for opening connections, and allow a
|
|||
|
newer framing protocol (for example a new checksum) to be negotiated after
|
|||
|
the connection has been established. This way even if peers are at different
|
|||
|
protocol versions they can still read the token each message is addressed to,
|
|||
|
and they can use that token to decide whether or not to attempt to handle the
|
|||
|
message. By default, tokens will have the same compatibility requirements as
|
|||
|
before where the protocol version must match exactly. But new tokens can
|
|||
|
optionally have a different policy - e.g. handle anything from a protocol
|
|||
|
version >= release-7.0.
|
|||
|
|
|||
|
One of the main features motivating "Stable Endpoints" is the ability to download a compatible libfdb_c from a coordinator.
|
|||
|
|
|||
|
### Changes to flow transport for Stable Endpoints
|
|||
|
|
|||
|
1. Well known endpoints must never change (this just makes it official)
|
|||
|
2. The (initial) framing protocol must remain fixed. If we want to change the checksum, we can add a stable, well known endpoint that advertises what checksums are supported and use this to change the checksum after the connection has already been established.
|
|||
|
3. Each endpoint can have a different compatibility policy: e.g. an endpoint can be marked as requiring at least `ProtocolVersion::withStableInterfaces()` like this:
|
|||
|
|
|||
|
```
|
|||
|
ReplyPromise<ProtocolInfoReply> reply{ PeerCompatibilityPolicy{ RequirePeer::AtLeast,
|
|||
|
ProtocolVersion::withStableInterfaces() } };
|
|||
|
```
|
|||
|
|
|||
|
4. Well known endpoints no longer need to be added in a particular order. Instead you reserve the number of well known endpoints ahead of time and then you can add them in any order.
|