toKill1 and toKill2 are a random subset of all processes. If simply kill all processes in toKill1 or toKill2,
we may kill too many processes to make the cluster unavailable and stuck.
Similar as what toKill2 were modified if it can cause cluster unavailable,
we should do the same thing for toKill1
Which fixes a compilation error due to a circular dependency between
flow.a and fdbrpc.a. However, this is now done at the cost of newNet2
users have to remember to add Net2FileSystem::stop() as a callback.
This requires the certificate chain to load successfully, otherwise
fdbcli will error out at an earlier point due to Net2 not being able to
configure TLS.
With TLS, a worker (or process) can have a TLS address and non-TLS address.
When a process is created in simulation, the primary address is TLS by default.
The non-TLS one is the TLS address port plus one.
In a connection between two workers, if their primary addresses do not enable
or disable TLS together, one worker will swap its primary address and secondary address
so that the TLS config of the two endpoints can match.
The swap can make the primary address no longer the TLS one that was created
when the process is created. And the swap only happens for worker instead of
process struct in simulation.
This swap can cause worker->address != process->address.
In checkForExtraDataStores actor, we use worker->address to check if a process
is killable and use the process->address to kill the process. The inconsistency
can cause simulation to kill a protected process that is not killable and leads
to simulation failure.
The idea being that we keep around a TLSConfig that the configuration
that the user has provided, and then when we want to intialize an SSL
context, we ask the TLSConfig to load all certificates and return us a
LoadedTLSConfig that is a concrete set of certificate bytes in memory.
initTLS now just takes the in-memory bytes and applies them to the ssl
context.
This is a large refactor to lead up into certificate refeshing, where we
will periodically check for changes to the certificates, and then
re-load them and apply them to a new SSL context.
trackLeakedConnection actor should give server enough time to
close its connection due to idle connection.
The current logic waits for at least 24 seconds to detect and close
an idle connection.
The current trackLeakedConnection actor waits for about 30 seconds
to claim LeakedConnection error.
We increase the delay in trackLeakedConnection actor to avoid
false positive error in simulation test.
Co-authored by: Vishesh Yadav
* This will allow client to continue monitoring peer connections while
connection stays open, so that there is no period of "uncertainity"
without previous no-monitoring approach.
* Use multiplier for incoming connection idle timeout
* Update idle connection timeout values and leaked connection timeout in
simulator.
This patch does two changes to connection monitoring:
1. Connection monitoring at client side will check if the connection
has been stayed idle for some time. If connection is unused for a
while, we close the connection. There is some weirdness involved here
as ping messages are by themselves are connection traffic. We get over
this by making it two-phase process, first being checking idle
reliable traffic, followed by disabling pings and then checking for
idle unreliable traffic.
2. Connection monitoring of clients from server will no longer send
pings to clients. Instead, it keep monitor the received bytes and
close after certain period of inactivity.
This commit includes functionality to turn on
the object serializer for network communication.
This is done the following way:
- On incoming connections, a process will detect
whether the client supports the object serializer
and will only serialize responses with it, if it does
- On outgoing connections, the command line flag is used
to determine whether the object serializer should be used
to send data.
This way, a cluster can run in mixed mode. To upgrade one
can upgrade one process at a time and set the flag one process
at a time.
This is how this is tested on the simulator:
- The command line flag can take three options: on, off,
and random.
- For off, the object serializer will never we used.
- For on, the object serializer will be always used.
- For random, the simulator will flip a coin for each
process it starts up.
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats string. These are the main
changes:
- TraceEvent::detail now takes a c-string instead of std::string for
literals. This prevents unnecessary allocations if the trace is not
going to be printed in the first place (for example for SevDebug).
Before that `detail` expected a `std::string` as key, which mean that
any string literal would be copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
specialized for any type that needs to be printed. The actual
formatting will be deferred to after the `enabled` check. This
provides two benefits: (1) if a TraceEvent is disabled, we don't pay
for the formatting and (2) TraceEvent can trace types that it doesn't
know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
passed. This will make sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` will be inlined. So for disabled TraceEvent
calls, a call to detail will only introduce a if-branch which is much
cheaper than a function call.
- NetworkAddress now contains IPAddress object which can be either
IPv4 or IPv6 address. 128bits are used even for IPv4 addresses,
however only 32bits are used when using/serializing IPv4 address.
- ConnectPacket is updated to store IPv6 address. Backward compatible
with old format since the first 32bits of IP address field is used
for serialization of IPv4.
- Mainly updates rest of the code to use IPAddress structure instead
of plain uint32_t.
- IPv6 address/pair ports should be represented as `[ip]:port` as per
convention. This applies to both cluster files and command line
arguments.
* log_version in the database (`/conf/log_version`) is now a hint that gets
rounded to the nearest supported version.
* fdbcli and FDB enforce that only a valid log_version can be configured to
* TLogVersion is persisted in CoreTLogSet (and LogSet and TLogSet)
* Some comments here and there
* Add an assert on filename length to make sure KV-pairs in filename
don't exceed a maximum length.
The simulator uses a hash table to cache all open files to make sure
that several simulated processes don't open the file more than once.
This currently doesn't work properly and deleted files are often kept
open forever. As a result, we often ran out of file descriptors.
The problem is luckily quite simple: files are often opened with an
absolute path but later a relativ path is passed for deletion. This
is not working because the map that is used to store the file
descriptors is not aware of paths - so deleted files are often not
removed from this map. The fix that works for us is to just always
work with absolute paths when adding and removing files from this map.
Sim2Listener can now take the network address to listen on. This is
used to listen to multiple ports in simulator and test the patch
which added multiple network addresses to single endpoint.
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorompiled source through clang.
self-moves are frowned upon in C++, and in our code this generally happens from
calls to swap as part of trying to implement a "unordered erase" function via
swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now
offered that performs this, while disallowing self-moves.
canKillProcess logic was wrong.
We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
BlobStoreEndpoint now only accepts hostnames and an optional service, so this update is not compatible with the previous URL formats having many IP addresses.
Changed killMachine and killDataCenter interface to return final killtype
Updated TESTs for DataCenter to ensure that DataCenter was killed
Added assertion to ensure that failed DC kills were not downgrades
Added method for identifying acceptable availability process classes
Extended cluster availability function to ensure coordinators can be auto configured
Fixed availability function to allow protected processes to be considered as dead if not available
Added debug trace events for providing machine state when considering availability
Added trace event for protected coordinators