Which is to proceed past Net2 creation, and allow certificate refresh to
try and eventually load valid certs. Additionally, fix certificate
refeshing dieing if the certificate is not readable when first called.
In testing, I also found and fixed an issue where if a cert went from
unreadable to readable, we wouldn't reload the TLS context, due to not
considering it as a file change.
I.e., do not allow the same version's mutations saved in different files.
Otherwise, we may have a file only contain a version's partial data, causing
continuity analysis of mutation logs to fail. This could also cause restore
failures, if the target version's mutations are stored in two files.
In the above description, all mutation logs refer to the same tag's logs.
The master waits for all backup worker recruitment done and then set them in a
batch. However, a backup worker could remove itself before the master sets it.
As a result, the worker is not removed and oldest backup epoch can't advance,
and TLog can't be popped.
For the first mutation log of a backup, we need to true-up its begin version to
the exact version of the first mutation. This is needed to ensure the strict
less than relationship between two mutation logs, if one's version range is
within the other.
A problematic scenario is as follows:
Epoch 1: a mutation log A [200, 900] is saved, but its progress is NOT saved.
Epoch 2: master recruits a worker for [1, 1000], 1000 is epoch 1's end version.
New worker saves a mutation log B [100, 1000]
A's range is strict within B's range, but A's size is larger than B.
This happens because B's start version is true-up to the backup's begin version,
which is not the actual version of the first mutation. After B's begin version
is true-up to 300, we won't have this issue.
This fixes the consistency check failure, where saving progress commits new
transactions. Pop is performed by the NOOP loop in monitorBackupKeyOrPullData.
When the Master recruits a backup worker for previous epochs, the Master may
set the begin version to a very low number, because the backup progress for
that epoch is not saved. This can cause problem for the log file, since these
low versions have been popped.
The fix here is to advance savedVersion to the minimum of backup's starting
version if it is higher than the begin version set by the Master. This is safe
because these versions are not popped. If they are popped, their progress should
already be recorded and Master would use a higher version than the backup's
starting version.
The oldest epoch the master gets can assume its begin version is 1, which can
be wrong. In this case, we use the saved backup progress to "true-up" the real
begin version.
Which fixes a compilation error due to a circular dependency between
flow.a and fdbrpc.a. However, this is now done at the cost of newNet2
users have to remember to add Net2FileSystem::stop() as a callback.
There were similar TraceEvents in the FDBLibTLS/LibreSSL TLS
implementaiton that were accidentally dropped in the TLS rewrite.
This makes it so that one does not have to use magic to figure out if a
process was configued with TLS correctly when some of the settings come
from environment variables.
This will stop eio threads for both the client (`fdb_stop_network()`)
and the server. This change is being done more for the former, but I
don't see any harm in doing the latter as well.
If client code initiates an FDB operation to a TLS cluster, and then
immediately exits the main thread, then OpenSSL's atexit handler would
potentially run while the network thread is attempting to do TLS
operations, and thus crash.
This commit removes the OpenSSL atexit hander, and instead relies on a
client intentionally ending the network thread to do TLS cleanup. If
the client code exits without stopping the network thread, then we'll
never free OpenSSL data structures, which is the safer thing to do.
We recently witnessed (using tsan) the main thread exiting without first
joining the network thread, and this caused data races and
heap-use-after-free's
Now the lifetime of these globals will be tied to the network thread
itself (and I guess every thread, but the one that actually uses memory
will be owned by the network thread.)