fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date.
fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited
The usedIds is updated by master registration request, which populates the
usedIds map. However, this request may contain processes that cluster controller
is not aware, i.e., not in id_worker map.
This is ok until I added tracing the usedIds, which silently insert an empty
entry into id_worker map for the unknown process. This new entry can cause
crashing failure when trying to access its LocalityData.
Remove AsyncTrigger for usedIds, and change to serverInfo->onChange.
Use const & to avoid unnecessary copies in WorkerInterface's LocalityData
and getExtraTLogEligibleMachines().
The quite database can fail to send out requests and report timeout. This seems
to be caused by reusing a request that uses the same ReplyPromise. Another bug
is Proxy can wait for unneeded time for a dabase change, while the distributor
is already known to itself.
The setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because the AsyncVar::set triggers the other loop to run
first, which will wait on Never(). The correct code should wait on the Future
returned by the waitFailureClient.
This allows cluster controller to know data distributor during worker
registration phase, thus avoiding recruiting a new data distributor after
starting.
Also change the worker to skip creating a new data distributor if there is
already one running on the worker, which can trigger operation timeout in tests.
This fixed a bug found by upgrade test, where the configuration monitor of the
data distributor was monitoring excludedServersVersionKey, which doesn't
change in ChangeConfig workload. As a result, data distributor was not aware of
configuration changes.
Adding this new key and make sure this key is updated in configuration changes
so that the monitor can detect configuration changes.
After controller starts one, it will wait for that one and ignore any rejoins
received later.
Add remoteRecovered() to data distribution for remote team collection.
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.
TODO: move DataDistributorInterface into ServerDBInfo.
Let cluster controller to start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it failed.
The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for
the new DD.
Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
Fixed a couple of bugs
1) A rare race condition where a worker is being roles even after it died.
2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.
- This patch will make FDB listen to multiple addresses given via
command line. Although, we'll still use first address in most places,
this patch starts using vector<NetworkAddress> in Endpoint at some basic
places.
- When sending packets to an endpoint, pick a random network address in
endpoints
- Renames Endpoint::address to Endpoint::addresses since it
now holds a vector of addresses.
Extend `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll
have multiple IP:PORT pairs.
This patch simply adds the field and makes changes to compile the
codebase. The first element of of `address` field is used everywhere.
Hence the way we talk to remains same with this patch.
NOTE:
Directly accessing the first memeber of Endpoint::address is unsafe
as Endpoint() doesn't enforces non-empty address list. However, since
the correctness test pass for now and are anyway replacing all those
unsafe accesses with ones considering the whole vector, this patch
ignores to access them in safe way.