buildTeam() can create teams that include undesired storage servers, i.e., servers
considered unhealthy. As a result, data movement can become stuck.
Fix this by adding an ACTOR, monitorHealthyTeams, that attempts to build a team
every second whenever there are no healthy teams (sketched below).
Clean up storageServerTracker() interface.
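A minimal sketch of such a monitor in flow's ACTOR style; the member and helper
names (zeroHealthyTeams, doBuildTeams, checkBuildTeams) are assumptions about the
DDTeamCollection interface, not the exact code:

    // Sketch only: while there are no healthy teams, retry team building once per
    // second; otherwise just sleep until zeroHealthyTeams changes.
    ACTOR Future<Void> monitorHealthyTeams(DDTeamCollection* self) {
        loop choose {
            when(wait(self->zeroHealthyTeams->get() ? delay(1.0) : Never())) {
                self->doBuildTeams = true;       // ask the team builder to run
                wait(checkBuildTeams(self));     // hypothetical entry point to buildTeam()
            }
            when(wait(self->zeroHealthyTeams->onChange())) {
                // re-evaluate whether the periodic retry is still needed
            }
        }
    }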
When zeroHealthyTeams signals and a storage server becomes healthy, buildTeam can
be attempted before the ServerStatusMap is updated. As a result, the newly healthy
server is not available for use. Fix by delaying buildTeam until after the status
map has been updated.
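A sketch of the intended ordering, with illustrative names (server_status and
checkBuildTeams are assumptions): publish the updated status first, yield, and only
then trigger team building so the builder observes the updated map.

    ACTOR Future<Void> onServerBecameHealthy(DDTeamCollection* self, UID serverId, ServerStatus status) {
        self->server_status.set(serverId, status);  // publish the new status first
        wait(delay(0));                             // yield so readers see the updated map
        self->doBuildTeams = true;
        wait(checkBuildTeams(self));                // hypothetical entry point to buildTeam()
        return Void();
    }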
This bug was introduced in cee23ee3. During a configuration change, the data
distributor is restarted, which destroys the previous DDTeamCollection and cancels
all of its teamTracker() actors. In this case, even though the healthy team count
reaches 0, there is no need to rebuild teams. The bug is triggered when a team
rebuild is attempted after the DDTeamCollection has already been destroyed.
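One way to make such a rebuild impossible by construction, sketched with
illustrative member names (not necessarily the actual fix): keep the monitor's
Future inside the DDTeamCollection so that destroying the collection cancels the
monitor along with the teamTracker() actors.

    struct DDTeamCollection {
        std::vector<Future<Void>> actors;  // teamTracker()s, monitorHealthyTeams(), ...
        // When the collection is destroyed, these Futures are dropped and the
        // corresponding actors are cancelled, so no rebuild can run afterwards.
    };

    void startTeamMonitors(DDTeamCollection* self) {
        self->actors.push_back(monitorHealthyTeams(self));
    }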
When moving keys to a team, if one of the servers in the target team dies, the
move can become stuck. This is because DDTeamCollection waits for all data
movement involving the failed server to complete. However, because the movement
has not finished yet, checking the database suggests there are no keys associated
with this server and that it is safe to proceed. In reality, only the in-memory
structure knows about the pending movement, i.e., the unfinished move still
attributes some keys to the failed server, so the server cannot be removed yet.
Fix by adding a check against the in-memory structure in waitForAllDataRemoved().
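A sketch of the strengthened check; noKeysAssignedInDatabase is a hypothetical
helper, and the shardsAffectedByTeamFailure accessor is an assumption about the
in-memory structure:

    ACTOR Future<Void> waitForAllDataRemoved(Database cx, UID serverID, DDTeamCollection* teams) {
        loop {
            // The server is only safe to remove when the database shows no keys
            // assigned to it AND the in-memory shard tracking shows no pending
            // movement that still attributes shards to it.
            state bool dbEmpty = wait(noKeysAssignedInDatabase(cx, serverID));  // hypothetical helper
            if (dbEmpty && teams->shardsAffectedByTeamFailure->getNumberOfShards(serverID) == 0) {
                return Void();
            }
            wait(delay(1.0));  // an unfinished move may still land keys on this server; re-check later
        }
    }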
Use const& to optimize a few function parameters.
The usedIds map is populated by the master registration request. However, this
request may contain processes that the cluster controller is not aware of, i.e.,
processes not in the id_worker map.
This was fine until tracing of usedIds was added, which silently inserts an empty
entry into the id_worker map for each unknown process. This new entry can cause a
crash when its LocalityData is accessed.
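The underlying pitfall is std::map::operator[], which default-constructs and
inserts an entry for a missing key. A self-contained illustration with simplified
stand-in types (not the real cluster controller structures):

    #include <cstdio>
    #include <map>
    #include <string>

    struct WorkerInfo { std::string locality; };   // simplified stand-in

    int main() {
        std::map<int, WorkerInfo> id_worker;       // known workers only
        int unknownProcess = 42;                   // reported by registration, never seen before

        // operator[] silently inserts an empty WorkerInfo for the unknown key...
        std::printf("locality: '%s'\n", id_worker[unknownProcess].locality.c_str());
        std::printf("map size: %zu\n", id_worker.size());   // now 1, holding a bogus entry

        // ...whereas find() leaves the map untouched and lets us skip unknown processes.
        if (id_worker.find(43) == id_worker.end()) {
            std::printf("process 43 unknown, skipping\n");
        }
        return 0;
    }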
Remove AsyncTrigger for usedIds, and change to serverInfo->onChange.
Use const & to avoid unnecessary copies in WorkerInterface's LocalityData
and getExtraTLogEligibleMachines().
The quiet database check can fail to send out requests and then report a timeout.
This seems to be caused by reusing a request object, and thus the same
ReplyPromise. Another bug is that the proxy can wait unnecessarily for a database
change even though it already knows the data distributor.
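The ReplyPromise issue can be illustrated in isolation: a promise is single-shot,
so a retry must construct a fresh request rather than resend the old object.
PingRequest and the endpoint below are made-up stand-ins, not the actual
quiet-database requests:

    struct PingRequest {
        ReplyPromise<Void> reply;
        template <class Ar>
        void serialize(Ar& ar) { serializer(ar, reply); }
    };

    ACTOR Future<Void> pingWithRetry(RequestStream<PingRequest> endpoint) {
        loop {
            // A fresh request per attempt means a fresh ReplyPromise; reusing one
            // request object would reuse an already-consumed promise, and the
            // retry's reply would never be delivered.
            state PingRequest req;
            try {
                wait(timeoutError(endpoint.getReply(req), 5.0));
                return Void();
            } catch (Error& e) {
                if (e.code() != error_code_timed_out)
                    throw;
                wait(delay(1.0));
            }
        }
    }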
setDistributor() sets an AsyncVar and then runs waitFailureClient. This ordering
is wrong because AsyncVar::set triggers the other loop to run first, and that loop
then waits on Never(). The correct code should wait on the Future returned by
waitFailureClient.
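A sketch of the corrected ordering (the AsyncVar type and the waitFailure endpoint
on DataDistributorInterface are assumptions): obtain the failure Future first,
publish the interface, then wait on that Future rather than on Never().

    ACTOR Future<Void> trackDistributor(AsyncVar<Optional<DataDistributorInterface>>* distributor,
                                        DataDistributorInterface di) {
        state Future<Void> failed = waitFailureClient(di.waitFailure, /*reactionTime*/ 1.0);
        distributor->set(di);    // may immediately wake loops watching the AsyncVar
        wait(failed);            // NOT Never(): observe the distributor's failure
        distributor->set(Optional<DataDistributorInterface>());   // clear on failure
        return Void();
    }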
This allows the cluster controller to learn about the data distributor during the
worker registration phase, thus avoiding recruiting a new data distributor after
startup.
Also change the worker to skip creating a new data distributor if one is already
running on the worker, which could otherwise trigger operation timeouts in tests.
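The worker-side guard might look roughly like this; WorkerData, runningDistributor,
and dataDistributor() are illustrative names, not the actual interfaces:

    DataDistributorInterface getOrStartDistributor(WorkerData* self) {
        if (self->runningDistributor.present()) {
            // A data distributor is already running on this worker: reuse it instead
            // of starting a second one that tests might never hear back from.
            return self->runningDistributor.get();
        }
        DataDistributorInterface di;                 // freshly initialized interface
        self->runningDistributor = di;
        self->addActor.send(dataDistributor(di));    // hypothetical distributor entry point
        return di;
    }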
This fixes a bug found by the upgrade test, where the configuration monitor of the
data distributor was monitoring excludedServersVersionKey, which does not change
in the ChangeConfig workload. As a result, the data distributor was not aware of
configuration changes.
Add a new key and make sure it is updated on every configuration change so that
the monitor can detect such changes.
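The monitor can then watch a key that every configuration change bumps. A sketch
using an FDB watch; configChangeKey is a placeholder name for the new key:

    ACTOR Future<Void> monitorConfigurationChange(Database cx, PromiseStream<Void> configChanged) {
        loop {
            state ReadYourWritesTransaction tr(cx);
            try {
                // configChangeKey: placeholder for the key bumped by every configuration change
                state Future<Void> watchFuture = tr.watch(configChangeKey);
                wait(tr.commit());          // the watch takes effect once the transaction commits
                wait(watchFuture);          // fires when a configuration change updates the key
                configChanged.send(Void());
            } catch (Error& e) {
                wait(tr.onError(e));
            }
        }
    }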
After the cluster controller starts a data distributor, it waits for that one and
ignores any rejoins received later.
Add remoteRecovered() to data distribution for remote team collection.
Found in tests: a move keys conflict exception was not handled because no one
waited on the Future object. As a result, the data distributor did not die, and
the database check could not get the metric and kept retrying until timeout.
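The general pattern, sketched with illustrative names: every background Future
whose errors matter must eventually be waited on, for example through an
actorCollection, so that a thrown move keys conflict propagates instead of being
dropped.

    ACTOR Future<Void> dataDistribution(Reference<AsyncVar<ServerDBInfo>> db) {
        state PromiseStream<Future<Void>> addActor;
        state Future<Void> collection = actorCollection(addActor.getFuture());

        addActor.send(someMoveKeysWork(db));        // hypothetical background work

        try {
            wait(collection);                       // errors from any background actor surface here
        } catch (Error& e) {
            TraceEvent("DataDistributionDied").error(e);
            throw;                                  // let the caller tear down / restart the distributor
        }
        return Void();
    }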