Whenever you use the selectReplicas function, be careful: it may have a bug.
The bug is that it always returns false (unable to find candidates)
when the storage team size is 1. This is wrong: when the storage team size
is 1, selectReplicas should succeed and return an empty result.
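Below is a minimal sketch of the intended behavior, assuming a simplified
signature (the real selectReplicas works on replication policies and
locality sets; ServerId and the parameter names here are hypothetical):

    #include <cassert>
    #include <vector>

    struct ServerId { int id; };

    // Try to pick (teamSize - alreadyChosen) additional servers from
    // candidates. Returns true on success, appending the picks to results.
    bool selectReplicas(int teamSize,
                        const std::vector<ServerId>& alreadyChosen,
                        const std::vector<ServerId>& candidates,
                        std::vector<ServerId>& results) {
        int needed = teamSize - (int)alreadyChosen.size();
        // The fix: a team size of 1 (or any fully satisfied request) needs
        // zero additional replicas, so succeed with an empty result instead
        // of reporting that no candidates were found.
        if (needed <= 0) return true;
        if ((int)candidates.size() < needed) return false;
        for (int i = 0; i < needed; ++i) results.push_back(candidates[i]);
        return true;
    }

    int main() {
        std::vector<ServerId> chosen = { {1} };
        std::vector<ServerId> candidates, results;
        // Storage team size 1: previously returned false; should return
        // true with an empty result.
        assert(selectReplicas(1, chosen, candidates, results) && results.empty());
        return 0;
    }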
When team collection adds new server teams, it picks the server with the
least number of teams. We should count only the healthy teams, because the
unhealthy ones will not be useful (see the sketch below).
Team collection should prioritize building machine teams for the machine
that has the least number of healthy machine teams, rather than the least
number of machine teams overall, because an unhealthy machine team cannot
produce more server teams.
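The selection rule for both the server-team and machine-team cases can be
sketched as follows, with hypothetical stand-in types (the real logic
lives in the team collection code): compare candidates by their healthy
team count only.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Machine {
        int id;
        int healthyMachineTeams; // teams whose members are all healthy
        int totalMachineTeams;   // healthy + unhealthy
    };

    // Pick the machine that most needs a new machine team: count only
    // healthy teams, since an unhealthy machine team cannot produce more
    // server teams.
    const Machine& pickMachineToExtend(const std::vector<Machine>& machines) {
        return *std::min_element(machines.begin(), machines.end(),
                                 [](const Machine& a, const Machine& b) {
                                     return a.healthyMachineTeams < b.healthyMachineTeams;
                                 });
    }

    int main() {
        std::vector<Machine> machines = { {1, 3, 5}, {2, 2, 6}, {3, 4, 4} };
        // Machine 2 has the most teams in total but the fewest healthy
        // ones; counting all teams would have picked machine 3 instead.
        std::printf("pick machine %d\n", pickMachineToExtend(machines).id);
        return 0;
    }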
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams that each server and machine belongs to
DESIRED_TEAMS_PER_SERVER server teams and machine teams, respectively.
This change calculates the number of teams (server teams and machine
teams) that must be built to reach that count for each server and machine.
For example, suppose we have 3 servers with replication factor 3. Only 1
team is possible, but the desired team number is 3 servers times 5 teams
per server, which equals 15.
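A sketch of that arithmetic, using the knob value from the example above
(the variable names and the cap at the number of buildable teams are
illustrative):

    #include <cstdio>

    // Number of distinct teams of size r that n servers can form: C(n, r).
    long long choose(int n, int r) {
        if (r < 0 || r > n) return 0;
        long long result = 1;
        for (int i = 1; i <= r; ++i) result = result * (n - r + i) / i;
        return result;
    }

    int main() {
        const int DESIRED_TEAMS_PER_SERVER = 5; // knob value in the example
        int serverCount = 3;
        int replicationFactor = 3;

        // Desired total: 3 servers * 5 teams per server = 15.
        long long desiredTeams = (long long)DESIRED_TEAMS_PER_SERVER * serverCount;

        // Only C(3, 3) = 1 team is possible, so the builder must stop at 1
        // rather than trying to reach 15.
        long long buildable = choose(serverCount, replicationFactor);
        long long teamsToBuild = desiredTeams < buildable ? desiredTeams : buildable;
        std::printf("desired=%lld buildable=%lld build=%lld\n",
                    desiredTeams, buildable, teamsToBuild);
        return 0;
    }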
Instead of sanity-checking the absolute number of teams per server, we
check the difference between minServerTeamOnServer and
maxServerTeamOnServer.
Add a simulation test that makes sure the number of server teams per
server is no less than DESIRED_TEAMS_PER_SERVER defined in knobs and no
larger than MAX_TEAMS_PER_SERVER.
Add a similar test for the number of machine teams per machine as well.
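A sketch of both checks; the knob values, the MAX_TEAM_SPREAD bound, and
the relation MAX_TEAMS_PER_SERVER = 5 * DESIRED_TEAMS_PER_SERVER are
illustrative assumptions, not necessarily the shipped defaults:

    #include <algorithm>
    #include <cassert>
    #include <vector>

    const int DESIRED_TEAMS_PER_SERVER = 5;
    const int MAX_TEAMS_PER_SERVER = 5 * DESIRED_TEAMS_PER_SERVER;
    const int MAX_TEAM_SPREAD = DESIRED_TEAMS_PER_SERVER; // hypothetical bound

    int main() {
        // Server team count per server, as the test would gather it.
        std::vector<int> teamsPerServer = {5, 6, 5, 7};

        int minServerTeamOnServer =
            *std::min_element(teamsPerServer.begin(), teamsPerServer.end());
        int maxServerTeamOnServer =
            *std::max_element(teamsPerServer.begin(), teamsPerServer.end());

        // Sanity check the spread, not the absolute count on any server.
        assert(maxServerTeamOnServer - minServerTeamOnServer <= MAX_TEAM_SPREAD);

        // Simulation test bounds: every server ends up with at least the
        // desired number of teams and no more than the maximum allowed.
        assert(minServerTeamOnServer >= DESIRED_TEAMS_PER_SERVER);
        assert(maxServerTeamOnServer <= MAX_TEAMS_PER_SERVER);
        return 0;
    }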
Since Ratekeeper and DataDistributor no longer run with the Master, they
might be placed on stateful processes before a new Master becomes alive,
which is undesirable.
This PR adds monitoring of both Ratekeeper and DataDistributor at the
Cluster Controller: if the Master runs on a stateless class and RK/DD run
at a worse class, then RK/DD will be killed. I.e., RK/DD should run on
their own classes or on the same stateless process as the Master. After a
restart, RK/DD should be running on a better process class.
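A simplified sketch of the kill condition, assuming a toy fitness
ordering (FDB's ProcessClass has a richer Fitness model; the names below
are illustrative):

    #include <cstdio>

    enum Fitness { BestFit = 0, GoodFit = 1, OkayFit = 2, WorstFit = 3 };

    // If the Master runs on a stateless-class process and RK/DD run at a
    // worse fitness, kill RK/DD so they restart on a better process class
    // (their own class, or the same stateless class as the Master).
    bool shouldKillRole(Fitness masterFitness, Fitness roleFitness,
                        bool masterOnStatelessClass) {
        return masterOnStatelessClass && roleFitness > masterFitness;
    }

    int main() {
        // DD sits on a worse-fit (stateful) process while the Master is on
        // a stateless class, so DD should be killed and restarted.
        std::printf("kill DD: %d\n", shouldKillRole(GoodFit, WorstFit, true));
        return 0;
    }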
Add a flag in HealthMetrics to indicate that batch priority is rate
limited. The data distributor pulls this flag from the proxy to know
roughly when rate limiting happens.
DD uses this information to decide when to rebalance in the background,
i.e., move data from heavily loaded servers to lightly loaded ones. If the
cluster is currently rate limited for batch commits, the rebalance uses a
longer polling interval; otherwise it uses a shorter one. See
BgDDMountainChopper() and BgDDValleyFiller() in
DataDistributionQueue.actor.cpp.
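A sketch of the interval choice; the batchLimited field name mirrors the
description above, and the interval values are illustrative rather than
actual knob values:

    #include <cstdio>

    struct HealthMetrics {
        bool batchLimited = false; // set when batch priority is rate limited
    };

    // Background rebalance (BgDDMountainChopper / BgDDValleyFiller) polls
    // more slowly while batch commits are rate limited, faster otherwise.
    double rebalancePollingInterval(const HealthMetrics& hm) {
        const double longInterval = 10.0;  // seconds, illustrative
        const double shortInterval = 1.0;  // seconds, illustrative
        return hm.batchLimited ? longInterval : shortInterval;
    }

    int main() {
        HealthMetrics hm;
        hm.batchLimited = true;
        std::printf("poll every %.1f s\n", rebalancePollingInterval(hm));
        return 0;
    }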
Add a new role for Ratekeeper.
Remove StorageServerChanges from data distribution.
Ratekeeper now monitors storage servers itself, borrowing the idea from
DataDistribution.
After we add the new data distributor role, the data related to the data
distributor and ratekeeper is published through the new role (and new
worker). So status needs to contact the data distributor, instead of the
master, to get this status information.
In addTeam(), to determine whether a team is a badTeam, we should check
redundantTeam before checking satisfiesPolicy, because if a team is a
redundantTeam, it was already removed from the system before addTeam() was
called. The only reason we call addTeam() for a removed redundantTeam is
to kick off the badTeam cleanup logic.
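A sketch of the order of checks, using stand-in types for the real
addTeam() context:

    #include <cstdio>

    struct Team {
        bool redundantTeam;    // team was already removed as redundant
        bool satisfiesPolicy;  // team meets the replication policy
    };

    bool isBadTeam(const Team& team) {
        // Check redundantTeam first: a redundant team has already been
        // removed from the system, and addTeam() is only called on it to
        // kick off the badTeam cleanup logic, regardless of whether it
        // still satisfies the policy.
        if (team.redundantTeam) return true;
        return !team.satisfiesPolicy;
    }

    int main() {
        Team removed{true, true};
        // A removed redundant team is bad even though it satisfies the
        // policy; checking satisfiesPolicy first would miss it.
        std::printf("bad: %d\n", isBadTeam(removed));
        return 0;
    }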
When we remove a machine team in the teamRemover function, we should
always be able to find that machine team in the global machineTeams.
Change the ASSERT to enforce this invariant.
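A sketch of the strengthened invariant, with illustrative types (the real
code uses FDB's ASSERT macro rather than assert):

    #include <algorithm>
    #include <cassert>
    #include <vector>

    struct MachineTeam { int id; };

    std::vector<MachineTeam> machineTeams = { {1}, {2}, {3} };

    void removeMachineTeam(int teamId) {
        auto it = std::find_if(machineTeams.begin(), machineTeams.end(),
                               [teamId](const MachineTeam& t) { return t.id == teamId; });
        // Invariant: any machine team handed to teamRemover must still be
        // present in the global machineTeams, so the lookup always succeeds.
        assert(it != machineTeams.end());
        machineTeams.erase(it);
    }

    int main() {
        removeMachineTeam(2);
        assert(machineTeams.size() == 2);
        return 0;
    }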