We do NOT enforce an ordering between removing a machine team and removing
the server teams on that machine team.
This keeps the code logic clear.
When a storage server's locality changes, we first remove the server (and its
machine if needed) before handling the server team removal and addition.
We do not actively remove a machine team when it has no server team on it.
But since adding a server team may also add a machine team, we need to be
careful that server team creation does not push the number of machine teams
above the desired number.
So whenever a server team is removed, we should check whether the teamRemover
should be kicked in.
When the number of machines changes due to a machine removal event, the
desired number of machine teams changes. We then need to make sure the
teamRemover actor is running to clean up the redundant teams.
getTeam is called very frequently and does not create a new team, so there is
no need to call teamRemover in getTeam.
teamRemover should be called only when a new team may be added.
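A minimal plain-C++ sketch of that trigger condition; the struct and names
below are illustrative stand-ins, not the actual DDTeamCollection members:

    #include <cstddef>

    // Illustrative only: decide whether the teamRemover should be kicked off.
    struct TeamCounts {
        size_t machineTeams;        // machine teams currently built
        size_t desiredMachineTeams; // derived from the number of healthy machines
    };

    // Called after an event that may have added a team or changed the desired
    // count (e.g., a machine was removed); getTeam() never calls this because
    // it never creates a new team.
    bool shouldRunTeamRemover(const TeamCounts& c) {
        return c.machineTeams > c.desiredMachineTeams;
    }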
buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, data movement can become stuck.
Fix this by adding an ACTOR, monitorHealthyTeams, that builds teams every
second whenever there are no healthy teams.
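A rough plain-C++ sketch of the monitorHealthyTeams idea; the real version is
a flow ACTOR inside data distribution, and the callables here are
hypothetical stand-ins:

    #include <chrono>
    #include <functional>
    #include <thread>

    // While there are no healthy teams, retry buildTeam roughly once per second.
    void monitorHealthyTeams(const std::function<bool()>& hasHealthyTeam,
                             const std::function<void()>& buildTeam,
                             const std::function<bool()>& stopped) {
        while (!stopped()) {
            if (!hasHealthyTeam()) {
                buildTeam(); // may create teams from the currently healthy servers
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }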
Clean up storageServerTracker() interface.
When zeroHealthyTeams signals and a storage server becomes healthy, we could
attempt buildTeam before the ServerStatusMap is updated. As a result, the
healthy server is not available for use. Fix by delaying buildTeam until
after the status map is updated.
This bug was introduced in cee23ee3. During a configuration change, the data
distributor is restarted, which destroys the previous DDTeamCollection and
cancels all previous teamTracker() actors. In this case, even though the
healthy team count reaches 0, there is no need to rebuild teams. The bug is
triggered when we try to rebuild teams after the DDTeamCollection has already
been destroyed.
When moving keys to a team, if one of the servers in the target team dies,
the move can become stuck. This is because the DDTeamCollection waits for all
data movement off the failed server to complete. However, because the
movement has not finished yet, checking the database tells us there are no
keys associated with this server and that it is safe to go ahead. In reality,
only the in-memory structure knows there is pending movement, i.e., the
unfinished move still attributes some keys to the failed server. Thus, the
server can't be removed yet. Fix by also checking the in-memory structure in
waitForAllDataRemoved().
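A sketch of the combined check, using illustrative names; the real fix
consults the in-memory shard map inside waitForAllDataRemoved():

    #include <cstdint>

    // A failed server may only be removed when BOTH the database and the
    // in-memory shard tracking agree that it owns no data; checking the
    // database alone misses moves that are still in flight.
    bool safeToRemoveServer(int64_t keysInDatabaseForServer,
                            int64_t inMemoryShardsAssignedToServer) {
        return keysInDatabaseForServer == 0 && inMemoryShardsAssignedToServer == 0;
    }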
Use const& to optimize a few function parameters.
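A generic illustration of the pattern (not one of the actual signatures
touched by this change):

    #include <string>
    #include <vector>

    // Pass-by-value would copy the vector on every call; a const reference
    // avoids the copy for a read-only parameter.
    int countLongNames(const std::vector<std::string>& names) {
        int n = 0;
        for (const auto& s : names)
            if (s.size() > 8) ++n;
        return n;
    }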
The quiet database check can fail to send out requests and report a timeout.
This seems to be caused by reusing a request that uses the same ReplyPromise.
Another bug is that the proxy can wait unnecessarily long for a database
change even though it already knows the distributor.
setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because AsyncVar::set triggers the other loop to run first,
which then waits on Never(). The correct code should wait on the Future
returned by waitFailureClient.
This fixes a bug found by the upgrade test, where the configuration monitor
of the data distributor was monitoring excludedServersVersionKey, which
doesn't change in the ChangeConfig workload. As a result, the data
distributor was not aware of configuration changes.
Add this new key and make sure it is updated on configuration changes so that
the monitor can detect them.
After the controller starts one, it waits for that one and ignores any
rejoins received later.
Add remoteRecovered() to data distribution for remote team collection.
Found in tests: a move key conflict exception was not handled because the
Future object was not waited on by anyone. As a result, the data distributor
did not die, and the database check couldn't get the metric and kept trying
until timeout.
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.
TODO: move DataDistributorInterface into ServerDBInfo.
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If the DataDistributor (DD) doesn't join for a while, the ClusterController
(CC) tries to recruit one as DD. The CC also monitors the DD and recruits a
new one if it fails.
The Proxy also monitors the DD; if the DD fails, the Proxy asks the CC for
the new DD.
Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
Added three knobs to control the team remover:
bool TR_FLAG_DISABLE_TEAM_REMOVER:
Disable the teamRemover actor
double TR_REMOVE_MACHINE_TEAM_DELAY:
Wait for the specified time before trying to remove the next machine team
double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY:
Wait before checking if all machines are healthy
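A sketch of how the knobs could be grouped; the default values shown here are
placeholders, not the committed ones:

    // Illustrative defaults only; the real knobs live with the server knobs.
    struct TeamRemoverKnobs {
        bool   TR_FLAG_DISABLE_TEAM_REMOVER = false;          // turn the actor off entirely
        double TR_REMOVE_MACHINE_TEAM_DELAY = 60.0;           // seconds between removal attempts
        double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY = 60.0; // settle time before the health check
    };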
When we remove a server due to server failure, we need to
remove the related server teams AND remove the server team from
the machine team.
In the previous commit, we forgot to remove the server team from
the machine team.
1) Reduce the frequency of checking if we need to call teamRemover
2) Improve code efficiency in finding the machine team to remove
3) Remove unused code
4) Add sanity check
DESIRED_TEAMS_PER_MACHINE must equal DESIRED_TEAMS_PER_SERVER.
Otherwise, we may have too few machine teams to create enough server teams.
Note that the BUGGIFY macro value is based on a random number generator:
when you have two BUGGIFYs, one may be true while the other is false.
Also fix a bug in getting the number of healthy machine teams.
When the total number of teams is larger than the desired number, we should
gracefully remove the redundant teams so that the number of teams stays low
and the possibility of losing data remains extremely low even when multiple
racks fail at the same time.
Magnify the possibility that the number of created machine teams is larger
than the number of desired machine teams if we do NOT try to remove the
surplus machine teams.
This helps test the upgrade to machine teams in FDB 6.1.
Call the traceTeamCollectionInfo function to record the team numbers
when we add a team directly from the shard information, instead of
using addTeamsBestOf logic.
The current simulator does not validate whether the number of teams in the
system is larger than the maximum desired number of teams. This validation
should be added because we do NOT want too many teams in the system, which
may impede the system's availability when multiple fault zones (e.g.,
machines) crash at the same time.
This commit adds the check to the consistency check in simulation.
Since the current code does not handle the upgrade situation when we enforce
machine teams, the test is expected to fail. A later commit will handle the
upgrade situation by gracefully removing the surplus teams.
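A sketch of the kind of bound the consistency check can assert; the real
check reads the knobs and the live team counts:

    #include <cstddef>

    // With serverCount storage servers and a per-server cap, the total number
    // of server teams should never exceed serverCount * maxTeamsPerServer.
    bool teamCountWithinLimit(size_t serverCount,
                              size_t serverTeamCount,
                              size_t maxTeamsPerServer) {
        return serverTeamCount <= serverCount * maxTeamsPerServer;
    }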
- This patch makes FDB listen to multiple addresses given via the command
line. Although we'll still use the first address in most places, this patch
starts using vector<NetworkAddress> in Endpoint in some basic places.
- When sending packets to an endpoint, pick a random network address from
the endpoint's address list.
- Renames Endpoint::address to Endpoint::addresses since it
now holds a vector of addresses.
Extend the `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint we'll have multiple IP:PORT pairs
instead of one IP:PORT.
This patch simply adds the field and makes the changes needed to compile the
codebase. The first element of the `address` field is used everywhere, so
the way we talk to an endpoint remains the same with this patch.
NOTE:
Directly accessing the first member of Endpoint::address is unsafe, as
Endpoint() doesn't enforce a non-empty address list. However, since the
correctness tests pass for now and we are anyway replacing all those unsafe
accesses with ones that consider the whole vector, this patch does not yet
access them in a safe way.
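A simplified sketch (not the real classes) of the shape of the change: a
vector of addresses, a random pick when sending, and the still-unchecked
access to the first element:

    #include <cassert>
    #include <cstdlib>
    #include <string>
    #include <vector>

    struct NetworkAddress { std::string ip; int port; }; // simplified stand-in

    struct Endpoint {
        // Was a single NetworkAddress; now a list of addresses for the endpoint.
        std::vector<NetworkAddress> addresses;

        // Most call sites still talk to the first address.  Nothing enforces a
        // non-empty list yet, so this access is unsafe until callers are fixed.
        const NetworkAddress& primaryAddress() const { return addresses[0]; }

        // When sending a packet, pick one of the addresses at random.
        const NetworkAddress& randomAddress() const {
            assert(!addresses.empty());
            return addresses[std::rand() % addresses.size()];
        }
    };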
Further improve code efficiency by:
1) Avoiding rebuilding the machine locality map when machine locality
changes. This may leave the global machine locality map stale, which is OK
as long as we do not use the global map to validate that a machine team
follows the locality policy.
2) Using ASSERT_WE_THINK instead of ASSERT to avoid runtime overhead.
ASSERT_WE_THINK only validates the condition in simulation mode.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Make sure the link between server and machine is updated on both the server
and the machine.
Rename the function to better reflect its functionality.
Signed-off-by: Meng Xu <meng_xu@apple.com>
A server's locality may change from one machine to another.
This affects the old machine and machine team the server was on, and the new
machine the server moves to.
Signed-off-by: Meng Xu <meng_xu@apple.com>
We only create machine teams of the correct size.
When the configuration (e.g., team size) is changed, the DDTeamCollection is
destroyed and rebuilt so that this invariant is not violated.
Based on the invariant, we can count the number of machine teams more
quickly.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The addAllTeams function can be replaced with the new addTeamsBestOf
function by passing a large enough number of teams to build.
Remove the addAllTeams function and update the related unit tests.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The buggify option may set the knob parameters (DESIRED_TEAMS_PER_SERVER and
MAX_TEAMS_PER_SERVER) to 1. When this happens, the number of machine teams
to build will be less than what we want, which prevents us from building
enough server teams.
To avoid this problem, we build machine teams before we call addTeamsBestOf
to build server teams.
We also add an ASSERT to ensure we build enough machine teams and server
teams in the test case.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Improve code efficiency with the following changes:
1) Change an always-true if-statement to an ASSERT;
2) Return when we are confident we will not find more machine teams.
No functionality change.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Due to the randomness in choosing a server, we cannot guarantee finding all
teams. The NotEnoughServers test case may create false-positive bug reports
in the correctness test.
Try addTeamsBestOf() multiple times when we cannot find an available team
due to the pure randomness in choosing the server teams.
The changes to the unit test reduce false positives in the simulation test
results.
Relax the assert condition in the random unit test.
Due to the randomness in choosing the machine team and the server team from
the machine team, it is possible that we may not find the last few (e.g., 1
or 2) available teams. For example, if there are at most 10 teams available
and we have already found 9, the chance of finding the last one is low with
pure random selection.
It is OK not to find every available team because
1) In reality, we only create a small fraction of the available teams, and
2) In a practical system, this situation only happens when most servers are
*temporarily* unhealthy. When it happens, we abandon all existing teams and
restart team building from scratch.
In simulation, this situation happens 100 times out of 128613 test cases
when we run RandomUnitTests.txt only.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Calculate the number of machine teams in the same way as we calculate the
number of server teams.
Only count the machine teams that have the correct size and are healthy.
Simplify code by removing an unnecessary check.
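A sketch of the counting rule with illustrative stand-in types:

    #include <cstddef>
    #include <vector>

    struct MachineTeamInfo {
        size_t machineCount; // machines in this team
        bool   healthy;      // all member machines currently healthy
    };

    // Count only machine teams of the configured size whose machines are all
    // healthy, mirroring how healthy server teams are counted.
    size_t countHealthyMachineTeams(const std::vector<MachineTeamInfo>& teams,
                                    size_t configuredTeamSize) {
        size_t n = 0;
        for (const auto& t : teams)
            if (t.machineCount == configuredTeamSize && t.healthy) ++n;
        return n;
    }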
Signed-off-by: Meng Xu <meng_xu@apple.com>
Pick the server team purely randomly instead of picking the least used one.
This avoids creating correlation in the server teams we pick when new
machines are added.
The logic (sketched below) is:
First pick one random least-used server as the chosen server;
then pick a machine team that contains that server;
then pick a server on each machine in the machine team,
making sure the chosen server is picked.
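A simplified sketch of that selection, using stand-in types and random
choices where the text leaves the choice open:

    #include <cstdlib>
    #include <string>
    #include <vector>

    struct Machine     { std::string id; std::vector<std::string> servers; };
    struct MachineTeam { std::vector<Machine> machines; };

    // The chosen (least used) server is fixed by the caller; a machine team
    // containing it is picked, and one server is picked at random from every
    // other machine in that team.
    std::vector<std::string> pickServerTeam(const std::string& chosenServer,
                                            const std::vector<MachineTeam>& teamsWithChosenServer) {
        std::vector<std::string> serverTeam;
        if (teamsWithChosenServer.empty()) return serverTeam;
        const MachineTeam& mt =
            teamsWithChosenServer[std::rand() % teamsWithChosenServer.size()];
        for (const Machine& m : mt.machines) {
            if (m.servers.empty()) continue;
            bool hasChosen = false;
            for (const auto& s : m.servers)
                if (s == chosenServer) hasChosen = true;
            // Always keep the chosen server; otherwise pick purely at random.
            serverTeam.push_back(hasChosen ? chosenServer
                                           : m.servers[std::rand() % m.servers.size()]);
        }
        return serverTeam;
    }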
Signed-off-by: Meng Xu <meng_xu@apple.com>
Before we build server teams, we build the desired number of machine teams.
Then we pick the least used server, from which we pick the least used
machine team.
Then we pick the least used server on each machine in that machine team to
get the server team.
Note: the logic for building machine teams should be independent of server
teams.
Signed-off-by: Meng Xu <meng_xu@apple.com>
When handling GetTeam, the data distribution actor may have zero teams in
rare situations in the ConfigureTest.txt test.
We should return an empty team in this situation instead of triggering an
error.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Resolve code review comments:
1) Improve code efficiency by avoiding unnecessary map searches and
unnecessary checks
2) Remove or comment out trace events when they can be spammy
3) Improve coding style
Tested for 1 hour and no error was found.
The KillRegionCycle.txt test was excluded because the existing code cannot
pass that test either.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The current server team collection logic does not consider the fact that
multiple storage servers can run on the same machine.
When multiple machines fail, all servers on those machines fail, and the
possibility of having one process team fail and lose data is very high.
To reduce the possibility of losing data when multiple machines fail, we
first create machine teams that span different fault zones; we then create
server teams based on machine teams by first picking 1 machine team, and
then picking 1 server from each machine in the machine team.
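A sketch of the machine-team invariant with stand-in types; the real code
selects machines via the locality policy, and this only checks the resulting
property:

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    struct MachineInfo { std::string machineId; std::string faultZone; };

    // A machine team of the requested size must draw every machine from a
    // different fault zone (e.g., a different rack).
    bool isValidMachineTeam(const std::vector<MachineInfo>& team, size_t teamSize) {
        if (team.size() != teamSize) return false;
        std::set<std::string> zones;
        for (const auto& m : team) zones.insert(m.faultZone);
        return zones.size() == team.size(); // all fault zones distinct
    }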
Signed-off-by: Meng Xu <meng_xu@apple.com>
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
There's never any reason to save the value of a Void return, and it's
the easiest source of redefined variable bugs that will creep back in
over time. So just `wait(...)`, it's cleaner that way.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
Self-moves are frowned upon in C++, and in our code they generally come from
calls to swap as part of implementing an "unordered erase" via
swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now
offered that performs this while disallowing self-moves.
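A sketch of the idea (not necessarily the exact signature added here):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Erase element i in O(1) by moving the last element into its slot,
    // avoiding a self-move when i already is the last element.
    template <class T>
    void swapAndPop(std::vector<T>& v, size_t i) {
        if (i + 1 != v.size())
            v[i] = std::move(v.back()); // no self-move when i is the last index
        v.pop_back();
    }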
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
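A sketch of the two naming rules as standalone checks (illustrative, not the
actual check wired into simulation):

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Detail names: start with an uppercase letter, no underscores.
    bool isValidDetailName(const std::string& name) {
        if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
        return name.find('_') == std::string::npos;
    }

    // Event type names: same rule, but one underscore is allowed and the
    // character after it must also be uppercase (e.g., Context_Type).
    bool isValidEventTypeName(const std::string& name) {
        if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
        std::size_t us = name.find('_');
        if (us == std::string::npos) return true;
        if (name.find('_', us + 1) != std::string::npos) return false; // at most one
        return us + 1 < name.size() &&
               std::isupper(static_cast<unsigned char>(name[us + 1]));
    }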
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking