Further improve code efficiency by
1) Avoid rebuild machine locality map when machine locality is changed.
This may leave the global machine locality map stale.
This is ok as long as we do not use the global map to validate
the machine team follows the locality policy.
2) Use ASSERT_WE_THINK instead of ASSERT to avoid runtime overhead.
ASSERT_WE_THINK will only validate the condition in simulation mode.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Make sure the link between server and machine is updated
in both server and machine.
Rename function name to better reflect its functionality.
Signed-off-by: Meng Xu <meng_xu@apple.com>
A server locality may change from one machine to another.
This affects the old machine and machine team the server is on, and
the new machine the server moves to.
Signed-off-by: Meng Xu <meng_xu@apple.com>
We only create correct size machine teams.
When configuration (e.g., team size) is changed,
the DDTeamCollection will be destroyed and rebuilt
so that the invariant will not be violated.
Based on the invariant, we can count the number of
machine teams more quickly.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The addAllTeams function can be replaced with the new addTeamsBestOf
function by passing a large enough number of teams to build.
Remove addAllTeams function and update the related unit tests.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The buggify option may set 1 to the knob parameters
(DESIRED_TEAMS_PER_SERVER and MAX_TEAMS_PER_SERVER).
When this happens, the number of machine teams to build will be
less than what we want, which prevents us from building enough
server teams.
To avoid this problem, we build machine teams before
we call addTeamsBestOf to build server teams.
We also add the ASSERT to ensure we build enough machine teams and
server teams in the test case.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Improve code efficiency with the following changes:
1) Change always-true if-statement to ASSERT;
2) Return when we are confident we will not find more machine teams.
No functionality change.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Due to the randomness in choosing a server, we cannot gurantee to
find all teams. The NotEnoughServers test case may create false positive
bug report in the correctness test.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Try multiple times of addTeamsBestOf() when we cannot find an available team
due to the pure randomness in choosing the server teams.
The changes for the unit test reduces the false positive in the simulation test results.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Relax the assert condition on the random unit test.
Due to the randomness in choosing the machine team and
the server team from the machine team, it is possible that
we may not find the remaining several (e.g., 1 or 2) available teams.
For example, there are at most 10 teams available, and we have found
9 teams, the chance of finding the last one is low
when we do pure random selection.
It is ok to not find every available team because
1) In reality, we only create a small fraction of available teams, and
2) In practical system, this situation only happens when most of servers
are *temporarily* unhealthy. When this situation happens, we will
abandon all existing teams and restart the build team from scratch.
In simulation test, the situation happens 100 times out of 128613 test cases
when we run RandomUnitTests.txt only.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Calculate the number of machine teams in the same way
as we calculate the number of server teams.
Only count the machine teams that has the correct size and is healthy.
Simplify code by removing unnecessary check.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Pick server team purely randomly instead of picking the least used one.
This is to avoid creating correlation in the server teams we pick when
new machines are added.
The logic is:
First pick the one random least used server as chosen server;
Then pick a machine team that has the server;
Then pick a server on each machine in the machine team.
We make sure the chosen server is picked.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Before we build server teams, we build the desired number of machine teams.
Then we pick the least used server, from which we pick the least used machine team.
Then we pick the least used server on each machine in the least used machine team to get the server team.
Note: The logic of building machine teams should be independent from server teams.
Signed-off-by: Meng Xu <meng_xu@apple.com>
When we GetTeam, the data distribution actor may have zero teams in
rare situation in the ConfigureTest.txt test.
We should return an empty team in this situation instead of triggering error.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Resolve code review comments:
1) Improve the code efficiency by avoiding unnecessary map search
and avoiding unnecessary checking
2) Remove or comment out trace events when they can be spammy
3) Improve coding style
Tested for 1 hour and no error was found.
KillRegionCycle.txt test was excluded from the test because
existing code cannot pass that test either
Signed-off-by: Meng Xu <meng_xu@apple.com>
Current server team collection logic does not consider
the fact that multipe storage servers can run on the same machine.
When multiple machines fail, all servers on the machines will fail, and
the possibility of having one process team fail and lose data is very high.
To reduce the possibility of losing data when multiple machine fails,
we first create machine teams which span across different fault zones;
we then create server teams based on machine teams by
first picking 1 machine team, and then
picking 1 server from each machine in the machine team.
Signed-off-by: Meng Xu <meng_xu@apple.com>