For example, with 3 servers and a replication factor of 3, we can form only 1 team,
but the desired number of teams is 3 * 5 = 15.
Instead of sanity-checking the absolute number of teams per server, we check
the difference between minServerTeamOnServer and maxServerTeamOnServer.
Add a simulation test that makes sure the number of server teams
per server is no less than the desired_teams_per_server defined
in the knobs and no larger than the max_teams_per_server.
Add a similar test for the number of machine teams per machine, as sketched below.
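A minimal sketch of the balance check, under illustrative names (serverTeamCountIsBalanced and teamsPerServer are not the actual FDB identifiers; minServerTeamOnServer and maxServerTeamOnServer come from the message above):

```cpp
#include <algorithm>
#include <map>
#include <string>

// Sketch only: bound the spread between the least- and most-loaded servers
// instead of bounding the absolute team count per server.
bool serverTeamCountIsBalanced(const std::map<std::string, int>& teamsPerServer,
                               int maxAllowedSpread) {
    if (teamsPerServer.empty()) return true;
    auto cmp = [](const auto& a, const auto& b) { return a.second < b.second; };
    auto [minIt, maxIt] =
        std::minmax_element(teamsPerServer.begin(), teamsPerServer.end(), cmp);
    int minServerTeamOnServer = minIt->second;
    int maxServerTeamOnServer = maxIt->second;
    return maxServerTeamOnServer - minServerTeamOnServer <= maxAllowedSpread;
}
```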
fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData
The quiet database check now ensures that tlogs have no more than 30 seconds of versions unpopped from the disk queue.
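A minimal sketch of the quieting condition, assuming FDB's convention of roughly one million versions per wall-clock second (the function name and parameters are illustrative):

```cpp
#include <cstdint>

// Sketch only: the oldest unpopped version in the tlog's disk queue must be
// within 30 seconds' worth of versions of the current version.
bool tlogQueueQuiet(int64_t currentVersion, int64_t minUnpoppedVersion,
                    int64_t versionsPerSecond = 1000000) {
    return currentVersion - minUnpoppedVersion <= 30 * versionsPerSecond;
}
```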
The previous commit is a merge with master, which merged
pull request #1062 from jzhou77/PR, adding a new DataDistribution role.
The merge caused conflicts and errors in simulation tests.
This commit resolves the code conflicts and
fixes the new errors introduced by the new DataDistribution role.
When moving keys to a team, if one of the servers in the target team dies, the
move can become stuck. This is because the DDTeamCollection waits for all
the data movement off the failed server to be completed. However, in this case,
because the movement has not finished yet, checking the database tells us there
are no keys associated with this server and that it is safe to go ahead. In
reality, only the in-memory structure knows there is pending movement, i.e.,
the unfinished move still attributes some keys to the failed server. Thus, the
server can't be removed yet. Fix by adding a check against the in-memory
structure in waitForAllDataRemoved().
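A minimal sketch of the fixed condition (names are illustrative; the real check lives in waitForAllDataRemoved() and consults the DDTeamCollection's in-memory shard tracking):

```cpp
// Sketch only: the database saying the server owns no keys is necessary but
// not sufficient; the in-memory structure must also report zero shards still
// attributed to the failed server before removal may proceed.
bool safeToRemoveServer(bool databaseShowsNoKeys, int inMemoryShardsForServer) {
    return databaseShowsNoKeys && inMemoryShardsForServer == 0;
}
```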
Use const& to optimize a few function parameters.
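An illustrative example of the pattern (not one of the actual call sites):

```cpp
#include <string>
#include <vector>

// Passing by const& avoids copying the vector on every call while still
// guaranteeing the callee cannot modify it.
size_t totalLength(const std::vector<std::string>& names) {
    // was: size_t totalLength(std::vector<std::string> names) (a full copy per call)
    size_t n = 0;
    for (const auto& s : names) n += s.size();
    return n;
}
```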
The quiet database check can fail to send out requests and report a timeout.
This seems to be caused by retrying with a request that reuses the same
ReplyPromise. Another bug is that the Proxy can wait needlessly for a database
change when the distributor is already known to it.
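An analogy with std::promise, which shares ReplyPromise's fulfill-once semantics (illustrative, not the actual FDB call site):

```cpp
#include <future>
#include <iostream>

int main() {
    std::promise<int> reply;
    reply.set_value(42);     // first attempt fulfills the promise
    try {
        reply.set_value(43); // a retry reusing the same promise can never deliver
    } catch (const std::future_error& e) {
        std::cout << "reused promise rejected: " << e.what() << "\n";
    }
    // The fix pattern: construct a fresh request (fresh promise) per attempt.
    return 0;
}
```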
After the controller starts a data distributor, it waits on that one and ignores
any rejoins received later.
Add remoteRecovered() to data distribution for remote team collection.
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.
TODO: move DataDistributorInterface into ServerDBInfo.
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If the DataDistributor (DD) doesn't join within a while, the ClusterController (CC)
tries to recruit a worker as the DD. The CC also monitors the DD and restarts it if it fails.
The Proxy also monitors the DD; if the DD fails, the Proxy asks the CC for
the new DD.
Add a GetRecoveryInfo RPC to the master server, which the data distributor
calls to obtain the recovery transaction version from the master.
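A minimal sketch of what the request/reply pair could look like in Flow; only the GetRecoveryInfo name comes from the message above, while the struct layout, field names, and includes are assumptions:

```cpp
#include "fdbclient/FDBTypes.h"
#include "fdbrpc/fdbrpc.h"

struct GetRecoveryInfoReply {
    Version recoveryTransactionVersion; // version the master recovered to
    template <class Ar>
    void serialize(Ar& ar) {
        serializer(ar, recoveryTransactionVersion);
    }
};

struct GetRecoveryInfoRequest {
    ReplyPromise<GetRecoveryInfoReply> reply;
    template <class Ar>
    void serialize(Ar& ar) {
        serializer(ar, reply);
    }
};
```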
DESIRED_TEAMS_PER_MACHINE must equal DESIRED_TEAMS_PER_SERVER;
otherwise, we may have too few machine teams to create enough server teams.
Note that the BUGGIFY macro's value is based on a random number generator;
with two separate BUGGIFY sites, one may be true while the other is false.
Also fix a bug in getting the number of healthy machine teams.
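A sketch in the style of FDB's Knobs.cpp showing why tying the knobs together sidesteps the BUGGIFY issue (the exact lines and values are assumptions):

```cpp
// With two independent BUGGIFY sites the randomizer could buggify one knob
// but not the other, breaking the required equality. Deriving the machine
// knob from the server knob keeps them equal under any randomization.
init( DESIRED_TEAMS_PER_SERVER, 5 ); if( randomize && BUGGIFY ) DESIRED_TEAMS_PER_SERVER = 1;
init( DESIRED_TEAMS_PER_MACHINE, DESIRED_TEAMS_PER_SERVER );
```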
When the total number of teams is larger than the desired number,
we should gracefully remove the redundant teams so that
the number of teams stays low and the possibility of
losing data remains extremely low even when multiple
racks fail at the same time.
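A minimal sketch of one graceful-removal step, under illustrative types (the real DDTeamCollection also migrates shards off a team before dropping it):

```cpp
#include <algorithm>
#include <vector>

struct Team {
    std::vector<int> servers; // indices into teamsPerServer
};

// teamsPerServer[s] counts how many teams server s belongs to. Removes one
// redundant team per call; returns false when nothing is safely removable.
bool removeOneRedundantTeam(std::vector<Team>& teams,
                            std::vector<int>& teamsPerServer,
                            int desiredTeamsPerServer) {
    auto it = std::find_if(teams.begin(), teams.end(), [&](const Team& t) {
        // Only drop a team if every member server stays at or above the
        // desired count afterwards, so removal never starves a server.
        return std::all_of(t.servers.begin(), t.servers.end(), [&](int s) {
            return teamsPerServer[s] > desiredTeamsPerServer;
        });
    });
    if (it == teams.end()) return false;
    for (int s : it->servers) --teamsPerServer[s];
    teams.erase(it);
    return true;
}
```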
Magnify the possibility that the number of created machine teams is
larger than the number of desired machine teams when we do NOT try to remove the surplus machine teams.
This helps test the upgrade to machine teams in FDB 6.1.
Call the traceTeamCollectionInfo function to record the team counts
when we add a team directly from the shard information, instead of
using the addTeamsBestOf logic.
The current simulator does not validate whether the number of teams in
the system is larger than the maximum desired number of teams.
This validation should be added because we do NOT want too many teams
in the system, which may impede the system's availability when
multiple fault zones (e.g., machines) crash at the same time.
This commit adds the test to the consistency check in simulation.
Since the current code does not handle the upgrade situation
when we enforce machine teams, the test is expected to fail.
A later commit will handle the upgrade by gracefully
removing the surplus teams.
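A minimal sketch of the consistency-check assertion (names are illustrative; the real check runs inside the simulation's consistency check):

```cpp
// Sketch only: flag a failure when the system holds more teams than the
// configured maximum desired number allows.
bool teamCountWithinLimit(int totalTeams, int serverCount, int maxTeamsPerServer) {
    return totalTeams <= serverCount * maxTeamsPerServer;
}
```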
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
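Roughly, the change this enables (a sketch; the actor body and include are illustrative):

```cpp
#include "flow/flow.h"

// Before, each wait needed "Void _ = wait(f);", and plain clang then saw two
// definitions of `_` in the same scope. The new actorcompiler accepts a bare
// wait(f), so the un-actorcompiled source also parses cleanly.
ACTOR Future<Void> example(Future<Void> a, Future<Void> b) {
    wait(a); // previously: Void _ = wait(a);
    wait(b); // previously: Void _ = wait(b); (duplicate `_` definition)
    return Void();
}
```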
canKillProcess logic was wrong.
We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
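A minimal sketch of the rule checks (function names are illustrative; the real check hooks into TraceEvent logging in simulation):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Detail names: start uppercase, no underscores.
bool validDetailName(const std::string& name) {
    return !name.empty() && std::isupper((unsigned char)name[0]) &&
           name.find('_') == std::string::npos;
}

// Type names: same rules, but one underscore is allowed (Context_Type),
// and the character right after the underscore must be uppercase.
bool validTypeName(const std::string& name) {
    if (name.empty() || !std::isupper((unsigned char)name[0])) return false;
    size_t u = name.find('_');
    if (u == std::string::npos) return true;
    if (name.find('_', u + 1) != std::string::npos) return false;
    return u + 1 < name.size() && std::isupper((unsigned char)name[u + 1]);
}

void checkEventNames(const std::string& type, const std::string& detail) {
    if (!validTypeName(type) || !validDetailName(detail))
        fprintf(stderr, "TraceEvent naming violation: %s.%s\n", type.c_str(), detail.c_str());
}
```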