Change to remove machine team with most machine teams, using the same
logic as the serverTeamRemover.
The featue is guarded by TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
time checked on the proxy should be less than the time waited by the master to account for clock speed differences
setting REQUIRED_MIN_RECOVERY_DURATION and ENFORCED_MIN_RECOVERY_DURATION to 0 will go back to the old behavior
Formerly, they would prefer to peek from the primary's logs. Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers. It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.
Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags were decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags. This
mismatch would mean that we could have:
Log0 -2:0 ----- -2:0 Log 0
Log1 -2:1 \
>--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0) Log 1
Log2 -2:2 /
And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.
This was never noticed before, as we never
used a satellite as a preferred location to peek from. Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.
We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.
Unfortunately, previously existing will potentially have existing
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition. Old LogSets will have
an old LogVersion, and we won't prefer the sattelite for peeking. Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
RestoreMaster may not receive all acks. for the last command, i.e., finishRestore,
because RestoreLoaders and RestoreAppliers exit immediately after sending the ack.
If the ack is lost, it will not be resent.
This commit also removes some unneeded code.
This commit passes 50k random tests without errors.
1) Use the runRYWTransaction for simple DB access
2) Replace some printf with TraceEvent
3) Remove printf not used in debugging
4) Avoid wait inside the condition in loop-choose-when for
the core routine of restore worker, loader and applier.
5) Rename Restore.actor.cpp to RestoreWorker.actor.cpp since
the file only has functionalities related to restore worker.
Passed correctness test
- ignorePopDeadline to have highier limit in simulator
to accommdate for the buggify delays and make snapshot succeed.
- introduce a new knob for auto resetting the disabling of tlog pop
So as to not make filesystem assumptions. This knob did technically
appear in (only the) 6.1.5 release, but this feature was broken 6.1.5,
so thus impossible to use anyway.
And instead create a new file while incrementally truncating the old one
down. This avoids queueing up a massive number of filesystem metadata
operations in one call, thus flooding the disk with requests and
stalling out all other filesystem operations.
This sets the knobs so that a truncate of >10GB causes us to create a
new file rather than trying to truncate the old one.
Theoretically, we could spill 20MB of 22B mutations for one key, which
would generate a very long value being stored in SQLite, and very
inefficiently read back. This stops that from being a problem, at the
cost of some extra write calls.
Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.
The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled. This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.
Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
Add a new role for ratekeeper.
Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
After we add a new data distributor role, we publish the data
related to data distributor and rate keeper through the new
role (and new worker).
So the status needs to contact the data distributor, instead of master,
to get the status information.
The buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, the data movement can become stuck.
Fix this by adding an ACTOR monitorHealthyTeams that builds team every one
second whenever there is no healthy teams.
Clean up storageServerTracker() interface.
Added three knobs to control team remover
bool TR_FLAG_DISABLE_TEAM_REMOVER:
Disable the teamRemover actor
double TR_REMOVE_MACHINE_TEAM_DELAY:
Wait for the specified time before try to remove next machine team
double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY:
Wait before checking if all machines are healthy
1) Reduce the frequency of checking if we need to call teamRemover
2) Improve code efficiency in finding the machine team to remove
3) Remove unused code
4) Add sanity check
DESIRED_TEAMS_PER_MACHINE must equal to DESIRED_TEAMS_PER_SERVER.
Otherwise, we may have to few machine teams to create enough server teams.
Note that BUGGIFY macro value is based on a random number generator.
When you have two BUGGIFY, one may be true and the other is false.
Also fix a bug in get the number of healthy machine teams.
Magnify the possibility that the number of created machine teams is
larger than the number of desired machine teams if we do NOT try to remove the surplus machine teams.
This help test the upgrade to machine team in FDB 6.1
There are several missing includes for cmath in the code, I added those.
Next, Coro returns a reference to a stack variable and this causes a
warning. As this is probably ok for Coro, I disabled the warning in
that file for GCC. I want to have this warning in the build system as
it is generally a very useful warning to have.
Another change is that major and minor are deprecated for a while now.
I replaced those with gnu_dev_major and gnu_dev_minor.
ErrorOr currently implements operators ==, !=, and <. These do not
compile because Error does not implement ==. This compiles on older
versions of gcc and clang because ErrorOr<T>::operator== is not used
anywhere. It is still wrong though and newer gcc versions complain.
I simply removed these methods.
The most interesting fix is that TraceEvent::~TraceEvent is currently
throwing exceptions. This is illegal behavior in C++11 and a idea in
older versions of C++. For now I simply removed the throw, but this
might need some more thought.