atomicOp has an amplified performance overhead on the cluster.
For example, the operand of an ADD operation can be small, but SS has to
load the existing value to do the operation, and that value can be large.
When we pipeline multiple version batches, we should prevent a later
version batch from blocking the earlier version batch by consuming
CPU resources.
To achieve the above, we should assign higher priority to actors
in later phases in a version batch.
Because restore master will not invoke an actor at a later phase until
the actors at the earlier phases have finished, this priority assignment
will not cause deadlock.
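
The priority rule above can be sketched with a plain priority queue. This is a minimal illustration in standard C++, not the Flow actor code from this commit. The names `Phase`, `Task`, and `runInPriorityOrder` are hypothetical, the three phases are placeholders, and breaking ties in favor of the earlier version batch is an assumption.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical phases of one version batch, in execution order.
enum class Phase : int { LoadFiles = 0, SendMutations = 1, ApplyToDB = 2 };

struct Task {
    int batchIndex; // which version batch the task belongs to
    Phase phase;    // how far along the batch this task is
};

// Returns true when `a` should run after `b`.
// A later phase wins; on a tie the earlier batch wins (an assumption).
struct LowerPriority {
    bool operator()(const Task& a, const Task& b) const {
        if (a.phase != b.phase)
            return static_cast<int>(a.phase) < static_cast<int>(b.phase);
        return a.batchIndex > b.batchIndex;
    }
};

// Drains the ready tasks in the order a priority scheduler would run them.
std::vector<Task> runInPriorityOrder(const std::vector<Task>& ready) {
    std::priority_queue<Task, std::vector<Task>, LowerPriority> q(LowerPriority{}, ready);
    std::vector<Task> order;
    while (!q.empty()) {
        order.push_back(q.top());
        q.pop();
    }
    return order;
}
```

With this ordering, work that finishes an earlier batch is never starved by the loading phase of a later batch.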
1) Sort logfiles by endVersion
2) Exit program early when restore will not succeed
3) Do not increase nextVersion unnecessarily when
calculating version batches.
4) Change assert condition that ensures progress in
calculating version batches.
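
As a rough illustration of item 1, sorting log files by endVersion could look like the sketch below. The `LogFile` struct and `sortByEndVersion` are hypothetical stand-ins, not the types used in the restore code, and a stable sort is chosen here only so that files with equal endVersion keep their input order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one backup log file.
struct LogFile {
    std::string fileName;
    int64_t beginVersion;
    int64_t endVersion;
};

// Orders log files by ascending endVersion; equal endVersions keep input order.
inline void sortByEndVersion(std::vector<LogFile>& files) {
    std::stable_sort(files.begin(), files.end(),
                     [](const LogFile& a, const LogFile& b) { return a.endVersion < b.endVersion; });
}
```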
1) Remove endVersion field because it has been included in RestoreAsset;
2) Ensure endVersion in VersionBatch and RestoreAsset is always exclusive;
3) Revise ASSERT in loader and applier for the situation where the dummy commit version
is endVersion, to avoid false-positive ASSERT failures.
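
The "always exclusive" convention in item 2 amounts to treating every version range as half-open. A minimal sketch follows, assuming a hypothetical `VersionRange` type rather than the real VersionBatch or RestoreAsset definitions:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical half-open version range [beginVersion, endVersion).
struct VersionRange {
    int64_t beginVersion; // inclusive
    int64_t endVersion;   // exclusive

    bool contains(int64_t v) const { return beginVersion <= v && v < endVersion; }
    bool empty() const { return beginVersion >= endVersion; }
};

// Two half-open ranges overlap only if each begins before the other ends.
inline bool overlaps(const VersionRange& a, const VersionRange& b) {
    return a.beginVersion < b.endVersion && b.beginVersion < a.endVersion;
}
```

With exclusive ends, adjacent ranges such as [0, 10) and [10, 20) share no version, so a version equal to endVersion belongs to the next range and has to be handled explicitly by any check on the current one.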
1. Review memory use cases and improve:
Ensure state variables are initialized and
change unnecessary state variables to non-state variables.
2. Remove debug code that is no longer useful;
3. Mute verbose debug.
1) Use map iterator instead of pointer to maintain stability when entries are inserted into or erased from the map
2) dummySampleWorkload: clear rangeToApplier data in each sampling phase. Otherwise, we can
have an increasing number of keys assigned to the applier.
1) Do not keep restore role data (e.g., masterData) in restore worker;
2) Change function parameter list by only passing in the needed variables in role data;
3) Remove unnecessary files vector from masterData;
4) Fix typos in comments and some function names.
RestoreMaster may not receive all acks for the last command, i.e., finishRestore,
because RestoreLoaders and RestoreAppliers exit immediately after sending the ack.
If the ack is lost, it will not be resent.
This commit also removes some unneeded code.
This commit passes 50k random tests without errors.
1) Use the runRYWTransaction for simple DB access
2) Replace some printf with TraceEvent
3) Remove printf not used in debugging
4) Avoid wait inside the condition in loop-choose-when for
the core routine of restore worker, loader and applier.
5) Rename Restore.actor.cpp to RestoreWorker.actor.cpp since
the file only has functionalities related to restore worker.
Passed correctness test
1) Remove global map to buffer the parsed mutations on loader.
Use local map instead to increase parallelism.
2) Use std::map<LoadingParam, Future<Void>> to hold the actor
that parses a backup file and to de-duplicate requests.
3) Remove unused code.
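
The de-duplication idea in item 2 can be sketched in standard C++ by using `std::shared_future<void>` in place of Flow's `Future<Void>`. The `LoadingParam` fields, the `FileLoader` class, and the `parsesRun` counter below are hypothetical, and the "parse" is a placeholder that only counts how often it runs.

```cpp
#include <cassert>
#include <future>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for LoadingParam: just enough fields to act as a map key.
struct LoadingParam {
    std::string filename;
    long offset;
    bool operator<(const LoadingParam& o) const {
        return std::tie(filename, offset) < std::tie(o.filename, o.offset);
    }
};

class FileLoader {
public:
    // Returns the future for this file block. Only the first request for a
    // given LoadingParam creates a parse task; duplicates reuse that task.
    std::shared_future<void> load(const LoadingParam& param) {
        auto it = inFlight.find(param);
        if (it != inFlight.end())
            return it->second;
        // Deferred launch: the placeholder parse runs on the first wait().
        std::shared_future<void> f =
            std::async(std::launch::deferred, [this] { ++parsesRun; }).share();
        inFlight.emplace(param, f);
        return f;
    }

    int parsesRun = 0; // how many placeholder parses actually executed

private:
    std::map<LoadingParam, std::shared_future<void>> inFlight;
};
```

A repeated request for the same file block then waits on the existing future instead of parsing the file a second time.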
1) Add sendBatchRequests and getBatchReplies
sendBatchRequests is a generic actor to send requests without
processing replies.
getBatchReplies is similar to sendBatchRequests except that
it returns the replies to the caller.
2) Share applier interface to loaders by using RequestStream,
instead of using DB.
Create RestoreSysInfo struct, with a similar purpose to DBInfo, for
the restore system information that is shared among restore workers.
Add a NotifiedVersion to the applier data, which represents
the smallest version the applier is at.
When a loader sends mutation vector to appliers, it sends
the request that contains prevVersion and commitVersion.
This commit also puts actors into an actorCollector for
loop-choose-when situations.
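
One way to picture the prevVersion/commitVersion handshake is the single-threaded sketch below. It is an assumption-laden illustration, not the Flow implementation: a plain `int64_t` replaces NotifiedVersion, out-of-order requests are buffered instead of waited on, and `SendMutationsRequest` and `Applier` are hypothetical types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical request a loader sends to an applier.
struct SendMutationsRequest {
    int64_t prevVersion;   // version the applier must have reached first
    int64_t commitVersion; // version the applier reaches after applying
    std::vector<std::string> mutations;
};

class Applier {
public:
    void receive(const SendMutationsRequest& req) {
        assert(req.prevVersion < req.commitVersion);
        if (req.commitVersion <= version)
            return; // already applied, so this is a duplicate delivery
        pending.emplace(req.prevVersion, req);
        // Apply every buffered request whose prevVersion matches the current version.
        for (auto it = pending.find(version); it != pending.end(); it = pending.find(version)) {
            applied.insert(applied.end(), it->second.mutations.begin(), it->second.mutations.end());
            int64_t next = it->second.commitVersion;
            pending.erase(it);
            version = next;
        }
    }

    int64_t version = 0;              // stands in for the NotifiedVersion
    std::vector<std::string> applied; // mutations in the order they were applied

private:
    std::map<int64_t, SendMutationsRequest> pending; // keyed by prevVersion
};
```

Requests that arrive early stay buffered until the version chain catches up, so mutations are applied in version order regardless of arrival order.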
When mutationVectorThreshold is not 1, a loader sends a vector of
mutations to an applier.
We should never mix mutations at different versions into the same vector.
The code in the previous commit may mix mutations at different versions.
This commit resolves the bug.
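
The invariant can be illustrated with a small batching routine. This is a sketch under assumptions: the input is already ordered by version, the `threshold` parameter plays the role of mutationVectorThreshold, and the `VersionedMutation`, `MutationBatch`, and `batchByVersion` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct VersionedMutation {
    int64_t version;
    std::string mutation;
};

// One vector of mutations sent to an applier; all mutations share one version.
struct MutationBatch {
    int64_t version;
    std::vector<std::string> mutations;
};

// Splits version-ordered mutations into batches of at most `threshold`
// mutations, starting a new batch whenever the version changes.
std::vector<MutationBatch> batchByVersion(const std::vector<VersionedMutation>& input,
                                          std::size_t threshold) {
    assert(threshold >= 1);
    std::vector<MutationBatch> out;
    for (const auto& m : input) {
        bool startNew = out.empty() || out.back().version != m.version ||
                        out.back().mutations.size() >= threshold;
        if (startNew)
            out.push_back(MutationBatch{m.version, {}});
        out.back().mutations.push_back(m.mutation);
    }
    return out;
}
```

A batch is cut on a version change even if it is not full, which is the property the earlier code could violate.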
In the sampling phase, a loader will cache the mutations into kvOps map;
In the loading log file phase, the loader will do the same thing.
The loader must clear the kvOps map once the loader has used it; otherwise,
it will cache the sampled mutations twice, which leads to an
inconsistent restored DB.
This commit identifies the bug that
causes the DB to be restored to an inconsistent state.
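
The effect of clearing the cache can be shown with a small sketch. The `Loader` class, `parse`, and `takeKvOps` are hypothetical names, and the map layout (version to list of mutations) is an assumption about what kvOps holds.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

class Loader {
public:
    using KvOps = std::map<int64_t, std::vector<std::string>>;

    // Caches parsed mutations by version; used by both the sampling
    // phase and the log-file loading phase.
    void parse(const std::vector<std::pair<int64_t, std::string>>& parsed) {
        for (const auto& p : parsed)
            kvOps[p.first].push_back(p.second);
    }

    // Hands the cached mutations to the caller and leaves the cache empty,
    // so a later phase cannot see mutations cached by an earlier one.
    KvOps takeKvOps() {
        KvOps out;
        out.swap(kvOps);
        return out;
    }

private:
    KvOps kvOps;
};
```

If the sampling phase read the map without clearing it, the same mutation parsed again in the loading phase would appear twice under its version.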
The cmdid is used to achieve exactly-once delivery even when
the network can deliver a request twice.
This relies on the assumption that cmdid is unique for each request!
However, this assumption may not hold for
the phase Loader_Send_Mutations_To_Applier, when loaders send parsed
mutations to appliers:
1) When the same loader loads multiple files, we reset the cmdid
for the phase;
2) When different loaders load files, each loader's cmdid starts from
0 for the phase.
Both situations can break the assumption, which causes appliers to
miss some mutations to apply. This breaks the cycle test.
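
The failure mode can be reproduced with a small de-duplicating receiver. Everything here is a hypothetical sketch: `DedupApplier` is not the real applier, and the scoped identifier (loader id, file index, cmdid) is only one possible way to restore uniqueness, not necessarily the fix adopted in the restore code.

```cpp
#include <cassert>
#include <set>
#include <tuple>

using BareId = int;                           // cmdid alone
using ScopedId = std::tuple<int, int, int>;   // (loaderId, fileIndex, cmdid), an assumed scheme

// Applies each request at most once, keyed by its identifier.
template <class Id>
class DedupApplier {
public:
    // Returns true if the request was applied, false if it was treated as a duplicate.
    bool receive(const Id& id) {
        bool fresh = seen.insert(id).second;
        if (fresh)
            ++appliedCount;
        return fresh;
    }

    int appliedCount = 0;

private:
    std::set<Id> seen;
};
```

With a bare cmdid, the first request from a second loader (cmdid 0) is indistinguishable from a redelivery of the first loader's request and is dropped; with an identifier that is unique per request, only true redeliveries are dropped.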