* Replace KeyRange with std::vector<KeyRange> in DataMoveMetaData and
CheckpointMetaData.
* Checked if ranges.empty().
* fmt.
* Resolved some comments.
Co-authored-by: He Liu <heliu@apple.com>
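Below is a minimal sketch of the shape of this change, using placeholder types rather than the real FDB definitions (the actual structs live in the fdbclient headers and carry many more fields); only the move from a single range to `ranges` plus the `ranges.empty()` check reflects the commit:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for FDB's KeyRange.
struct KeyRange {
	std::string begin;
	std::string end;
};

// Sketch: the metadata now carries a set of ranges instead of a single range.
struct DataMoveMetaData {
	std::vector<KeyRange> ranges; // previously: KeyRange range;
};

bool isValid(const DataMoveMetaData& meta) {
	// Callers now have to handle the empty case explicitly.
	return !meta.ranges.empty();
}
```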
* Proactively clean up idempotency ids for successful commits
This change also includes some minor changes from my branch working on
an idempotency id cleaner that I'd like to get merged sooner rather
than later.
- Adding a timestamp to idempotency values
- Making IdempotencyId an actor file
- Adding commit_unknown_result_fatal
- Checking idempotencyIdsExpiredVersion in determineCommitStatus
- Some testing QOL changes
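As a rough sketch of the timestamping idea behind this commit (the real encoding, key layout, and lifetime knob in FDB differ; the function names here are assumptions), each idempotency value can carry the time it was written so a cleaner can expire entries past a configured lifetime:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Sketch: prefix each idempotency value with the commit wall-clock time so a
// background cleaner can expire entries older than a configured lifetime.
std::string encodeIdempotencyValue(int64_t nowSeconds, const std::string& payload) {
	std::string value(sizeof(nowSeconds), '\0');
	std::memcpy(value.data(), &nowSeconds, sizeof(nowSeconds));
	return value + payload;
}

bool isExpired(const std::string& value, int64_t nowSeconds, int64_t lifetimeSeconds) {
	int64_t writtenAt = 0;
	std::memcpy(&writtenAt, value.data(), sizeof(writtenAt));
	return nowSeconds - writtenAt > lifetimeSeconds;
}
```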
* Factor out decodeIdempotencyKey logic
* Fix formatting
* Update flow/include/flow/error_definitions.h
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
* Use KeyBackedObjectProperty for idempotencyIdsExpiredVersion
* Add IDEMPOTENCY_ID_IN_MEMORY_LIFETIME knob
* Rename ExpireIdempotencyKeyValuePairRequest
Also add a code probe for the case where an ExpireIdempotencyIdRequest is
received before the count is known, and add an assert
* Fix formatting and add TODO for nwijetunga
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
In FDB 7.1, the cluster ID key was stored in the txnStateStore. In 7.2, it has
been moved to the database. This was causing protocol compatibility
issues during upgrades, so we need to rename the key.
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
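A hypothetical sketch of the boot-time flow described above; none of these names are real FDB interfaces, and the actual implementation reads a system key and broadcasts over the cluster controller's interfaces:

```cpp
#include <optional>
#include <string>
#include <vector>

// Placeholder for a simulated process that persists the cluster ID it receives.
struct Process {
	std::optional<std::string> persistedClusterId;
};

std::string readClusterIdFromDatabase() {
	return "cluster-id-from-db"; // stand-in for reading the cluster ID key
}

void onClusterControllerBoot(std::vector<Process>& processes) {
	const std::string clusterId = readClusterIdFromDatabase();
	for (Process& p : processes)
		p.persistedClusterId = clusterId; // each process persists the ID it receives
}
```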
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.
The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to fully
recover.
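A hedged sketch of that start-up check; the names and the shape of the simulator state here are assumptions, not the actual simulator code:

```cpp
#include <string>

// Hypothetical, simplified simulator state.
struct SimProcess {
	std::string clusterFile;         // cluster file the process was started with
	std::string originalClusterFile; // cluster file it should revert to
};

struct SimState {
	bool allProcessesShouldHaveReverted = false; // global "switch back" phase
};

enum class KillType { None, RebootProcessAndSwitch };

// On process start: if the cluster file was switched but the simulation has
// already entered the phase where every process should have reverted,
// immediately send RebootProcessAndSwitch so this process does not miss it.
KillType onProcessStart(const SimState& sim, const SimProcess& p) {
	if (sim.allProcessesShouldHaveReverted && p.clusterFile != p.originalClusterFile)
		return KillType::RebootProcessAndSwitch;
	return KillType::None;
}
```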
Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.
This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
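A tiny hypothetical sketch of that per-process flag (the real field name and struct differ):

```cpp
#include <string>

// Sketch: mark whether a simulated process belongs to the DR cluster, since
// the simulator otherwise only distinguishes processes by address.
struct SimProcessInfo {
	std::string address;
	bool drProcess = false;
};
```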
In the following case, the shared reference count causes the watch to terminate prematurely:
1. A watch on key A is set; the watchValueMap ACTOR, denoted X, starts waiting.
2. All watches are cleared due to a connection string change.
3. The watch on key A is restarted with a new watchValueMap ACTOR, Y.
4. X receives the cancel exception and tries to decrement the reference count, which causes Y to be cancelled.
Recording the version of each watch would help prevent this issue.
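A hedged sketch of the "record the version of each watch" idea; the real code uses flow types in NativeAPI.actor.cpp, and the names below are assumptions:

```cpp
#include <cstdint>
#include <map>
#include <string>

struct WatchEntry {
	int64_t version; // version at which the current watch was (re)started
	int refCount = 0;
};

std::map<std::string, WatchEntry> watches;

// A stale actor (X) finishing with a cancel exception passes the version it
// was started with; if a newer watch (Y) has since replaced the entry, the
// release is a no-op and Y is left untouched.
void releaseWatch(const std::string& key, int64_t startedAtVersion) {
	auto it = watches.find(key);
	if (it == watches.end() || it->second.version != startedAtVersion)
		return; // the entry now belongs to a newer watch
	if (--it->second.refCount == 0)
		watches.erase(it);
}
```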
Currently, there is a cyclic reference:
DatabaseContext -> WatchMetadata -> watchStorageServerResp ->
DatabaseContext
If a watch is created in a DatabaseContext, then even after the
corresponding wait ACTOR is cancelled, the WatchMetadata still holds
a reference to the watchStorageServerResp ACTOR, which in turn holds a
reference to the DatabaseContext.
In this situation, any DatabaseContext that held a watch will not be
destructed automatically, since its reference count never drops to 0
until the watched value changes. Every time the cluster recovers,
several watches are created, and when the cluster restarts, a
DatabaseContext that is no longer in use cannot be destructed because
of these watches.
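A minimal illustration of that cycle, using std::shared_ptr in place of flow's Reference<> and a plain struct in place of the actor's state (the names mirror the diagram above but are otherwise placeholders):

```cpp
#include <memory>

struct WatchMetadata;

struct DatabaseContext {
	std::shared_ptr<WatchMetadata> watchMetadata;
};

// Stand-in for the state kept alive by the watchStorageServerResp ACTOR.
struct WatchStorageServerRespState {
	std::shared_ptr<DatabaseContext> cx; // the actor keeps the context alive
};

struct WatchMetadata {
	std::shared_ptr<WatchStorageServerRespState> watchActor;
};

int main() {
	auto cx = std::make_shared<DatabaseContext>();
	cx->watchMetadata = std::make_shared<WatchMetadata>();
	cx->watchMetadata->watchActor = std::make_shared<WatchStorageServerRespState>();
	cx->watchMetadata->watchActor->cx = cx; // cycle: cx can never be destroyed
	return 0;
}
```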
With this patch, each wait on a watch is counted. Whenever the watch is
triggered or the wait is cancelled, the corresponding count is
decremented. If a watch is no longer being waited on, the watch itself
is cancelled, effectively reducing the reference count on the
DatabaseContext. This should fix the issue described above.
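A hedged sketch of the counting scheme, with plain standard-library types standing in for the flow machinery (the real bookkeeping lives on WatchMetadata inside DatabaseContext):

```cpp
#include <string>
#include <unordered_map>

struct Watch {
	int waiters = 0;
};

struct WatchRegistry {
	std::unordered_map<std::string, Watch> watches;

	void addWaiter(const std::string& key) { ++watches[key].waiters; }

	// Called when a waiter is triggered or cancelled.
	void removeWaiter(const std::string& key) {
		auto it = watches.find(key);
		if (it == watches.end())
			return;
		if (--it->second.waiters == 0) {
			// No one is waiting any more: drop the watch so it releases its
			// reference to the DatabaseContext.
			watches.erase(it);
		}
	}
};
```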
The code was tested by 1) manually changing the number of logs of a local
cluster and observing the cluster recover and the previous DatabaseContext
get destructed; 2) a 100K joshua run with 1 failure; the same test also
fails on the current git main branch.