The memtable and write buffer sizes are too small in simulation, which causes
thousands of SST files and at least 6 levels of SSTs.
Both make compaction slower in simulation and contribute to timeout errors.
After increasing the sizes, the failure rate (timeout failures) when we run only the rocksdb and
sharded rocksdb engines in simulation drops from 10 out of 332339 tests to 10 out of 497532 tests.
For Apple devs who want to look into the Joshua details:
before the change, the Joshua ensemble id is 20221111-223720-mengxudebugrocks-505ede1c55664ddf
after the change, the Joshua ensemble id is 20221114-192042-mengxurocksdebugknobchange-1e4c047d112e9a38
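For context, the sketch below shows how memtable/write buffer sizing maps onto
RocksDB's column family options; the option choices and values are illustrative
examples only, not the actual simulation knob values.

    #include <rocksdb/options.h>

    // Illustrative tuning only: a larger memtable (write buffer) produces
    // fewer, larger L0 SSTs and fewer compaction runs per byte written,
    // which is the effect the knob increase above is after.
    rocksdb::ColumnFamilyOptions tunedColumnFamilyOptions() {
        rocksdb::ColumnFamilyOptions cf;
        cf.write_buffer_size = 64 << 20;           // 64 MiB memtable (example value)
        cf.max_write_buffer_number = 4;            // more immutable memtables in flight
        cf.level0_file_num_compaction_trigger = 4; // compact L0 less often
        return cf;
    }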
* Add byteLimit for prefetch
A fraction of byteLimit is used as the limit when fetching the index.
For the indexes fetched, records are then fetched for them in a batch.
byteLimit always counts the index size, and also counts the record size
if a record exists. At least one index-record entry is returned, and the
last entry is always included even if adding it exceeds the limit.
There is a knob, STRICTLY_ENFORCE_BYTE_LIMIT: when it is set, records
are discarded once the byteLimit is hit, even though they were already
fetched. Otherwise, the whole batch is returned.
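A minimal sketch of that accounting rule, using hypothetical names (Entry,
entriesWithinByteLimit); this is not the actual prefetch code, just the rule
described above:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Entry {
        std::string indexKey; // always counted against byteLimit
        std::string record;   // counted only if a record was fetched for this index
        size_t bytes() const { return indexKey.size() + record.size(); }
    };

    // Decide how many entries of a fetched batch to keep under byteLimit.
    // At least one entry is kept, and the entry that crosses the limit is
    // still included, mirroring the "always include the last entry" rule.
    size_t entriesWithinByteLimit(const std::vector<Entry>& batch, size_t byteLimit) {
        size_t used = 0;
        for (size_t i = 0; i < batch.size(); ++i) {
            used += batch[i].bytes();
            if (used >= byteLimit)
                return i + 1; // keep the entry that hit the limit
        }
        return batch.size(); // the whole batch fits
    }

With STRICTLY_ENFORCE_BYTE_LIMIT set, records fetched past that cutoff would
then be dropped before returning.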
* Network setup to fail on the initialization failures of external clients
* MVC: A more intuitive error when failing to load an API function
* Testing initializing FDB client with different configuration options
* Enable strict external client configuration check only for new API versions
* Upgrade FDB package version to 7.3.0; Update upgrade tests
* Allow multiple keyranges in CheckpointRequest.
Include DataMove ID in CheckpointMetaData.
* Use UID dataMoveId instead of Optional<UID>.
Co-authored-by: He Liu <heliu@apple.com>
* Exposing writeEntireFile up through BackupContainerFileSystem, and using it in blob worker
* Adding blob worker latency metrics
* avoid writeEntireFile if object is too large
* gracefully falling back to multi-part upload if the file is too big
* Added SSPhysicalShard.
* Update physicalShards in StorageServer::addShard().
* Handle notAssigned shard.
* fetchKeys() actors are not stopped during TerminateStorageServer since
physicalShards are not cleared.
* Fixed addingSplitLeft's unset shardId.
* Increased the timeout for Rocks reads in simulation.
* Cleanup.
* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.
* Disabled ValidateStorage test.
* Resolved comments.
Co-authored-by: He Liu <heliu@apple.com>
* Fix a potential memory error
The returned KeyRef should live at least as long as the supplied arena
* Improve keyAfter and singleKeyRange
Avoid duplicating the keyAfter implementation, and share memory between
begin and end for singleKeyRange returning a standalone
* Avoid creating and destroying an arena in a loop
This defeats some of the purposes of Arenas, namely to amortize the
cost of calling malloc and free and to improve cache locality.
* Improve Arena usage
Avoid an arena allocation in keyAfter - instead return a string with
static lifetime. I made sure to return the same memory as was just
brought into cache to inspect whether key == \xff\xff.
Also avoid creating and destroying an arena in a loop for encrypting
idempotency id sets.
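A sketch of the hoisting pattern, written against the flow Arena/StringRef
types; encryptValue() here is a hypothetical stand-in for the idempotency-id
encryption, not the actual code:

    #include <vector>
    #include "flow/Arena.h"

    // Placeholder for the real encryption step: just copies the value into
    // the arena so the sketch compiles.
    StringRef encryptValue(Arena& arena, StringRef plaintext) {
        return StringRef(arena, plaintext);
    }

    // Before: a fresh Arena per iteration pays a malloc/free pair every pass
    // and scatters allocations, defeating the amortization Arena provides.
    void encryptAllSlow(const std::vector<StringRef>& values) {
        for (const auto& v : values) {
            Arena arena;
            StringRef ciphertext = encryptValue(arena, v);
            (void)ciphertext;
        }
    }

    // After: one Arena for the whole batch; allocations share large blocks
    // and are released together when `arena` goes out of scope.
    void encryptAllFast(const std::vector<StringRef>& values) {
        Arena arena;
        for (const auto& v : values) {
            StringRef ciphertext = encryptValue(arena, v);
            (void)ciphertext;
        }
    }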
* Extend Tlog persistentStorage to persist encryption state
Description
diff-3: Address review comment.
diff-2: Extend ClusterController endpoints to allow querying the
cluster's encryptionAtRest status.
Update TLog recovery to ensure the on-disk encryption status
matches the cluster's encryptionAtRest state persisted in the
cluster's cstate.
diff-1: Store encryptionAtRestMode state in Coordinators
Major changes proposed are:
1. Extend TLog persistentStorage to persist the encryption state.
2. The persisted encryption state is derived from the corresponding
db-config and relevant SERVER_KNOBS. In the near future, the knobs
shall be removed.
3. On TLog startup, the persisted encryption state is compared against
the cluster configuration; on a mismatch, the TLog is killed and not
allowed to rejoin the cluster (sketched below).
Testing
devRunCorrectness - 100K
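A hedged sketch of the item-3 startup check; the names used here
(EncryptionAtRestMode values, verifyEncryptionAtRestMode) are illustrative,
not the actual TLog recovery code:

    #include <optional>
    #include <stdexcept>

    enum class EncryptionAtRestMode { DISABLED, DOMAIN_AWARE, CLUSTER_AWARE };

    // Compare the TLog's persisted encryption state with the cluster's
    // encryptionAtRest configuration; a mismatch means the on-disk data was
    // written under a different mode, so the TLog must not rejoin.
    void verifyEncryptionAtRestMode(std::optional<EncryptionAtRestMode> persisted,
                                    EncryptionAtRestMode clusterMode) {
        if (!persisted.has_value())
            return; // fresh disk: adopt the cluster's mode and persist it
        if (persisted.value() != clusterMode)
            throw std::runtime_error("encryptionAtRest mismatch: kill TLog, do not rejoin");
    }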
* Upgrade tests: dump thread call stacks of the tester process if it fails to terminate
* ApiTester: log before and after stopping the network thread
* Catch and print exceptions in closeTraceFile; Close trace file at the end of MVC runNetwork
* Change trace event name for MVC runNetwork termination
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
* Replace KeyRange with std::vector<KeyRange> in DataMoveMetaData and
CheckpointMetaData.
* Checked if ranges.empty().
* fmt.
* Resolved some comments.
Co-authored-by: He Liu <heliu@apple.com>
* Proactively clean up idempotency ids for successful commits
This change also includes some minor changes from my branch working on
an idempotency id cleaner, which I'd like to get merged sooner rather
than later.
- Adding a timestamp to idempotency values
- Making IdempotencyId an actor file
- Adding commit_unknown_result_fatal
- Checking idempotencyIdsExpiredVersion in determineCommitStatus
- Some testing QOL changes
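For the determineCommitStatus item above, a hedged sketch of how the
expired-version check could factor in; the names and exact logic are
illustrative, not the actual client code:

    #include <cstdint>

    enum class CommitStatus { Committed, NotCommitted, Unknown };

    // If the idempotency id is found, the commit happened. If it is not
    // found but ids at or below idempotencyIdsExpiredVersion may already
    // have been cleaned up, its absence proves nothing, so the outcome is
    // reported as unknown (commit_unknown_result_fatal) rather than guessed.
    CommitStatus determineCommitStatusSketch(int64_t commitVersion,
                                             int64_t idempotencyIdsExpiredVersion,
                                             bool idempotencyIdFound) {
        if (idempotencyIdFound)
            return CommitStatus::Committed;
        if (commitVersion <= idempotencyIdsExpiredVersion)
            return CommitStatus::Unknown;
        return CommitStatus::NotCommitted;
    }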
* Factor out decodeIdempotencyKey logic
* Fix formatting
* Update flow/include/flow/error_definitions.h
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
* Use KeyBackedObjectProperty for idempotencyIdsExpiredVersion
* Add IDEMPOTENCY_ID_IN_MEMORY_LIFETIME knob
* Rename ExpireIdempotencyKeyValuePairRequest
Also add a code probe for the case where an ExpireIdempotencyIdRequest is
received before the count is known, and add an assert
* Fix formatting and add TODO for nwijetunga
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has
been moved to the database. This was causing protocol compatibility
issues during upgrades, so we need to rename the key.
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.
The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to fully
recover.
Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.
This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
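A hedged sketch of the start-time check described above; the struct fields
and names are illustrative, not the simulator's actual members:

    struct SimProcessState {
        bool clusterFileSwitched = false; // cluster file changed while the process was down
        bool runningInDrCluster = false;  // the new per-process field mentioned above
    };

    enum class KillSignal { None, RebootProcessAndSwitch };

    // Called when a simulated process starts. If a cluster-wide switch (or
    // revert) of cluster files is in effect and this process missed the
    // original RebootProcessAndSwitch, re-send the signal now; this costs one
    // extra reboot but puts the process back on the correct cluster file.
    KillSignal signalOnProcessStart(const SimProcessState& p, bool clusterWideSwitchActive) {
        if (p.clusterFileSwitched && clusterWideSwitchActive)
            return KillSignal::RebootProcessAndSwitch;
        return KillSignal::None;
    }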
In the following case, the reference count causes the watch to terminate
prematurely:
1. A watch to key A is set, and the watchValueMap ACTOR, noted as X, starts waiting.
2. All watches are cleared due to a connection string change.
3. The watch to key A is restarted with watchValueMap ACTOR Y.
4. X receives the cancel exception and tries to decrement the reference counter. This causes Y to get cancelled.
Recording the version of each watch would help prevent this issue.
Currently, there is a cyclic reference:
DatabaseContext -> WatchMetadata -> watchStorageServerResp ->
DatabaseContext
If a watch is created in the DatabaseContext, then even when the
corresponding wait ACTOR is cancelled, the WatchMetadata still holds
a reference to the watchStorageServerResp ACTOR, which holds a reference
to the DatabaseContext.
In this situation, any DatabaseContext that holds a watch will not be
automatically destructed, since its reference count will never drop to
0 until the watch value changes. Every time the cluster recovers,
several watches are created, and when the cluster restarts, a
DatabaseContext that is no longer in use cannot be destructed because
of these watches.
With this patch, each wait on a watch is counted. Whether the watch is
triggered or cancelled, the corresponding count is decremented. If a
watch is no longer being waited on, the watch is cancelled, effectively
reducing the reference count of the DatabaseContext. This will
hopefully fix the issue mentioned above.
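A hedged sketch of the per-watch wait counting described above; the names
(WatchBookkeeping, onWaitStart/onWaitEnd) are hypothetical, and the real
bookkeeping lives in DatabaseContext/WatchMetadata:

    #include <functional>
    #include <map>
    #include <string>

    struct WatchBookkeeping {
        std::map<std::string, int> waiterCount;               // key -> outstanding waits
        std::function<void(const std::string&)> cancelWatch;  // stops the watch actor for a key

        void onWaitStart(const std::string& key) { ++waiterCount[key]; }

        // Called when a wait finishes, whether triggered or cancelled.
        void onWaitEnd(const std::string& key) {
            auto it = waiterCount.find(key);
            if (it == waiterCount.end())
                return;
            if (--it->second == 0) {
                // No one is waiting on this watch any more: cancel the
                // underlying actor so it releases its reference to the
                // DatabaseContext.
                waiterCount.erase(it);
                if (cancelWatch)
                    cancelWatch(key);
            }
        }
    };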
The code is tested by 1) manually changing the number of logs of a local
cluster, and observing the cluster recovery and the previous
DatabaseContext being destructed; 2) a 100K Joshua run with 1 failure,
where the same test also fails on the current git main branch.
* Fix a test timeout due to buggified knob MAX_WRITE_TRANSACTION_LIFE_VERSIONS
The buggified knob MAX_WRITE_TRANSACTION_LIFE_VERSIONS can be only 1M. In some
tests, the transaction always ends up with commitVersion - readVersion a little
above 1M, thus always getting the transaction_too_old error.
* Change MAX_COMMIT_BATCH_INTERVAL instead
So that the master may give out versions fast enough.
* Fix an assertion failure in a unit test
48125>>8 = 187, 48125 = 0xbbfd
48128>>8 = 188, 48128 = 0xbc00
So if 48125 is chosen as the index, 48128 changes the higher order byte.
48125 & 0xff7f = 47997 = 0xbb7d. Thus +5 won't change the higher order byte.
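A quick standalone check of that arithmetic (not the unit test itself):

    #include <cassert>
    #include <cstdint>

    int main() {
        // 48125 and 48128 straddle a 256 boundary, so choosing 48125 as the
        // index lets an offset as small as +3 flip the higher-order byte.
        assert((48125 >> 8) == 187 && (48128 >> 8) == 188);

        // Masking with 0xff7f pulls the index down to 47997 (0xbb7d), leaving
        // enough headroom that +5 stays within the same higher-order byte.
        uint32_t masked = 48125 & 0xff7f;
        assert(masked == 47997);
        assert(((masked + 5) >> 8) == (masked >> 8));
        return 0;
    }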