* EaR: Update ApiWorkload to validate encryption at-rest guarantees
Description
FDB encryption at-rest guarantees that if a cluster is configured with the
feature enabled, all data written to persistent disks is encrypted. Given that
FDB maintains multiple persistent stores during the lifecycle of the data, the
patch proposes a scheme to validate this invariant via simulation testing.
The patch updates the ApiCorrectness workload to do the following:
1. Enable the validation feature via client-supplied params and/or randomly.
2. When enabled, validation injects a known "marker string" into
workload-generated Key and Value data patterns.
3. On shutdown, if the validation is enabled, all test files are
scanned for the known "marker" pattern.
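
The inject-then-scan idea in the steps above can be sketched roughly as
follows. This is a minimal illustration, not the workload's actual code: the
marker value, the function names, and the chunked scan are all assumptions
made for the example.

```python
import os

# Hypothetical marker; the real string used by the workload is not given here.
MARKER = b"EaR-validation-marker"


def inject_marker(payload: bytes, marker: bytes = MARKER) -> bytes:
    """Prefix a generated key or value with the known marker (assumed placement)."""
    return marker + payload


def scan_for_marker(root: str, marker: bytes = MARKER, chunk_size: int = 1 << 20) -> list:
    """Return sorted paths of files under root that contain marker in plaintext."""
    if not marker:
        raise ValueError("marker must be non-empty")
    overlap = len(marker) - 1
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            tail = b""
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    # Keep the last len(marker)-1 bytes so a marker that
                    # straddles two chunks is still detected.
                    window = tail + chunk
                    if marker in window:
                        hits.append(path)
                        break
                    tail = window[-overlap:] if overlap else b""
    return sorted(hits)
```

With encryption enabled, the scan is expected to return an empty list; any
hit points at a file where marked data reached disk in plaintext.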
Simulation tests are already capable of doing the following:
1. Randomly select TenantMode (disabled/optional/required)
2. Randomly select EncryptionAtRestMode (cluster_aware/domain_aware)
Hence, the update validates all possible combinations. Also,
'defaultTenant' is present to cover 'domain_aware' encryption use cases.
Testing
devRunCorrectness
devRetryCorrectness - ApiCorrectness & EncryptedBackupCorrectness
* Define API for unsuppressable TraceEvent types
Add trace checking tests for authz trace events
* Revert temporary configurations used for debugging
* Simplify/Modernize flow audit logging API
- Do event type whitelist checks at compile time
- Use ""_audit literal API instead of a tag struct
- Replace int with a lightweight struct for tracking/modifying TraceEvent enablement
* Revert installing signal handler for SIGTERM and refactor test script
Move trace checker to local_cluster.py
* Lengthen public key refresh interval and add more audited events
* Try and make MSVC and Mac build happy
* consteval > constexpr
'inline consteval' still causes link errors in Mac builds
* Enable secure allocation mode in Arena
This mode allows zeroing out blocks holding sensitive data after use
* Introduce WipedString to all token-holding memory
Also introduce an option flag "sensitive"
* Make pointer equivalency a hard requirement for non-ASAN builds
So that we can detect when Arena/malloc/memory-wipe behavior changes
When the remote DC is down, the remote team collection of DD can still be
initializing, waiting for the remote to recover (all_tlog_recruited state).
However, getTeam requests can already be served by the remote team collection.
So a RelocateShard (data movement such as split or move) gets a team for the
remote DC, but the data movement can't make progress on that team because the
remote DC hasn't recovered yet. Because data movement is stuck, the primary
cannot reach the "storage_recovered" state and stays in the accepting_commit
state.
The specific test failure: slow/ApiCorrectness.toml -s 339026305 -b on
at commit: 0edd899d65
In this test, the primary DC has 1 SS killed, and the remote DC has 2 TLogs and
2 SSes killed. So the remote is dead: the remaining 2 SSes can't make progress
because of the loss of 2 TLogs. repairDeadDatacenter() can't reach the
"storage_recovered" state because DD fails to move shards away from the killed
SS in the primary.
The fix is to exclude all of the remote in repairDeadDatacenter(), which tells
DD to mark all SSes in the remote as unhealthy. Another fix is to return an
empty result for getTeam requests if the remote team collection is not ready.
This allows the data movement to continue; essentially, the remote team is left
unchanged for the data movement.
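
The second fix can be pictured with a small model. This is only a sketch of
the described behavior, not DD's implementation: TeamCollection,
plan_relocation, and the field names are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TeamCollection:
    """Hypothetical stand-in for a DD team collection (not FDB's actual class)."""
    ready: bool
    teams: List[List[str]] = field(default_factory=list)

    def get_team(self) -> Optional[List[str]]:
        # Return an empty result while the collection is still initializing,
        # instead of handing out a team that cannot make progress.
        if not self.ready or not self.teams:
            return None
        return self.teams[0]


def plan_relocation(
    primary: TeamCollection,
    remote: TeamCollection,
    current_remote_team: List[str],
) -> Tuple[Optional[List[str]], List[str]]:
    """Pick destination teams; keep the current remote team on an empty result."""
    new_primary = primary.get_team()
    new_remote = remote.get_team()
    if new_remote is None:
        new_remote = current_remote_team
    return new_primary, new_remote
```

In this model the relocation still proceeds on the primary side while the
remote side keeps whatever team it already had.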
Also, to minimize audit log loss, handle token usage audit logging at each usage.
This has a side-effect of making the token use log less bursty.
This also subtly changes the dedup cache policy.
The dedup time window used to be 5 seconds (default) from the start of batch logging.
Now it is 5 seconds from the first usage after the previous dedup window closes.
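
One way to read the new window policy is sketched below. It assumes a single
shared window for all tokens and an explicit `now` timestamp passed in by the
caller; the class and method names are hypothetical and do not come from the
patch.

```python
class DedupWindow:
    """Illustrative dedup cache: the window opens at the first usage seen
    after the previous window has closed (assumed reading of the policy)."""

    def __init__(self, window_seconds: float = 5.0):
        self.window_seconds = window_seconds
        self.window_start = None  # None means no window is currently open
        self.seen = set()

    def should_log(self, token_id: str, now: float) -> bool:
        # Close the current window once it has run its full length.
        if self.window_start is not None and now - self.window_start >= self.window_seconds:
            self.window_start = None
            self.seen.clear()
        # The first usage after a closed window opens the next one.
        if self.window_start is None:
            self.window_start = now
        if token_id in self.seen:
            return False  # duplicate within the open window: suppress
        self.seen.add(token_id)
        return True
```

Under this reading, an idle period does not consume any of the next window:
the 5 seconds start counting only when a token is actually used again.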
* client_config_tester: use a generic mechanism to set specific network options
* trace_initialize_on_setup option to initialize client traces on network setup without local IP address
* trace_initialize_on_setup: Addressing review comments
* Restore correct formatting
* trace_initialize_on_setup: Update go bindings
* Include PID for identification into trace file names by default
* Use the same naming pattern for trace files in all configurations
* Empty commit
In the HA configuration, it's possible that 2 out of 3 machines in the remote
DC were killed, leaving too few machines for a successful recovery. So this PR
switches to Reboot to avoid such excessive killing.