* Add networkoption to disable non-TLS connections
* add disable plaintext connection to fdbserver
* python doc
* Formatting
* Add tls disable plaintext connection to client api test
* review
* fix negative test
* formatting
* add TLS support to c client config tests
Adds support for TLS in the client and server separately
* add tests for disable_plaintext_connections
Test TLS and Plaintext Clusters and Clients
* Fix documentation
* Rename option to indicate it is client-only
* clearer formatting
* default to allowing plaintext connections
* add SetTLSDisablePlaintextConnection to go bindings
* clean up old audit metadata
* change comments
* fix audit cleanup rule as PR description claim and reduce timeout of auditStorageCorrectness in tester
* address comment
* clear audit metadata should not throw error
* cleanup progress metadata by type
* control number of AuditStatistic events
* carefully persist new audit state
* add unit tests and fix issues
* cleanup
* allow audit concurrent run for different types and fix some bug in auditutl
* fix ci issue and nits
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`
* Fixed print format.
* Addressed comments.
* fixing bugs with tenant_mode required on external clients and changing test to find them
* Update fdbcli/BlobKeyCommand.actor.cpp
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
---------
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
* buggified max_shards_on_large_teams, and had the consistency scan verify the proper number of shards have been overreplicated
* fix: when restarting the data distributor, do no allow more than max_shards_on_large_teams shards to be marked as healthy
* cleanup audit progress metadata and tester directly issue audit requests to DD instead of CC
* address comments and fix test dd issue request but dd not present
* Implemented AuditUtils.actor.cpp
Moved AuditUtils to fdbserver/
* Persist AuditStorageState.
* Passed persisted AuditStorageState test.
* Added audit_storage_error to indicate a corruption is caught.
Throw/Send audit_storage_error when there is a data corruption.
Added doAuditStorage() for resuming Audit.
* Load and resume AuditStorage when DD restarts.
* Generate audit id monotonically.
* Fixed minor issue AuditId/Type was not set.
* Adding getLatestAuditStates.
* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.
* Added `audit_storage` fdbcli command.
* fmt.
* Fixed null shared_ptr issue.
* Improve audit data.
* Change DDAuditFailed to SevWarn.
* Sev.
* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.
* Moved AuditUtils* to fdbclient/.
* Added getAuditStatus fdbcli command.
* Refactor audit storage fdb cli commands.
* Added auditStorage in sim.
* Cleanup.
* Resolved comments.
* Resolved comments.
* Added SystemData for metadata audit.
Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.
* Improvements.
* Persisted Failed state after too many retries.
* Added retryCount for resumeAuditStorage().
* resolving conflict.
* Resolved conflicts.
* allow-merged-to-run
* add timeout to audit client
* fmt
* validate replica
* add audit serverKey
* address comments and fmt
* fix audit_storage_exceeded_request_limit
* fix segfault in getLatestAuditStatesImpl
* fix bugs
* remove timeout from workload
* fix bugs
* audit local view of shard assignment
* fmt
* fix-stuck-issue-and-make-dd-audit-storage-self-retry
* fix timeout
* fix timeout
* fix bugs and cleanup
* fix nit
* change name state to coreState for audit metadata
* address comments
* code clean
* fmt
* setup debug
* cleanup
* clean up
* code cleanup
* code clean
* remove tmp file
* fmt
* trace portion of shards that of anonymous physical shard
* remove unnecessary actor cleanup
* do not give up when tr is too old
* address commits
* refactor
* clean
* fmt
* fix-command-help-text
* fix-auditstate-restore-and-enable-restore-to-metadata-audit
* address comments
* fmrt
* debug and improve efficient of resume audit
* small change
* fix audit cli
* bypass completed audit when dd restart
* fix auditStorageCommandActor
* make mismatch key range more visable
* address comments
* make local shard metadata check can make progress by retries
* address comments
* address comments
* partition location metadata validation by range and server
* unset MIN_TRACE_SEVERITY
* address comments and SS auto proceed until failed then notify dd
* persistNewAuditState should checkMoveKeysLock
* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure
* code cleanup
* fix error message in metadata validation
* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation
* add comments to code and add guard to make sure the SS audit does not proceeds automatically for many times without being notified by DD --- to support audit cancellation later
* fix coalesceRangeList
* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo
* simplify shard assignment history
* shardAssignmentRecordRequests should be unorder_map
* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS
* only run validate shard info once at a time, other audit type does not have this limitation
---------
Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>
* fix metacluster get segfault
* update fdbcli tenant list function to take tenant group filter, support JSON, and report tenant IDs
* code review changes
* code formatting
* additional code review changes
* account for empty tenant groups
* reformat error catching in fdbcli command
* refactor json output and address code review comments
* add back mistakenly removed hint
* keep hints after 4th token
* add to tenant management workload
* fix compile error
* fix test range
* add more asserts to metacluster case
* nest test condition inside if block
* adjust tenant test layout
* refactor some test files
* reorganize test workload logic
Description
Patch removes an unused CODE_PROBE checking the encryption header
being read flag version is valid, given the flag-version is determined
by peeking into std::variant index and we only have version-1 supported,
for now converted the check to an ASSERT
Testing
EncryptionUnitTests.toml
EncryptionOps.toml
BlobGranuleCorrectness/Clean.toml
In clearData() used by simulation test to clean up, we re-use the same debugID for multiple transactions,
making it less clear when grepping for the debugID. This PR assigns a new debugID for each new transaction.
Some of the commented-out tracing code are already obsolete, I updated several of them, but am not sure if
the tracing should be enabled when enabling debug transaction.
Finally, use different but similar with the same prefix error messages for different call stacks.
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
Fix `RangeResult.readThrough` misuses:
1. KeyValueStores do not need to set readThrough, as it will not be
serialized and return. Also setting it to the last key of the result
is not right, it should at least be the keyAfter of the last key;
2. Fix NativeAPI doesn't set `RangeResult.more` in a few places;
3. Avoid `tryGetRange()` setting `readThrough` when `more` is false,
which was a workaround for the above item 2;
4. `tryGetRangeFromBlob()` doesn't set `more` but set `readThrough` to
indicate it is end, which was following the same above workaround I
think. Fixed that.
5. `getRangeStream()` is going to set `more` to true and then let the
`readThrough` be it's boundary.
Also added readThrough getter/setter function to validate it's usage.
* EaR: Implement Key Check Value semantics
Description
Key Check Value (KCV) is a checksum of cryptographic encryption key
used to validate encryption keys's integrity. FDB Encryption at-rest
relies on external KMS to supply encryption keys.
Patch proposes following major changes:
1. Implement Sha256 based KCV implementation to protect against
'baseCipher' corruption in two possible scenarios:
a) potential corruption external to FDB
b) potential corruption within FDB processes.
2. Scheme persists computed KCV token in block encryption header,
which then gets validated as part of header validation during
decryption.
3. FDB Encryption key derivation uses HMAC_SHA256 digest generation
scheme, which allows max 64 bytes of 'cipher buffer', patch add
required check to ensure 'baseCipher' length are within bounds.
OpenSSL HMAC underlying call ignores extra length if supplied, however,
it weakens the security guarantees, hence, disallowed.
Testing
devRunCorrectness - multiple 500K runs
Valgrind & Asan - BlobCipherUnit, RESTKMSUnit, BlobGranuleCorrectness*,
EncryptionOps, EncryptKeyProxyTest
* EaR: Fix BlobCipher cache handling for cipher needs refresh and/or expired
Description
Patch proposes BlobCipher cache bug related to handling of cipherKeys
that either 'needsRefresh' and/or 'expired'
Also, adds a unit-test to cover the following usecase:
1. Test refreshAt and expireAt properties of the cipherKey
2. Validate corresponding Counter value increments
Testing
Extend /blobCipher unitest tests
Description
EKP failure detection time is choosen to be fairly low given EKP
availability could impact cluster availability. However, in
simulation runs, for various reasons such a low-latency failure
detection could cause EKP recruitment to be done in a loop.
Patch relaxes the failure detection time for simulation runs
Testing
Description
RESTKms validation tokens are provided using foundationdb.conf, the file
uses '#' as comment start character. Update the token-name seperator to
'$' instead of '#'
Testing
RESTKmsConnectorUnit.toml
* Api Tester: Specify knobs in the toml file; Test loop profiler
* Gracefully stop the loop profiler thread
* Protect loop profiler thread by mutex
* Create loop profiler thread only if is not stopped
In the pathological case that both key size and value size are 0,
the probability of choosing that key-value pair is 0, and we divide
by zero when computing the sampledSize.
This change adds documentation to that function, which was quite
difficult to understand. In addition, we add `probability` to the
returned values, since one of the callers was backing it out from
sampledSize and itself in danger of dividing by zero.
* add a field to show average data movement bytes in MovingData trace
* change variable type
* Make changes to variable/function naming and add more comments
* move rolling window struct to a new file; deal with corner case: dd startup, elements are full
* format
* simplify codes
* modify file/struct name, universal to moving window
* fix typo
* add a new Knob to control MovingWindow::Deque size
* make the general use of dequeSize limit
* format
* format, use space rather than tab
* Add processID, networkAddress, and locality to layer status JSON for Backup Agents.
* Backup/dr agent determines network address to report in Layer Status only once, when the status updater loop begins, since it is a blocking call which connects to the cluster. And lots of code cleanup.
Description
Patch proposes two changes:
1. Avoid appending tls as part of URI for secure connections
2. RefreshEKs recurring task can be skipped if there are no keys to be refreshed
Testing
EncryptionOps.toml
EncryptKeyProxyTest.toml
devRunCorrectness
devRunCorrectnessFiltered 'Encrypt*'
* EaR: REST kms misc fixes
Description
Patch addresses following issues:
1. Fix "return connection" routine, it fixes a regression introduced by
an earlier fix.
2. Update RESTConnectionPool::connectionPoolMap to an "unordered_map"
for O(1) lookups
3. Improve logging
4. Make RESTUrl parsing handle extra '/' for 'resource'
Testing
Standalone fdbserver connecting to external KMS and database create
Description
Knob 'REST_KMS_ALLOW_NOT_SECURE_CONNECTION' got renamed in recent
patch, however, there are other places that needs an update too.
Testing
devRunCorrectness - 100K
RESTUtilUnits.toml
RESTKmsConnectorUnits.toml
* EaR: REST KMS fixes - encryption integration testing
Description
Major changes:
1. Multiple fixes observed while performing integration end-to-end
testing for Encryption at-rest feature.
2. Improve REST module logging. Introduced FLOW_KNOBS->REST_LOG_LEVEL
to have more granular control of feature logging disconnected from
the cluster log level.
Testing
Integration testbed:
1. Run fdbserver standalone
2. Run external KMS http-server to serve encryption key fetch requests
* EaR: RESTClient HTTP compliance, fix json request content type
Description
diff-1: Address review comments
RESTClient is responsible to handle FDB <-> KMS communication
for Encryption and other usecases. By design, it only supports
"secure connection" i.e. "https"; however, it seems there is a
need to expand the module to support "http" connection,
for instance: test and dev deployments for instance.
However, given RESTClient gets involved in handling high
sensitive contents such as: plaintext "encryption cipher
from a KMS", the feature is guarded using
CLIENT_KNOB->REST_KMS_ENABLE_NOT_SECURE_CONNECTION which is
settable using FDBServer command line argument
"--kms-rest-enable_not_secure_connection" (boolean)
Testing
Deployed a standalone fdbserver and communicate with a
simple "http" server