* tuple: add support for usertype as last part of tuple
The expectation is that user-defined types come at the end, as there is
no delimiter standard. If we encounter a user-defined type, we append it
to the end of the tuple and allow users to transform that type by
parsing the raw string returned.
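A minimal usage sketch, assuming unpackUserType() behaves like unpack() except that it tolerates a trailing user-defined element and surfaces its raw encoded bytes; the accessor shown is illustrative, not the actual API:
```cpp
#include "fdbclient/Tuple.h"

// Sketch only: decode a key whose tuple encoding ends in a user-defined type.
void readKeyWithUserType(StringRef packedKey) {
    Tuple t = Tuple::unpackUserType(packedKey);
    // Standard elements decode as usual.
    int64_t id = t.getInt(0);
    // The user-defined element was appended as the raw encoded string;
    // parsing it is left entirely to the application.
    Standalone<StringRef> raw = t.getString(t.size() - 1);
    (void)id;
    (void)raw;
}
```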
* blob: use Tuple::unpackUserType() for key alignment
We want blob granule key alignment to be able to understand user data
types. This upgrades support so that a user data type can be the last
part of a blob granule key and still be parsed properly.
The diagnostic is:
fdbserver/workloads/DiskFailureInjection.actor.cpp:57:64: runtime error: load of value 208, which is not a valid value for type 'bool'
The fix is simply to initialize verificationMode to a default value.
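A minimal sketch of the fix, assuming false is the intended default:
```cpp
// Sketch only: give the flag a definite default so the workload never
// reads an indeterminate bool (the "load of value 208" UBSan reports).
struct DiskFailureInjectionWorkload {
    bool verificationMode = false; // previously left uninitialized
};
```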
* Recruit new singleton for consistency checker.
* Recruit the consistency checker only if enabled.
* Add a yield in monitorConsistencyChecker().
* Minor fixes.
* Consistency check workload enhancements.
* Minor fixes and clarifications.
* clang format
* Clang format.
* Minor fixes, cleanup, debug tracing.
* Misc.
* Move the consistency scan information from dbconfig to a key backed object.
* Move consistency scan config out of db config to a state object and feature rename.
* ConsistencyCheck workload refactor.
* devFormat
* Update fdbcli/ConsistencyScanCommand.actor.cpp
* Review Comments.
Co-authored-by: negoyal <neelam.goyal@gmail.com>
Co-authored-by: Ata E Husain Bohra <ata.husain@snowflake.com>
* Introduce "default encryption domain"
Description
In the current FDB native encryption data at-rest implementation,
an entity getting encrypted (mutation, KV and/or file) is categorized
into one of the following encryption domains:
1. Tenant domain, where Encryption domain == Tenant boundaries
2. FDB system keyspace - FDB metadata encryption domain
3. FDB Encryption Header domain - used to generate a digest for the
plaintext EncryptionHeader.
The scheme doesn't support encryption if an entity can't be categorized
into any of the above encryption domains; for instance, non-tenant
mutations are NOT supported.
This patch extends encryption support to mutations for which the
corresponding tenant information can't be obtained (key length shorter
than the TenantPrefix) and/or mutations that do not belong to any valid
tenant (FDB management cluster data), by mapping such mutations to a
"default encryption domain".
TODO
The CommitProxy-driven TLog encryption implementation requires every
transaction mutation to contain one KV, not crossing tenant boundaries.
The only exception to this rule is ClearRange mutations. For now,
ClearRange mutations are mapped to the 'default encryption domain';
appropriate handling for ClearRange mutations shall be proposed in a
subsequent patch.
Testing
devRunCorrectness - 100k
Configuration database data lives on the coordinators. When a change
coordinators command is issued, the data must be sent to the new
coordinators to keep the database consistent.
Adding the following metrics:
* BlobCipherKeyCache hit/miss
* EKP: KMS requests latencies
* Each component that uses encryption now needs to pass a UsageType enum to the encryption helper methods (GetEncryptCipherKeys/GetLatestEncryptCipherKey/encrypt/decrypt); those methods then log get-cipher-key latency samples and encryption/decryption CPU times accordingly.
* Remove API 720 guards for tenants (experimental feature) and the cluster ID special keys (no need to guard)
* Enable the relaxed special key access in transactions that need to use special key-space APIs introduced in 7.2
Valgrind is complaining about use of uninitialized memory in
absl::GetStackTrace, and in the cases where it complains, the backtraces
are incomplete. Note: this means that jemalloc heap profiling no longer
works out of the box. Advanced users who want to enable jemalloc heap
profiling will now have to revert this change and build from source.
* flow: add ApiVersion to replace hard coding api version
Instead of hard-coding the API value, rely on feature versions, akin to
ProtocolVersion.
* ApiVersion: remove use of -1 for latest and use LATEST_VERSION
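A sketch of the pattern, modeled on ProtocolVersion; the version number and feature name below are illustrative, not the actual contents of the new header:
```cpp
#include <cstdint>

// Illustration only: a strongly typed API version with named feature
// predicates, replacing scattered numeric literals and the -1 sentinel.
class ApiVersion {
    uint64_t _version;

public:
    static constexpr uint64_t LATEST_VERSION = 720; // value illustrative
    constexpr explicit ApiVersion(uint64_t v) : _version(v) {}
    constexpr uint64_t version() const { return _version; }
    // Hypothetical feature check; call sites express intent, not numbers.
    constexpr bool hasTenantsV2() const { return _version >= 720; }
};
```
Call sites then read as if (apiVersion.hasTenantsV2()) rather than if (version >= 720), and LATEST_VERSION replaces the -1 sentinel for "latest".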
A new knob `ENABLE_STORAGE_SERVER_ENCRYPTION` is added; despite its name, it is currently supported only by Redwood. The knob is meant to be used only in tests, to exercise encryption in individual components; otherwise, enabling encryption should be done through the general `ENABLE_ENCRYPTION` knob.
Under the hood, a new `Encryption` encoding type is added to `IPager`, which uses AES-256 to encrypt a page. With this encoding, a `BlobCipherEncryptHeader` is inserted into the page header to carry encryption metadata. Moreover, since we compute and store an SHA-256 auth token with the encryption header, we rely on it to checksum the data (and the encryption header), and skip the standard xxHash checksum.
`EncryptionKeyProvider` implements the `IEncryptionKeyProvider` interface to provide encryption keys, utilizing the existing `getLatestEncryptCipherKey` and `getEncryptCipherKey` actors to fetch keys from either the local cache or the EKP server. If multi-tenancy is in use, then when writing a new page, `EncryptionKeyProvider` checks whether the page contains data for only a single tenant; if so, it fetches the tenant-specific encryption key, and otherwise the system encryption key is used. The tenant check is done by extracting the tenant ID from the page's bound key prefixes. `EncryptionKeyProvider` also holds a reference to the `tenantPrefixIndex` map maintained by the storage server, which is used to check whether a tenant exists and to get the tenant name needed to fetch the encryption key.
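A hedged sketch of that tenant check; the helper names are assumptions, not the actual `EncryptionKeyProvider` methods:
```cpp
// Illustration only: a tenant-specific cipher key is used just when both
// page-bound key prefixes resolve to the same, existing tenant; any other
// page falls back to the system encryption key.
Future<EncryptionKey> keyForNewPage(KeyRef lowerBound, KeyRef upperBound) {
    Optional<int64_t> lo = tenantIdFromPrefix(lowerBound); // assumed helper
    Optional<int64_t> hi = tenantIdFromPrefix(upperBound); // assumed helper
    if (lo.present() && lo == hi && tenantExists(lo.get())) { // via tenantPrefixIndex
        return latestTenantCipherKey(lo.get());
    }
    return latestSystemCipherKey();
}
```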
* Implement TenantCacheEntry in-memory cache
Description
diff-4: TraceEvent usage improvements
diff-3: Address review comments
diff-2: Add APIs to read counter values, test improvements
diff-1: Address review comments
Major changes include:
1. Implements an actor that enables in-memory caching of
TenantCacheEntry objects, allowing the caller to embed custom
information along with the TenantCacheEntry.
2. The cache follows read-through semantics, where an entry
gets loaded from the underlying database on a miss.
3. The cache implements a "periodic poller" to refresh known tenants
by consulting the database. Once a database keyrange-watch feature is
available, the cache shall be updated to use it.
Bonus:
Implement a 'recurringAsync' addition to genericActors, allowing the
caller to schedule a periodic task by registering an "actor functor";
unlike the existing 'recurring' implementation, the routine 'waits' for
the actor, as in the sketch below.
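A minimal Flow-style sketch of the idea (the real signature in genericactors may differ):
```cpp
// Sketch only: unlike recurring(), which fires the functor and immediately
// re-arms the timer, this waits for the returned Future to finish before
// scheduling the next run, so slow iterations never overlap.
ACTOR template <class F>
Future<Void> recurringAsyncSketch(F what, double interval) {
    loop {
        wait(delay(interval));
        wait(success(what())); // 'what' is an actor functor returning a Future
    }
}
```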
Testing
TenantEntryCache workload
devCorrectnessRun - 100K
* Adding BlobRange test
* refactored blobrange test to use new db functions, and fixed several bugs
* bug fixes for blob manager and verifyBlobRange
* More range unaligned fixes
* cleaning up test and disabling tests that don't work yet for now
* removing overzealous assert in blob manager
* more fixes for overzealous assert
* cleanup and renaming test
* adding chaos to blob ranges test
* Added purgeAtLatest to BlobGranuleVerifier
* Also checking for merge resnapshot/fully delete granule purge races
* error check and count for purgeAtLatest
* changing test defaults back
* adding final data check after final availability check in blob granule verifier
* Fixing merge boundary recovery
* fixing an edge case in blob manager repeat recruitment
* fixing a race between tenant loading and key alignment
* formatting
* doing force purge at the end of the test if it didn't happen during
* better checking for granule metadata left over after purge
* retrying for in-flight granules that weren't known before purge
* letting test pass for a hard to fix purge case
* changed post-test force purge to only be after final availability check
* Made force purge more robust to split+merge races
* fix flush/force purge change feed leak
* Better fix for assign while force purge races
* more merge/force purge races
* cleanup
Description
FDB native encryption data at-rest supports two types of cipher-key
in-memory caching:
1. Revocable keys - with a definite expiry (future timestamp)
2. Non-revocable keys - with or without an expiry timestamp and/or a
refreshAt timestamp.
This patch updates the BlobCipherKey in-memory cache to respect the
EKP/KMS-supplied 'refreshAt' and 'expireAt' timestamps. GetLatestCipher
validates 'cipher key freshness', and GetCipherKey checks for
'cipher key liveness', before replying details to the caller.
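A hedged sketch of the two checks, with a negative timestamp standing in for "no refresh/expiry" on non-revocable keys (the representation is an assumption):
```cpp
#include <cstdint>

// Illustration only: GetLatestCipher enforces freshness (refreshAt not yet
// reached); GetCipherKey enforces liveness (expireAt not yet reached).
bool isCipherKeyFresh(int64_t refreshAtTS, int64_t now) {
    return refreshAtTS < 0 || now < refreshAtTS;
}
bool isCipherKeyLive(int64_t expireAtTS, int64_t now) {
    return expireAtTS < 0 || now < expireAtTS;
}
```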
The patch also optimizes BlobCipher module logging by taking the
following measures:
1. A BLOB_CIPHER_DEBUG macro to guard spammy log messages needed
mostly for debugging failures.
2. Minimize log volume by logging cipherKey details only for a new
key added to the cache; key refreshes are not logged.
3. Categorize logs into debug, info, and warn on a per-usecase basis.
Testing
devRunCorrectness - 100K
EncryptOps.toml - 100K
* Granule purge cannot delete history entry for fully deleting granule until all children are completely done splitting
* Several purging fixes related to granule history
* Fixed typo in refactor
* fixing memory model for purgeRange
* formatting
* weakening granule purge test for now
* cleanup
* First version of force purging granules
* fixing issue in BW range assignment reporting
* Fixing incorrect assert with force purging
* Error handling when checking force purged state
* fixed force purging and recover/reassign range races and check
* Handling force purge + boundary change race
* more places to check for force purged status
* fixed manager restart in the middle of force purge bug
* fixing same-BM purge and assignment races in all cases
* weakening orphaned granule history check a bit because of difficult to solve races
* fixing txn options on retry
* loading force purged ranges at start to avoid resuming a merge that is being force purged
* cleanup
* Enabling purging in granule tests, and adding check for leaked change feeds in force purge
* formatting
* missed parameter in merge conflicts
* Fixing leaked change feed race with merge and force purge
* adding change feed cleanup when new blob manager recovers in-progress merge that raced with force purge
* added forcepurge fdbcli command
* Enable configuring the next future protocol version as the current protocol version in FDB client, fdbserver, and fdbcli
* Auto format python files used in upgrade tests
* Add a test for upgrading to a future FDB version
* Emphasize that the options for using future protocol version are intended for test purposes only
* Make the global variable for current protocol version visible only locally
* Refactoring to avoid using currentProtocolVersion() in static initialization
* Update go bindings
* better check for granule-ification
* Handling blob granule initial split too large
* Re-evaluating split size if too large, even if read doesn't get transaction_too_old
* reworked to have blob worker propose split key
* New GranuleStatusReply to avoid seqno issue stream side effects
* Handling retries on reevaluateInitialSplit properly
* Waiting for stream to be initialized
* Checking reevaluate split for additional split points beyond proposed
* Fixing more races in reevaluate initial split
* properly handling cleaning up old change feed after split re-evaluate
* fixing granule conversion bug with hard boundaries
* fixing clear and merge check race with cycle test
* refactor missed knob check for clearAndMerge
* Fixing formatting
* review comments and improving large range conversion
* fixing typo
* more formatting
* Using knownBlobRanges for blob granule ranges whether tenants are enabled or not
* Effectively disabled blob granule tests when tenants enabled to fix ctest
* review comments
* blob: read TenantMap during recovery
Future functionality in the blob subsystem will rely on the tenant data
being loaded. This fixes the issue by loading the tenant data before
completing recovery, such that continued actions on existing blob
granules will have access to the tenant data.
Example scenario with failover, where splits are restarted before
loading the tenant data:

BM - BlobManager

epoch 3:                        epoch 4:
BM records intent to split.
Epoch fails.
                                BM recovery begins.
BM fails to persist split.
                                BM recovery finishes.
                                BM.checkBlobWorkerList()
                                  maybeSplitRange().
                                BM.monitorClientRanges()
                                  loads tenant data.
bin/fdbserver -r simulation -f tests/slow/BlobGranuleCorrectness.toml \
-s 223570924 -b on --crash --trace_format json
* blob: add tuple key truncation for blob granule alignment
FDB has a backup system available using the blob manager and blob
granule subsystem. If we want to audit the data in the blobs, it's a lot
easier if we can align them to something meaningful.
When a blob granule is being split, we ask the storage metrics system
for split points as it holds approximate data distribution metrics.
These keys are then processed to determine if they are a tuple and
should be truncated according to the new knob,
BG_KEY_TUPLE_TRUNCATE_OFFSET.
Here we keep all aligned keys together in the same granule even if it is
larger than the allowed granule size. The following commit will address
this by adding merge boundaries.
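A hedged sketch of the truncation step; reading BG_KEY_TUPLE_TRUNCATE_OFFSET as "number of trailing tuple elements to drop" is an assumption for illustration:
```cpp
// Illustration only: align a storage-proposed split key to a tuple prefix.
Key alignSplitKey(KeyRef proposed, int truncateOffset) {
    try {
        Tuple t = Tuple::unpack(proposed);
        if ((int)t.size() > truncateOffset) {
            return t.subTuple(0, t.size() - truncateOffset).pack();
        }
    } catch (Error&) {
        // Not a tuple: keep the storage-suggested split key unchanged.
    }
    return Key(proposed);
}
```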
* blob: minor clean ups in merging code
1. Rename mergeNow -> seen. This is more in line with clock-sweep naming
and removes the confusion between mergeNow and canMergeNow.
2. Make clearMergeCandidate() reset to MergeCandidateCannotMerge to make
a clear distinction what we're accomplishing.
3. Rename canMergeNow() -> mergeEligble().
* blob: add explicit (hard) boundaries
Blob ranges can be specified either through explicit ranges or at the
tenant level. Right now this is managed implicitly. This commit aims to
make it a little more explicit.
Blobification begins in monitorClientRanges() which parses either the
explicit blob ranges or the tenant map. As we do this and add new
ranges, let's explicitly track what is a hard boundary and what isn't.
When blob merging occurs, we respect this boundary. When a hard boundary
is encountered, we submit the found eligible ranges and start looking
for a new range beginning with this hard boundary.
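A hedged sketch of merge evaluation honoring hard boundaries; the types and names are illustrative:
```cpp
#include <set>
#include <vector>

struct GranuleRange {
    Key begin, end;
};

// Illustration only: accumulate adjacent merge candidates, flushing the
// batch whenever a hard boundary begins so no merge ever spans one.
std::vector<std::vector<GranuleRange>> planMerges(const std::vector<GranuleRange>& candidates,
                                                  const std::set<Key>& hardBoundaries) {
    std::vector<std::vector<GranuleRange>> batches;
    std::vector<GranuleRange> current;
    for (const auto& g : candidates) {
        if (!current.empty() && hardBoundaries.count(g.begin)) {
            batches.push_back(std::move(current)); // submit eligible ranges
            current.clear(); // start fresh at the hard boundary
        }
        current.push_back(g);
    }
    if (!current.empty()) {
        batches.push_back(std::move(current));
    }
    return batches;
}
```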
* blob: create BlobGranuleSplitPoints struct
This is a setup for the following commit. Our goal here is to provide a
structure for split points to be passed around. The need is for us to be
able to carry uncommitted state until it is committed and we can apply
these mutations to the in-memory data structures.
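A sketch of the shape such a struct might take; the field is an assumption:
```cpp
#include <vector>

// Illustration only: bundles proposed split keys as uncommitted state that
// is applied to the in-memory structures only after the transaction commits.
struct BlobGranuleSplitPoints {
    std::vector<Key> keys; // proposed granule boundaries, in order
    // The following commit extends this with soft merge-boundary state.
};
```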
* blob: implement soft boundaries
An earlier commit establishes the need to create data boundaries within
a tenant. In reality, we may encounter a set of keys that degenerate
to the same key prefix. We'll need to be able to split those across
granules, but we want to ensure we merge the split granules back
together before merging with other granules.
This adds new BlobGranuleMergeBoundary items to the BlobGranuleSplitPoints
state. BlobGranuleMergeBoundary contains state saying whether it is a
left or right boundary. This information is used, like hard boundaries,
to force merging of like granules first.
We read the BlobGranuleMergeBoundary map into memory at recovery.
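A hedged sketch of the boundary record; the field name is an assumption:
```cpp
// Illustration only: a soft merge boundary records which edge of a
// degenerate-prefix run it marks, so split granules are merged back
// together first, before merging with unrelated neighbors.
struct BlobGranuleMergeBoundary {
    bool isLeftEdge; // true: left boundary of the run; false: right boundary
};
```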