The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
Processes that fail this validation enter a "zombie" state, in which
they cancel all their actors and then wait forever, refusing to do any
additional work until the operator handles them manually.
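The join check described above can be sketched roughly as follows. This is a minimal illustration, not FoundationDB's actual code: `JoinDecision` and `checkClusterId` are hypothetical names, and the assumption that a process with no persisted ID simply stores the one it receives is inferred from the description rather than taken from the source.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical outcome of the join check; not a FoundationDB type.
enum class JoinDecision { Accept, PersistAndAccept, Zombie };

// Compare the cluster ID a process has persisted locally (if any) with the
// ID sent by the cluster controller.
inline JoinDecision checkClusterId(const std::optional<std::string>& persistedId,
                                   const std::string& controllerId) {
    // Assumption: the controller has read its ID before it contacts anyone.
    assert(!controllerId.empty());
    if (!persistedId.has_value()) {
        // Nothing stored yet (assumed case): persist the ID and join.
        return JoinDecision::PersistAndAccept;
    }
    if (*persistedId == controllerId) {
        return JoinDecision::Accept;
    }
    // Mismatch: the process belongs to another cluster. It should cancel its
    // actors and wait for the operator instead of doing any work.
    return JoinDecision::Zombie;
}
```

A caller would act on `Zombie` by stopping all work and never retrying on its own, which matches the "wait forever" behavior described above.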
Changes:
1. Change `isEncryptionOpSupported` to check the `ENABLE_ENCRYPTION` server knob instead of `clientDBInfo.isEncryptionEnabled`. The problem with `clientDBInfo` is that its content is uninitialized until it has been broadcast to the workers, and during that window some data (e.g. item 2) is not encrypted when it should be.
2. Fix CommitProxy not encrypting metadata mutations which are recovered from txnStateStore
3. Fix KeyValueStoreMemory (and thus TxnStateStore) not encrypting partial transactions coming from recovery
4. Add new CODE_PROBEs for the above fixes
5. Logging changes
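To illustrate why item 1 matters, here is a small sketch of the startup window. The struct and function names are hypothetical stand-ins, not the real `ClientDBInfo`, server knobs, or the real signature of `isEncryptionOpSupported`; the point is only that a value known at process start is safe to consult before any broadcast arrives.

```cpp
#include <cassert>

// Hypothetical stand-in: the flag is meaningful only after the first broadcast.
struct ClientDBInfoSketch {
    bool isEncryptionEnabled = false;
};

// Hypothetical stand-in: knobs are fixed when the process starts.
struct ServerKnobsSketch {
    bool ENABLE_ENCRYPTION = false;
};

// Old check: reads a value that is still at its default before the broadcast.
inline bool isEncryptionOpSupportedOld(const ClientDBInfoSketch& info) {
    return info.isEncryptionEnabled;
}

// New check: reads the knob, which is valid from process start.
inline bool isEncryptionOpSupportedNew(const ServerKnobsSketch& knobs) {
    return knobs.ENABLE_ENCRYPTION;
}

// Usage: encryption is configured on, but nothing has been broadcast yet.
inline void startupWindowExample() {
    ServerKnobsSketch knobs;
    knobs.ENABLE_ENCRYPTION = true;
    ClientDBInfoSketch notYetBroadcast;
    assert(!isEncryptionOpSupportedOld(notYetBroadcast)); // data would go unencrypted
    assert(isEncryptionOpSupportedNew(knobs));            // data is encrypted
}
```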
- setknob <knob_name> <knob_value> [config_class]
- getknob <knob_name> [config_class]
- Added a new option to `begin` to specify whether it is a configuration transaction. Syntax: `begin [config-txn]`
- Added utility function for converting tuples to string
- Added knobmanagement test in fdbcli_tests.py
* Recruit new singleton for consistency checker.
* Recruit the consistency checker only if enabled.
* Add a yield in monitorConsistencyChecker().
* Minor fixes.
* Consistency check workload enhancements.
* Minor fixes and clarifications.
* clang format
* Clang format.
* Minor fixes, cleanup, debug tracing.
* Misc.
* Move the consistency scan information from dbconfig to a key backed object.
* Move consistency scan config out of db config to a state object and feature rename.
* ConsistencyCheck workload refactor.
* devFormat
* Update fdbcli/ConsistencyScanCommand.actor.cpp
* Review Comments.
Co-authored-by: negoyal <neelam.goyal@gmail.com>
Co-authored-by: Ata E Husain Bohra <ata.husain@snowflake.com>
The `--no-config-db` flag, passed to `fdbserver`, will disable the
configuration database. When this flag is specified, no `ConfigNode`s
will be started, the `ConfigBroadcaster` will not be started, and on a
coordinator change no attempt will be made to lock `ConfigNode`s.
Configuration database data lives on the coordinators. When a change
coordinators command is issued, the data must be sent to the new
coordinators to keep the database consistent.
* flow: add ApiVersion to replace hard coding api version
Instead of hard coding api value, let's rely on feature versions akin to
ProtocolVersion.
* ApiVersion: remove use of -1 for latest and use LATEST_VERSION
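A rough sketch of the idea behind these two commits: wrap the raw integer in a small type that exposes named feature checks and a named constant for "latest" instead of the magic value -1. The class name, the feature name, and both version numbers below are placeholders invented for illustration; they are not the values or the interface used in flow.

```cpp
#include <cassert>

class ApiVersionSketch {
public:
    // Placeholder numbers, chosen only for this example.
    static constexpr int LATEST_VERSION = 100;
    static constexpr int EXAMPLE_FEATURE_SINCE = 50;

    explicit ApiVersionSketch(int version) : version_(version) {}

    // Replaces passing -1 to mean "latest".
    static ApiVersionSketch latest() { return ApiVersionSketch(LATEST_VERSION); }

    int version() const { return version_; }
    bool isValid() const { return version_ > 0 && version_ <= LATEST_VERSION; }

    // Callers ask about a feature by name instead of comparing raw numbers.
    bool hasExampleFeature() const { return version_ >= EXAMPLE_FEATURE_SINCE; }

private:
    int version_;
};

// Usage: call sites no longer hard-code the version at which a feature appears.
inline void apiVersionExample() {
    ApiVersionSketch requested(40);
    assert(requested.isValid());
    assert(!requested.hasExampleFeature());
    assert(ApiVersionSketch::latest().hasExampleFeature());
}
```

Keeping the threshold next to the feature name means a version bump touches one definition rather than every comparison scattered through the code.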
A new knob `ENABLE_STORAGE_SERVER_ENCRYPTION` is added; despite its name, it is currently supported only by Redwood. The knob is meant to be used only in tests, to exercise encryption in individual components; otherwise encryption should be enabled through the general `ENABLE_ENCRYPTION` knob.
Under the hood, a new `Encryption` encoding type is added to `IPager`, which uses AES-256 to encrypt a page. With this encoding, a `BlobCipherEncryptHeader` is inserted into the page header to hold encryption metadata. Moreover, since we compute and store a SHA-256 auth token with the encryption header, we rely on it to checksum the data (and the encryption header) and skip the standard xxhash checksum.
`EncryptionKeyProvider` implements the `IEncryptionKeyProvider` interface to provide encryption keys. It uses the existing `getLatestEncryptCipherKey` and `getEncryptCipherKey` actors to fetch encryption keys from either the local cache or the EKP server. If multi-tenancy is used, when writing a new page `EncryptionKeyProvider` checks whether the page contains data for only a single tenant; if so, it fetches the tenant-specific encryption key, otherwise the system encryption key is used. The tenant check is done by extracting the tenant id from the key prefixes of the page bounds. `EncryptionKeyProvider` also holds a reference to the `tenantPrefixIndex` map maintained by the storage server, which is used to check whether a tenant exists and to get the tenant name needed to fetch the encryption key.
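The tenant check can be sketched as below. This is an illustration under stated assumptions, not the actual `EncryptionKeyProvider` logic: it assumes tenant keys begin with an 8-byte big-endian tenant id, it models `tenantPrefixIndex` as a plain set of known ids, and the function names are hypothetical. An empty result stands for "use the system encryption key".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <string>

using TenantId = std::int64_t;

// Assumption: every tenant key starts with an 8-byte big-endian tenant id.
constexpr std::size_t kTenantPrefixSize = 8;

// Build a key with the assumed prefix layout (helper for the example only).
inline std::string makeTenantKey(TenantId id, const std::string& suffix) {
    assert(id >= 0);
    std::string key(kTenantPrefixSize, '\0');
    std::uint64_t raw = static_cast<std::uint64_t>(id);
    for (std::size_t i = 0; i < kTenantPrefixSize; ++i) {
        key[kTenantPrefixSize - 1 - i] = static_cast<char>(raw & 0xff);
        raw >>= 8;
    }
    return key + suffix;
}

// Decode the tenant id from a key, or return nothing if the key is too short.
inline std::optional<TenantId> tenantIdFromKey(const std::string& key) {
    if (key.size() < kTenantPrefixSize) return std::nullopt;
    std::uint64_t id = 0;
    for (std::size_t i = 0; i < kTenantPrefixSize; ++i) {
        id = (id << 8) | static_cast<unsigned char>(key[i]);
    }
    return static_cast<TenantId>(id);
}

// Returns the tenant whose key should encrypt the page, or nothing when the
// page may span tenants or the tenant is unknown (system key is used then).
inline std::optional<TenantId> encryptionDomainForPage(
    const std::string& lowerBound,
    const std::string& upperBound,
    const std::set<TenantId>& knownTenants) {
    const auto lo = tenantIdFromKey(lowerBound);
    const auto hi = tenantIdFromKey(upperBound);
    // Keys are ordered bytewise, so equal prefixes on both bounds imply that
    // every key in between carries the same prefix.
    if (lo && hi && *lo == *hi && knownTenants.count(*lo) > 0) return lo;
    return std::nullopt;
}
```

Comparing only the two bound keys avoids scanning the page contents, which is why the check can be made cheaply at write time.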
The localities are stored in ServerDBInfo for calculating distances to other
processes. The localities are not set when creating ServerDBInfo, thus any
distances calculated before UpdateServerDBInfoRequest will be wrong.
This PR fixes this issue, thus preventing unnecessary cross DC calls,
especially for index prefetching on the storage servers.
* Update network address in trace logs; Add system monitor for flowprocess
* Create a new trace file with the correct process address for flowprocess
* Remove unused debugging traces
* Add a new error lock_file_failure; Change please_reboot_remote_kv_store to please_reboot_kv_store; Add the code to only reboot the kv store but not the worker; Remove some unnecessary traces
* Add error handling for file_not_found in handleIOErrors
* Format worker.actor.cpp file
* Throttle the cluster if the blob manager cannot assign ranges
* fixed a number of different bugs which caused ratekeeper to throttle to zero because of blob worker lag
* fix: do not mark an assignment as blocked if it is cancelled
* remove asserts to merge bug fixes
* fix formatting
* restored old control flow to storage updater
* storage updater did not throw errors
* disable buggify to see if it fixes CI
* Fixing merge boundary recovery
* fixing an edge case in blob manager repeat recruitment
* fixing a race between tenant loading and key alignment
* formatting
* throttle the cluster when blob workers fall behind
* do not throttle on blob workers if they are not enabled
* remove an unnecessary actor
* fixed a compile error
* fetch blob worker metrics at the same interval as the rate is updated, avoid fetching the complete blob worker list too frequently
* fixed another compilation bug
* added a 5 second delay before bw throttling to prevent false positives caused by the 100e6 version jump during recovery. Lower the throttling thresholds to react much quicker to bw lag.
* fixed a number of problems
* changed the minBlobVersionRequest to look at storage server versions since this will be a lot more efficient
* fix: do not let desired go backwards
* fix: track the version of notAtLatest changefeeds for throttling
* ratekeeper now throttles blob workers by estimating the transactions-per-second throughput of the blob workers
* added metrics for blob worker change feeds
* added a knob to disable bw throttling
* fixed the transaction options in blob manager
* 'main' of github.com:sfc-gh-nwijetunga/foundationdb: (32 commits)
Store rocksdb::DBOptions and rocksdb::ColumnFamilyOptions to (#7766)
Update CONTRIBUTING.md
Update tests/rare/SpecificUnitTests.toml
fix ASAN OOM problem
Update CONTRIBUTING.md
Write tracing and ALP special key errors as JSON
Fix: the static tenant map in the Java tester was being accessed concurrently from multiple threads. Make it a concurrent map. (#7805)
Run clang-format
Print SIGNAL output to stdout
Print to stderr only upon errors
Testing upgrades to a future version of FDB (#7780)
Flush gcov coverage upon SIGTERM
Report the unit tests being run in test harness
Fix a bug in a storage wiggler unit test where some servers were added with too recent a timestamp
Fix undefined behavior in versioned btree test due to integer overflow
When a transaction operation gets an unknown tenant error, it needs to reset the tenant ID so it can be updated in the next tenant lookup request.
Don't buggify max tenants per cluster globally; instead buggify it in specific tests
Remove non-existing unittest
Add unit tests to the correctness package
Add comment to INetwork
...
* Enable configuring the next future protocol version as the current protocol version in FDB client, fdbserver, and fdbcli
* Auto format python files used in upgrade tests
* Add a test for upgrading to a future FDB version
* Emphasize that the options for using future protocol version are intended for test purposes only
* Make the global variable for current protocol version visible only locally
* Refactoring to avoid using currentProtocolVersion() in static initialization
* Update go bindings