Retrieving storage metadata for every status json request is very expensive
for clusters with a large number of storage servers. This change makes the
ClusterController actively monitor storage metadata for changes and only
retrieve it when there is a change.
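
The approach can be sketched roughly as follows. This is a minimal Python illustration of a change-driven cache, not the ClusterController code: `StorageMetadataCache`, `fetch_metadata`, and `notify_changed` are hypothetical names, and the real change monitoring is assumed to happen elsewhere.

```python
from typing import Callable, Dict, Optional


class StorageMetadataCache:
    """Caches storage metadata and refetches only after a change is signalled.

    Hypothetical sketch: these names are illustrative, not FoundationDB APIs.
    """

    def __init__(self, fetch_metadata: Callable[[], Dict[str, dict]]) -> None:
        self._fetch_metadata = fetch_metadata
        self._cached: Optional[Dict[str, dict]] = None
        self._dirty = True  # nothing cached yet, so the first read must fetch
        self.fetch_count = 0

    def notify_changed(self) -> None:
        # Called by whatever monitors the metadata for changes (assumed).
        self._dirty = True

    def get(self) -> Dict[str, dict]:
        # Status requests call this; the expensive fetch runs only when needed.
        if self._dirty:
            self._cached = self._fetch_metadata()
            self.fetch_count += 1
            self._dirty = False
        return self._cached


def demo() -> int:
    store = {"ss1": {"engine": "a"}}
    cache = StorageMetadataCache(lambda: dict(store))
    for _ in range(1000):       # many status requests, no metadata change
        cache.get()
    store["ss2"] = {"engine": "b"}
    cache.notify_changed()      # a storage server was added
    cache.get()
    return cache.fetch_count    # number of expensive fetches performed
```

In this sketch, a thousand status requests followed by one metadata change cost two fetches in total rather than one fetch per request.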
* Rename and simplify fetch time variables
* Add RefreshTime detail to TenantCacheGetStorageUsageRefreshSlow trace
* Stagger storage estimation requests
* Update the value of a knob in simulation to reduce flakiness
* Improve names of TenantCache and StorageQuota related traces. Add slow refresh time.
* Convert potentially spammy TenantCache traces to SevDebug
* fix: Non-storage processes were not being checked for locality exclusions
fix: Data distribution would not detect that a newly added process was locality excluded
fix: RemoveServerSafely did not wait for processes to be excluded before killing them when excluding localities
* fix: do not allow locality based excludes if they cannot exclude the required addresses
* Remove SS entries from RateKeeper once it is down
Before the change, certain data structures in RateKeeper would
not delete data associated with a deleted/cancelled SS. This
caused significant unnecessary CPU usage and degraded GRV proxy
performance. This change fixes it.
* remainingBudgetForAuditTasks should be managed within audit
* fix CI
* add audit storage test for various ranges
* clean DD
* new auditStorageUserDataQ
* fix assert fail in startTrackShardAssignment
* fix assert fail in ssaudit
* address comments
* replace assert with audit_cancel in ss audits
* add audit check progress tool
* add observability to audit progress and fix audit bugs
* fix audit progress issues; add a sim test, a trace event, and an fdbcli command to track audit progress
* remove old audit storage on SS
* check audit progress when auditCore completes
If tenant mode is REQUIRED, then we should verify that in the normal key space, no data exists outside tenants' prefixes. This applies to data clusters (also known as partition clusters) in a metacluster and standalone clusters with tenants.
For the management cluster of a metacluster, we should verify that no data exists outside the prefix ranges specified by tenant/ and metacluster/ in the normal key space.
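
The check described above can be sketched as a prefix test over the normal key space. This is a minimal Python illustration under stated assumptions, not the consistency-check code: `keys_outside_prefixes` is a hypothetical helper, keys are treated as an in-memory list, and the normal key space is taken to be every key below `\xff`.

```python
from typing import Iterable, List

# Assumption for this sketch: system keys start at \xff, so the normal key
# space is everything that sorts below it.
SYSTEM_KEYS_BEGIN = b"\xff"


def keys_outside_prefixes(keys: Iterable[bytes],
                          allowed_prefixes: Iterable[bytes]) -> List[bytes]:
    """Return normal-key-space keys that fall under none of the allowed prefixes.

    Hypothetical helper: an empty result means the check passes.
    """
    prefixes = tuple(allowed_prefixes)
    return [
        k for k in keys
        if k < SYSTEM_KEYS_BEGIN and not k.startswith(prefixes)
    ]
```

For a data cluster or a standalone cluster, `allowed_prefixes` would be the tenants' prefixes; for a management cluster it would be `b"tenant/"` and `b"metacluster/"`.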
Test plan:
devRunCorrectnessFiltered +Metacluster* +Tenant* --max-runs 100000
20230702-052847-yajin-082705d269588494. 0 Failure
devRunCorrectness --max-runs 100000
20230702-134219-yajin-e9cce7bd165e70a9. 1 Failure, unrelated to this change
Description
Configurable encryption has been checked in and tested via
simulation for more than a month. To also avoid the penalty of accessing
KNOBS in the inline commit path, this patch retires the KNOB and makes
ConfigurationEncryption the default EaR mode for FDB.
BlobCipher still supports the old format header and encryption semantics;
the dead code will be removed in a follow-up PR.
Testing
devRunCorrectness - 100K
Description
When the SimKmsVault unit test runs as part of the simulation Random test,
SimKmsVaultKeyCtx may, depending on the test order, already have been
initialized by some other test (FlowSingleton).
Update the test to handle this scenario.
Testing
devRunCorrectness - 100K
* EaR: Update KMS URL refresh policy and fix bugs
Description
RESTKmsConnector implements discovery and refresh semantics, i.e.
on bootstrap it discovers KMS URLs and periodically refreshes the
URLs (to handle the server upgrade scenario). The current implementation
caches the URLs in a min-heap. While serving a request, the actor
pops elements from the min-heap and attempts to connect to the server;
on failure, the URL is temporarily stored in a stack, and at the end of
request processing the stack is merged back into the heap.
The code doesn't work as expected if multiple requests
consume the heap, causing the following issues:
1. The min-heap would retain old URLs replaced by the latest refresh (stack merge)
2. The URL discovery file is read more often than expected, as multiple requests can
empty the heap, causing the code to read URLs from the file.
The patch proposes the following policy to cache and maintain URL priority:
1. Unresponsiveness penalty: a flaky KMS connection or overload can cause
requests to time out or fail; each such instance updates the unresponsiveness
penalty of the associated URL context. Further, the penalty is time bound and
decays with time.
2. Cached URLs are sorted once a failure is encountered; the priority followed
is:
2.1. Server(s) with an unresponsiveness penalty are least preferred
2.2. Server(s) with high total-failures are less preferred
2.3. Server(s) with high total-malformed responses are less preferred.
3. Updates RESTClient to throw 'retryable' errors up to the client, such as
'connection_failed' and/or 'timeout'
4. Extend RESTUrl to support IPv6 format.
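
The ordering policy in items 1 and 2 can be sketched as a lexicographic sort over per-URL counters. This is a minimal Python illustration, not the RESTKmsConnector implementation: `UrlContext`, `sort_urls`, the linear decay, and the `PENALTY_WINDOW_SECONDS` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

PENALTY_WINDOW_SECONDS = 60.0  # assumed value, not taken from the patch


@dataclass
class UrlContext:
    """Per-URL bookkeeping (hypothetical field names)."""
    url: str
    total_failures: int = 0
    total_malformed_responses: int = 0
    penalty: float = 0.0
    penalty_updated_at: float = 0.0

    def current_penalty(self, now: float) -> float:
        # Time-bound penalty: decays linearly to zero over the window (assumed).
        elapsed = now - self.penalty_updated_at
        remaining = max(0.0, 1.0 - elapsed / PENALTY_WINDOW_SECONDS)
        return self.penalty * remaining

    def record_unresponsive(self, now: float) -> None:
        # A timeout or failed request adds to whatever penalty is still left.
        self.penalty = self.current_penalty(now) + 1.0
        self.penalty_updated_at = now


def sort_urls(contexts: List[UrlContext], now: float) -> None:
    """Sort in place, most preferred first, following 2.1 through 2.3."""
    contexts.sort(key=lambda c: (c.current_penalty(now),
                                 c.total_failures,
                                 c.total_malformed_responses))
```

Because the penalty is the first sort key, a recently unresponsive server sorts last regardless of its other counters, and it regains its position once the penalty has decayed.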
Testing
RESTUnit - 100K (new test added for coverage)
devRunCorrectness
* EaR: reduce metrics logging
BlobCipherMetrics used to break down counters by usage type (whether it is for tlog, redwood, backup, etc.), and these counters were printed to the trace log even when encryption was not enabled, or when the specific usage was not happening on a node (e.g. a node with only stateless roles would also print blob cipher counters for redwood). We are reducing BlobCipherMetrics logging as follows:
1. By default, do not break down the metrics by usage type; the behavior is controlled by the knob `ENCRYPT_KEY_CACHE_ENABLE_DETAIL_LOGGING`
2. When the detailed breakdown is enabled, the counters are lazily initialized
3. Even if the counters are initialized, they will not be logged if the count is 0 (e.g. if a node was recruited as tlog but then drops the tlog role later on, the tlog counter inside BlobCipherMetrics will not be logged anymore).
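
The three rules above can be sketched as follows. This is a minimal Python illustration, not the BlobCipherMetrics code: `BlobCipherMetricsSketch`, `record`, and `flush` are hypothetical names, and resetting counters after every logging interval is an assumption of this sketch.

```python
from typing import Dict


class BlobCipherMetricsSketch:
    """Hypothetical stand-in for a metrics object with optional per-usage detail."""

    def __init__(self, detail_logging: bool) -> None:
        self._detail_logging = detail_logging    # stands in for the knob
        self._total = 0
        self._by_usage: Dict[str, int] = {}       # rule 2: created lazily

    def record(self, usage: str, count: int = 1) -> None:
        self._total += count
        if self._detail_logging:                  # rule 1: no breakdown by default
            self._by_usage[usage] = self._by_usage.get(usage, 0) + count

    def flush(self) -> Dict[str, int]:
        """Return the counters to log for this interval, then reset them."""
        to_log = {"total": self._total} if self._total else {}
        for usage, count in self._by_usage.items():
            if count != 0:                        # rule 3: skip zero counters
                to_log[usage] = count
        self._total = 0
        for usage in self._by_usage:
            self._by_usage[usage] = 0
        return to_log
```

A usage type that was active in one interval and idle in the next keeps its counter object but drops out of the logged output.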
* buggify BlobCipherMetrics detail logging knob
* format
* EaR: Add test case to validate decryption with invalid key
Description
Extend the BlobCipher unit test to cover the scenario
where a buffer was encrypted with an EncryptionKey K but
decryption was, for some reason, attempted with K'.
Testing
EncryptionUnit.toml - 100K
* EaR: Add test case to validate decryption with invalid key
Description
Address review comments
Testing
* added operational metrics and some polish
* moving consistency scan enablement in simulation tests to main tester workflow
* more stats and throttling polish
Make a local copy of the promise before calling `send` in case the
promise gets destroyed as a result of fulfilling it.
This issue was previously fixed for sending errors to the `result`
promise, but it was never fixed when fulfilling the promise. The issue
manifested as an invalid generation returned when running a `set`
against the configuration database immediately followed by a `get` with
a new transaction object.
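
The pattern can be illustrated with a loose Python analogy. In C++ the hazard is that the owning object, and with it the promise member, is destroyed while `send` is running; Python's garbage collector prevents that exact failure, so this sketch only shows the related hazard of touching the member after a callback has released it. `Promise`, `Request`, and both `fulfil_*` methods are hypothetical names.

```python
from typing import Callable, List, Optional


class Promise:
    """Tiny stand-in for a promise that runs callbacks when fulfilled."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[int], None]] = []
        self.value: Optional[int] = None

    def on_ready(self, callback: Callable[[int], None]) -> None:
        self._callbacks.append(callback)

    def send(self, value: int) -> None:
        self.value = value
        for callback in list(self._callbacks):
            callback(value)  # may drop the last owner of this promise


class Request:
    def __init__(self) -> None:
        self.result: Optional[Promise] = Promise()

    def release(self) -> None:
        # Models the owner giving up the promise as a result of fulfilment.
        self.result = None

    def fulfil_unsafe(self, value: int) -> Optional[int]:
        self.result.send(value)
        return self.result.value   # fails if a callback called release()

    def fulfil_safe(self, value: int) -> Optional[int]:
        result = self.result       # local copy keeps the promise reachable
        result.send(value)
        return result.value
```

With a callback that calls `release()`, `fulfil_safe` still returns the sent value, while `fulfil_unsafe` raises `AttributeError` because the member is gone by the time it is read again.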
* Return const references in PTree accessors
Many usages do not require copying the reference (and incurring the
ref-counting overhead)
* Remove unnecessary refcounting for rotating ptree
* adding consistency scan clear stats and testing in simulation
* Adding test that intentionally injects corruption in consistency scan requests and ensures the scan finds it
* cleanup
* adding assert false to disabled code
* list audits
* cancel audits and corresponding tests
* make audit storage dblock aware
* increase audit retry since we are able to cancel
* fix updateAuditState and fdb github ci
* fmt
* fix fdbcli audit_storage and fix CI issue
* fix fdb cli
* address comments
* fmt
* Added location_metadata fdbcli to query shard locations, assignments, numbers etc.
* Added `listshards` to get some random physical/non-physical shards.
* Resolved comments.
* [fdbserver] workaround the FRT type layout issue to get Swift getVersion working
* MasterData.actor.h: fix comment typo
* masterserver.swift: some tweaks
* masterserver.swift: remove getVersion function, use the method
* masterserver.swift: print replied version to output for tracing
* [swift] add radar links for C++ interop issues found in getVersion bringup
* Update fdbserver.actor.cpp
* Migrate MasterData closer to full reference type
This removes the workaround for the FRT type layout issue, and gets us closer to making MasterData a full reference type
* [interop] require a new toolchain (>= Oct 19th) to build
* [Swift] fix computation of toAdd for getVersion Swift implementation
* add Swift to FDBClient and add async `atLeast` to NotifiedVersion
* fix
* use new atLeast API in master server
* =build fixup link dependencies in swift fdbclient
* clocks
* +clock implement Clock using Flow's notion of time
* [interop] workaround the immortal retain/release issue
* [swift] add script to get latest centos toolchain
* always install swift hooks; not only in "test" mode
* simulator - first thing running WIP
* cleanups
* more cleanup
* working snapshot
* remove sim debug printlns
* added convenience for whenAtLeast
* try Alex's workaround
* annotate nonnull
* cleanup clock a little bit
* fix missing impls after rebase
* Undo the swift_lookup_Map_UID_CommitProxyVersionReplies workaround
No longer needed - the issue was retain/release
* [flow][swift] add Swift version of BUGGIFY
* [swiftication] add CounterValue type to provide value semantics for Counter types on the Swift side
* remove extraneous requestingProxyUID local
* masterserver: initial Swift state prototype
* [interop] make the Swiftied getVersion work
* masterserver - remove the C++ implementation (it can't be supported as state is now missing)
* Remove unnecessary SWIFT_CXX_REF_IMMORTAL annotations from Flow types
* Remove C++ implementation of CommitProxyVersionReplies - it's in Swift now
* [swift interop] remove more SWIFT_CXX_REF_IMMORTAL
* [swift interop] add SWIFT_CXX_IMMORTAL_SINGLETON_TYPE annotation for semantically meaningful immortal uses
* rename SWIFT_CXX_REF_IMMORTAL -> UNSAFE_SWIFT_CXX_IMMORTAL_REF
* Move master server waitForPrev to swift
* =build fix linking swift in all modules
* =build single link option
* =cmake avoid manual math, just get "last" element from list
* implement Streams support (#18)
* [interop] update to new toolchain #6
* [interop] remove C++ vtable linking workarounds
* [interop] make MasterData proper reference counted SWIFT_CXX_REF_MASTERDATA
* [interop] use Swift array to pass UIDs to registerLastCommitProxyVersionReplies
* [interop] expose MasterServer actor to C++ without wrapper struct
* [interop] we no longer need expose on methods 🥳
* [interop] initial prototype of storing CheckedContinuation on the C++ side
* Example of invoking a synchronous swift function from a C++ unit test. (#21)
* move all "tests" we have in Swift, and priority support into real modules (#24)
* Make set continuation functions inline
* Split flow_swift into flow_swift and flow_swift_future to break circular dependency
* rename SwiftContinuationCallbackStruct to FlowCallbackForSwiftContinuation
* Future interop: use a method in a class template for continuation set call
* Revert "Merge pull request #22 from FoundationDB/cpp-continuation" (#30)
* Basic Swift Guide (#29)
Co-authored-by: Alex Lorenz <arphaman@gmail.com>
* Revert "Revert "Merge pull request #22 from FoundationDB/cpp-continuation" (#30)"
This reverts commit c025fe6258.
* Restore the C++ continuation, but it seems waitValue is broken for CInt somehow now
* disable broken tests - waitValue not accessible
* Streams can be async iterated over (#27)
Co-authored-by: Alex Lorenz <arphaman@gmail.com>
* remove work in progress things (#35)
* remove some not used (yet) code
* remove expose func for CInt, it's a primitive so we always have witness info (#37)
* +masterdata implement provideVersions in Swift (#36)
* serveLiveCommittedVersion in Swift (#38)
* Port updateLiveCommittedVersion to swift (#33)
Co-authored-by: Konrad `ktoso` Malawski <konrad_malawski@apple.com>
* Implement updateRecoveryData in Swift (#39)
Co-authored-by: Alex Lorenz <arphaman@gmail.com>
* Simplify flow_swift to avoid multiple targets and generate separate CheckedContinuation header
* Uncomment test which was blocked on extensions not being picked up (#31)
* [interop] Use a separate target for Swift-to-C++ header generation
* reduce boilerplate in future and stream support (#41)
* [interop] require interop v8 - that will fix linker issue (https://github.com/apple/swift/issues/62448)
* [interop] fix swift_stream_support.h Swift include
* [interop] bump up requirement to version 9
* [interop] Generalize the Flow.Optional -> Swift.Optional conversion using generics
* [WIP] masterServer func in Swift (#45)
* [interop] Try conforms_to with a SWIFT_CONFORMS_TO macro for Optional conformance (#49)
* [interop] include FlowOptionalProtocol source file when generating Flow_CheckedContinuation.h
This header generation step depends on the import of the C++ Flow module, which requires the presence of FlowOptionalProtocol
* conform Future to FlowFutureOps
* some notes
* move to value() so we can use discardable result for Flow.Void
* make calling into Swift async funcs nicer by returning Flow Futures
* [interop] hide initial use of FlowCheckedContinuation in flow.h to break dependency cycle
* [fdbserver] fix an EncryptionOpsUtils.h modularization issue (showed up with modularized libc++)
* Pass GCC toolchain using CMAKE_Swift_COMPILE_EXTERNAL_TOOLCHAIN to Swift's clang importer
* [interop] drop the no longer needed libstdc++ include directories
* [cmake] add a configuration check to ensure Swift can import C++ standard library
* [swift] include msgpack from msgpack_DIR
* [interop] make sure the FDB module maps have 'export' directive
* add import 'flow_swift' to swift_fdbserver_cxx_swift_value_conformance.swift
This is needed for CONFORMS_TO to work in imported modules
* make sure the Swift -> C++ manually bridged function signature matches generated signature
* [interop][workaround] force back use of @expose attribute before _Concurrency issue is fixed
* [interop] make getResolutionBalancer return a pointer to allow Swift to use it
We should revert back to a reference once compiler allows references again
* [interop] add a workaround for 'pop' being marked as unsafe in Swift
* masterserver.swift: MasterData returns the Swift actor pointer in an unsafe manner
* Add a 'getCopy' method to AsyncVar to make it more Swift friendly
* [interop] bump up the toolchain requirement
* Revert "[interop][workaround] force back use of @expose attribute before _Concurrency issue is fixed"
This reverts commit b01b271a76.
* [interop] add FIXME comments highlighting new issue workarounds
* [interop] adopt the new C++ interoperability compiler flag
* [interop] generate swift compile commands
* Do not deduplicate Swift compilation commands
* flow actorcompiler.h: add a SWIFT_ACTOR empty macro definition
This is needed to make the actor files parsable by clangd
* [cmake] add missing dependencies
* experimental cross compile
* [cmake] fix triple in cross-compiled cmake flags
* [interop] update to interop toolchain version 16
* [x-compile] add flags for cross-compiling boost
* cleanup x-compile cmake changes
* [cmake] fix typo in CMAKE_Swift_COMPILER_EXTERNAL_TOOLCHAIN config variable
* [interop] pass MasterDataActor from Swift to C++ and back to Swift
* [fdbserver] Swift->C++ header generation for FDBServer should use same module cache path
* Update swift_get_latest_toolchain.sh to fetch 5.9 toolchains
* set HAVE_FLAG_SEARCH_PATHS_FIRST for cross compilation
* Resolve conflicts in net2/sim2/actors, can't build yet
* undo SWIFT_ACTOR changes, not necessary for merge
* guard c++ compiler flags with is_cxx_compile
* Update flow/actorcompiler/ActorParser.cs
Co-authored-by: Evan Wilde <etceterawilde@gmail.com>
* update the boost dependency
* Include boost directory from the container for Swift
* conform flow's Optional to FlowOptionalProtocol again
* Guard entire RocksDBLogForwarder.h with SSD_ROCKSDB_EXPERIMENTAL to avoid failing on missing rocksdb APIs
* remove extraneous merge marker
* [swift] update swift_test_streams.swift to use vars in more places
* Add header guard to flow/include/flow/ThreadSafeQueue.h to fix modularization issue
* Update net and sim impls
* [cmake] use prebuilt libc++ boost only when we're actually using libc++
* [fdbserver] Swift->C++ header generation for FDBServer should use same module cache path
* fixups after merge
* remove CustomStringConvertible conformance that would not be used
* remove self-caused deprecation warnings in future_support
* handle newly added task priority
* reformatting
* future: make value() not mutating
* remove FIXME, not needed anymore
* future: clarify why as functions
* Support TraceEvent in Swift
* Enable TraceEvent using a class wrapper in Swift
* preparing WITH_SWIFT flag
* wip disabled failing Go stuff
* cleanup WITH_SWIFT_FLAG and reenable Go
* wip disabled failing Go stuff
* move setting flag before printing it
* Add SWIFT_IDE_SETUP and cleanup guides and build a bit
* Revert "Wipe packet buffers that held serialized WipedString (#10018)"
This reverts commit e2df6e3302.
* [Swift] Compile workaround in KeyBackedRangeMap; default init is incorrect
* [interop] do not add FlowFutureOps conformance when building flow clang module for Flow checked continuation header pre-generation
* make sure to show -DUSE_LIBCXX=OFF in readme
* readme updates
* do not print to stderr
* Update Swift and C++ code to build with latest Swift 5.9 toolchain now that we no longer support universal references and bridge the methods that take in a constant reference template parameter correctly
* Fix SERVER_KNOBS and enable their use in masterserver
* Bump to C++20, Swift is now able to handle it as well
* Put waitForPrev behind FLOW_WITH_SWIFT knob
* Forward declare updateLiveCommittedVersion
* Remove unused code
* fix wrong condition set for updateLiveCommittedVersion
* Revert "Revert "Wipe packet buffers that held serialized WipedString (#10018)""
This reverts commit 5ad8dce052.
* Enable go-bindings in cmake
* USE_SWIFT flag so we "build without swift" until ready to build with it by default
* uncomment a few tests which were disabled during USE_SWIFT enablement
* the option is WITH_SWIFT, not USE
* formatting
* Fix masterserver compile error
* Fix some build errors.
How did it not merge cleanly? :/
* remove initializer list from constructor
* Expect Swift toolchain only if WITH_SWIFT is enabled
* Don't require Flow_CheckedContinuation when Swift is disabled
* Don't compile FlowCheckedContinuation when WITH_SWIFT=OFF
* No-op Swift macros
* More compile guards
* fix typo
* Run clang-format
* Guard swift/bridging include in fdbrpc
* Remove printf to pass the test
* Remove some more printf to avoid potential issues
TODO: Need to be TraceEvents instead
* Remove __has_feature(nullability) as it's only used in Swift
* Don't use __FILENAME__
* Don't call generate_module_map outside WITH_SWIFT
* Add some more cmake stuff under WITH_SWIFT guard
* Some more guards
* Bring back TLSTest.cpp
* clang-format
* fix comment formatting
* Remove unused command line arg
* fix cmake formatting in some files
* Address some review comments
* fix clang-format error
---------
Co-authored-by: Alex Lorenz <arphaman@gmail.com>
Co-authored-by: Russell Sears <russell_sears@apple.com>
Co-authored-by: Evan Wilde <etceterawilde@gmail.com>
Co-authored-by: Alex Lorenz <aleksei_lorenz@apple.com>
Co-authored-by: Vishesh Yadav <vishesh_yadav@apple.com>
Co-authored-by: Vishesh Yadav <vishesh3y@gmail.com>
Description
Optimize logging emitted from the GetEncryptCipherKey module,
especially logs that are useful for debugging but not very useful
in production
Testing
SwizzledRollbackSideBand - randomSeed (276500218)
devRunCorrectness - 100k
When buggify is enabled, it's possible for the version map to have 5 entries, which is
larger than BACKUP_MAP_KEY_LOWER_LIMIT, causing the range task to be delayed
indefinitely: BackupRangeTaskFunc::_execute() skips the execution and
schedules the task to be added back in BackupRangeTaskFunc::_finish().
Reproduction:
Seed: -f ./tests/slow/SharedDefaultBackupCorrectness.toml -s 3202874095 -b on
-f ./tests/slow/VersionStampBackupToDB.toml -s 1190111003 -b on
Commit: 6e5773dd5 at release-7.3
Build: clang
* Make CodeProbeImpl::_hitCount atomic
* Structure access to TraceLog::logTraceEventMetrics so that it is written before a trace log is opened and only read from one thread after it is opened.
* Fix condition in assert
* Rename TraceLog::log to logMetrics and move initialization of trace log metrics into TraceLog::open
---------
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>