7281 Commits

Author SHA1 Message Date
Jay Zhuang 5130fe99d3
Fix RandomStringSetGenerator may not generate enough unique key (#10503)
For small key sets, RandomStringSetGenerator may not try enough random
keys to generate a unique key set.
Increase the retry count for smaller key sets.
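
The idea can be sketched as follows. This is only an illustration, not the actual RandomStringSetGenerator code: `generateUniqueKeys`, the factor of 10, and the floor of 1000 attempts are all assumed for the example. It shows why a retry budget that scales only with the requested count is too small for small sets, and how a floor helps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>

// Hypothetical helper, not the FoundationDB implementation: draws random
// fixed-length lowercase keys until `count` unique ones are collected or the
// retry budget is exhausted.
std::set<std::string> generateUniqueKeys(std::size_t count, std::size_t keyLen, unsigned seed) {
    assert(keyLen > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> letter(0, 25);

    // A budget proportional to `count` alone is tiny for small sets, so an
    // assumed floor keeps enough attempts available.
    const std::size_t kMinAttempts = 1000;
    const std::size_t maxAttempts = std::max(count * 10, kMinAttempts);

    std::set<std::string> keys;
    for (std::size_t attempt = 0; attempt < maxAttempts && keys.size() < count; ++attempt) {
        std::string key(keyLen, 'a');
        for (char& c : key) c = static_cast<char>('a' + letter(rng));
        keys.insert(key); // std::set silently ignores duplicates
    }
    return keys;
}
```

A caller would compare the size of the returned set with the requested count to detect the case where the budget still ran out.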
2023-06-15 17:23:13 -07:00
Jingyu Zhou b23dbd7105
Merge pull request #10488 from sfc-gh-huliu/describebackup
Fix computeRestoreEndVersion bug when outLogs is null
2023-06-14 21:36:02 -07:00
Yao Xiao 7d1c6e7f17
Improve sharded rocksdb init time. (#10475) 2023-06-15 09:37:32 +08:00
Jay Zhuang ea52e90f03
Remove unnecessary padding for encrypt/decrypt (#9942)
* No padding is needed for AES CTR mode

https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Padding

* Remove EVP_CIPHER_CTX_reset() so the encryptor could be reused

It's only reused in the unit test. In prod, a new encryptor is created for
every new text.
2023-06-14 13:37:29 -07:00
Hui Liu 606d8db75f
Remove blobGranuleLockKeys after blob granule restore (#10477) 2023-06-14 12:41:42 -07:00
Evan Tschannen 74614161b2
Merge pull request #10460 from sfc-gh-etschannen/feature-durable-change-feed
Cache change feeds durably on blob workers
2023-06-14 08:35:52 -07:00
Ata E Husain Bohra bfbf8cd053
EaR: Update KMS URL refresh policy and fix bugs (#10382)
* EaR: Update KMS URL refresh policy and fix bugs

Description

RESTKmsConnector implements discovery and refresh semantics, i.e.
on bootstrap it discovers KMS URLs and periodically refreshes the
URLs (to handle the server upgrade scenario). The current implementation
caches the URLs in a min-heap. While serving a request, the actor
pops elements from the min-heap and attempts to connect to the server;
on failure, the URL is temporarily stored in a stack, and at the end of
request processing the stack is merged back into the heap.
The code doesn't work as expected if multiple requests
consume the heap, causing the following issues:
1. The min-heap would retain old URLs replaced by the latest refresh (stack merge)
2. The URL discovery file is read more often than expected, as multiple requests can
empty the heap, causing the code to read URLs from the file.

Patch proposes the following policy to cache and maintain URL priority:
1. Unresponsiveness penalty: a flaky KMS connection or overload can cause
requests to time out or fail; each such instance updates the unresponsiveness
penalty of the associated URL context. Further, the penalty is time bound and
decays with time.
2. Cached URLs are sorted once a failure is encountered; the priority followed
is:
2.1. Server(s) with an unresponsiveness penalty are least preferred
2.2. Server(s) with high total-failures are less preferred
2.3. Server(s) with high total-malformed responses are less preferred.
3. Update RESTClient to propagate 'retryable' errors up to the client, such as
'connection_failed' and/or 'timeout'
4. Extend RESTUrl to support IPv6 format.
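
The sorting priority in item 2 can be sketched roughly as below. This is an illustration under assumptions, not the RESTKmsConnector code: `UrlCtx`, `rankUrls`, and the field names are hypothetical, and the time-bound penalty is simplified to a plain expiry time rather than a gradually decaying value.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the per-URL context described in the commit message.
struct UrlCtx {
    std::string url;
    double penaltyUntil; // assumed model: the penalty simply expires at this time
    int totalFailures;
    int totalMalformed;

    bool penalized(double now) const { return now < penaltyUntil; }
};

// Orders URLs from most to least preferred: non-penalized first, then fewer
// total failures, then fewer malformed responses.
void rankUrls(std::vector<UrlCtx>& urls, double now) {
    std::stable_sort(urls.begin(), urls.end(), [now](const UrlCtx& a, const UrlCtx& b) {
        return std::make_tuple(a.penalized(now), a.totalFailures, a.totalMalformed) <
               std::make_tuple(b.penalized(now), b.totalFailures, b.totalMalformed);
    });
}
```

Keeping the URLs in a sorted vector instead of popping them from a heap means concurrent requests never empty the shared cache, which is the failure mode the description above calls out.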

Testing

RESTUnit - 100K (new test added for coverage)
devRunCorrectness
2023-06-14 08:06:39 -07:00
Hui Liu f84cedd361 Fix computeRestoreEndVersion bug when outLogs is null 2023-06-13 17:03:57 -07:00
Evan Tschannen eb772c0043 added a blob worker specific page cache size for redwood so that it does not have to be changed manually in fdb.conf for all blob worker processes 2023-06-13 10:35:13 -07:00
Hui Liu 630013cfd9
Fix MoveKeysClean.toml failure (#10470) 2023-06-13 08:45:03 -07:00
Josh Slocum 31e4610b56
misc operational and documentation improvements (#10465)
* misc operational and documentation improvements

* fixing doc build
2023-06-12 15:14:01 -05:00
Evan Tschannen 88eed268c3 added a knob for how many bytes are read from disk 2023-06-11 16:10:20 -07:00
Evan Tschannen b09d6d44eb disable change feed cache by default 2023-06-11 15:24:37 -07:00
Evan Tschannen a8ceadd917 actor cancellation still needs to unset storage 2023-06-11 14:55:05 -07:00
Evan Tschannen 359e178dcd Merge branch 'main' into feature-durable-change-feed
# Conflicts:
#	fdbclient/ClientKnobs.cpp
#	fdbserver/BlobManager.actor.cpp
#	fdbserver/worker.actor.cpp
2023-06-11 13:58:35 -07:00
Evan Tschannen f69f4c73ad addressed review comments 2023-06-11 13:54:38 -07:00
Evan Tschannen 7322e21e23 fixed compiler error 2023-06-11 09:25:05 -07:00
Evan Tschannen 334a868dfe fix: respect end when reading from disk; update the starting version when leaving a hole on disk 2023-06-11 09:24:09 -07:00
Evan Tschannen d03f08f914 fix: not all mutations were being made durable 2023-06-10 18:36:02 -07:00
Evan Tschannen be8d8a8f72 fix: popping the cache was removing too many versions 2023-06-09 16:20:48 -07:00
Evan Tschannen 33a7f57da5 fix: clear the cache when popping change feeds; do not insert versions into the cache that are already durable 2023-06-09 13:49:33 -07:00
Yi Wu 7048ad21a8
EaR: reduce metrics logging (#10453)
* EaR: reduce metrics logging

BlobCipherMetrics used to break down counters by usage type (whether it is for tlog, redwood, backup, etc.), and these counters were printed to the trace log even when encryption was not enabled, or when the specific usage was not happening on a node (e.g. a node with only stateless roles would also print blob cipher counters for redwood). We are reducing BlobCipherMetrics logging by:
1. Defaulting to not breaking down the metrics by usage type; the behavior is controlled by the knob `ENCRYPT_KEY_CACHE_ENABLE_DETAIL_LOGGING`
2. When the detailed breakdown is enabled, the counters are lazily initialized
3. Even if the counters are initialized, they will not be logged if the count is 0 (e.g. if a node was recruited as tlog but then drops the tlog role later on, the tlog counter inside BlobCipherMetrics will not be logged anymore).
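
The three rules above might look like this in miniature. The class below is a hypothetical sketch, not BlobCipherMetrics: `UsageCounters`, `record`, and `logLines` are invented names, and the constructor flag stands in for the knob.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of per-usage counters with an opt-in detailed breakdown.
class UsageCounters {
public:
    explicit UsageCounters(bool detailEnabled) : detailEnabled_(detailEnabled), total_(0) {}

    void record(const std::string& usage, std::int64_t n) {
        total_ += n;
        // Rule 1 and 2: per-usage counters exist only when detail logging is
        // on, and operator[] creates each one lazily on first use.
        if (detailEnabled_) counters_[usage] += n;
    }

    // Returns the lines that would be written to the log.
    std::vector<std::string> logLines() const {
        std::vector<std::string> lines;
        lines.push_back("total=" + std::to_string(total_));
        for (const auto& kv : counters_) {
            // Rule 3: initialized counters with a zero count are skipped.
            if (kv.second != 0) lines.push_back(kv.first + "=" + std::to_string(kv.second));
        }
        return lines;
    }

private:
    bool detailEnabled_;
    std::int64_t total_;
    std::map<std::string, std::int64_t> counters_;
};
```

With detail logging off, only the aggregate line is produced, so a node that never touches a given usage type logs nothing for it.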

* buggify BlobCipherMetrics detail logging knob

* format
2023-06-09 12:07:49 -07:00
neethuhaneesha a71de03cb9
Rocksdb max auto readahead size knob change to 64k (#10449) 2023-06-09 11:26:06 -07:00
Ata E Husain Bohra 31aa06cfbc
EaR: Add test case to validate decryption with invalid key (#10394)
* EaR: Add test case to validate decryption with invalid key

Description

Extend the BlobCipher unit test to provide coverage for the scenario
where a buffer was encrypted with an EncryptionKey K but
decryption was, for some reason, attempted with K'.

Testing

EncryptionUnit.toml - 100K

* EaR: Add test case to validate decryption with invalid key

Description

Address review comments

Testing
2023-06-08 22:32:15 -07:00
Jingyu Zhou f210cf708d
Merge pull request #10436 from jzhou77/main
Add getlocation and getall fdbcli debug commands
2023-06-08 22:05:03 -07:00
Zhe Wang d85f6e95c4
init (#10458) 2023-06-08 18:30:32 -07:00
Josh Slocum b209cd5d19
Consistency scan polish (#10445)
* added operational metrics and some polish

* moving consistency scan enablement in simulation tests to main tester workflow

* more stats and throttling polish
2023-06-08 14:18:58 -05:00
Zhe Wang 8119e3da87
Fix audit storage actor cancel issue (#10443)
* init

* add testAuditStorageConcurrentRunForSameType test
2023-06-08 09:53:32 -07:00
Hui Liu ef93caf344
BlobGranuleRestore - skip applying mutations if restore target version is less than begin version (#10442) 2023-06-08 09:19:25 -07:00
Zhe Wu 77f2caf030 Clean up documentation for PreferWithinShardCount 2023-06-07 22:20:04 -07:00
Lukas Joswiak c3d518409c Fix a bug where fulfilling a promise could cause it to get deleted
Make a local copy of the promise before calling `send` in case the
promise gets destroyed as a result of fulfilling it.

This issue was previously fixed for sending errors to the `result`
promise, but it was never fixed when fulfilling the promise. The issue
manifested as an invalid generation returned when running a `set`
against the configuration database immediately followed by a `get` with
a new transaction object.
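
The hazard and the fix can be illustrated with a toy promise. This is a sketch only: `ToyPromise`, `Request`, and `fulfill` are hypothetical and use `std::shared_ptr` rather than Flow's own `Promise` type, but the lifetime problem is the same shape. Fulfilling the promise runs a callback, the callback destroys the object that owns the promise, and without a local copy the promise would be destroyed in the middle of `send`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Toy reference-counted promise; hypothetical, not Flow's Promise<T>.
class ToyPromise {
    struct State {
        std::function<void(int)> callback;
        bool fulfilled;
        State() : fulfilled(false) {}
    };
    std::shared_ptr<State> state_;

public:
    ToyPromise() : state_(std::make_shared<State>()) {}

    void onFulfilled(std::function<void(int)> cb) { state_->callback = std::move(cb); }

    void send(int v) {
        state_->fulfilled = true;
        // The callback may destroy whoever owns this promise.
        if (state_->callback) state_->callback(v);
    }
};

// Hypothetical owner of a promise.
struct Request {
    ToyPromise result;
};

void fulfill(std::unique_ptr<Request>& req, int v) {
    // The local copy shares the state and keeps it alive even if the
    // callback resets `req` while send() is still running.
    ToyPromise local = req->result;
    local.send(v);
}
```

Calling `req->result.send(v)` directly would leave `send` running on a destroyed object once the callback resets `req`, which is the kind of use-after-free the commit guards against.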
2023-06-07 15:26:08 -07:00
Jingyu Zhou b8c0087ca6 Fix compiling errors 2023-06-07 15:10:00 -07:00
Zhe Wu 6b17f9fcf3 Adding PreferWithinShardLimit option 2023-06-07 14:38:58 -07:00
Jingyu Zhou 614686f737 Add getlocation and getall fdbcli debug commands
getlocation: returns the SS list for a key
getall: returns both the SS list and values on the SS for a key
2023-06-07 14:36:16 -07:00
sfc-gh-tclinkenbeard 6b28c53211 Merge remote-tracking branch 'origin/main' into main-fix-op-cost-bug 2023-06-07 09:58:29 -07:00
Evan Tschannen 197c39b552 cache change feeds using a storage engine to avoid reading them for the server on startup 2023-06-07 08:41:31 -07:00
Andrew Noyes bdec6bf5b9
Remove some unnecessary ref-counting in the PTree (#10401)
* Return const references in PTree accessors

Many usages do not require copying the reference (and incurring the
ref-counting overhead)

* Remove unnecessary refcounting for rotating ptree
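
A small sketch of the difference between the two accessor styles follows. It is an assumption-laden illustration: `Node`, `leftCopy`, `leftRef`, and `minKey` are hypothetical, and `std::shared_ptr` stands in for the reference type the PTree actually uses.

```cpp
#include <cassert>
#include <memory>

// Hypothetical tree node, not the actual PTree.
struct Node {
    int key;
    std::shared_ptr<Node> left;
    std::shared_ptr<Node> right;

    // By-value accessor: every call copies the pointer, so the reference
    // count is incremented and later decremented.
    std::shared_ptr<Node> leftCopy() const { return left; }

    // By-reference accessor: no ref-count traffic unless the caller copies.
    const std::shared_ptr<Node>& leftRef() const { return left; }
};

// Walks to the leftmost node without touching any reference count.
int minKey(const std::shared_ptr<Node>& root) {
    const Node* n = root.get();
    while (n->leftRef()) n = n->leftRef().get();
    return n->key;
}
```

Callers that only read through the pointer, such as the traversal above, pay nothing for reference counting; callers that need to keep the child alive can still copy the returned reference explicitly.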
2023-06-07 09:49:48 -05:00
Josh Slocum 220b7d1a37
Consistency scan test improvements (#10402)
* adding consistency scan clear stats and testing in simulation

* Adding test that intentionally injects corruption in consistency scan requests and ensures the scan finds it

* cleanup

* adding assert false to disabled code
2023-06-07 07:21:47 -05:00
Zhe Wang f8f8f72c4e
Add audit storage cancellation (#10386)
* list audits

* cancel audits and corresponding tests

* make audit storage dblock aware

* increase audit retry since we are able to cancel

* fix updateAuditState and fdb github ci

* fmt

* fix fdbcli audit_storage and fix CI issue

* fix fdb cli

* address comments

* fmt
2023-06-06 14:29:53 -07:00
He Liu fc8543125c
Added location_metadata fdbcli to query shard locations, assignments… (#10395)
* Added location_metadata fdbcli to query shard locations, assignments, numbers etc.

* Added `listshards` to get some random physical/non-physical shards.

* Resolved comments.
2023-06-06 10:33:48 -07:00
Hui Liu 8fcac8a9a9
Dump manifest by using multiple transactions (#10380) 2023-06-05 11:22:29 -07:00
Konrad `ktoso` Malawski c26aa0b2a3
Introduce initial Swift support in fdbserver (#10156)
* [fdbserver] workaround the FRT type layout issue to get Swift getVersion working

* MasterData.actor.h: fix comment typo

* masterserver.swift: some tweaks

* masterserver.swift: remove getVersion function, use the method

* masterserver.swift: print replied version to output for tracing

* [swift] add radar links for C++ interop issues found in getVersion bringup

* Update fdbserver.actor.cpp

* Migrate MasterData closer to full reference type

This removes the workaround for the FRT type layout issue, and gets us closer to making MasterData a full reference type

* [interop] require a new toolchain (>= Oct 19th) to build

* [Swift] fix computation of toAdd for getVersion Swift implementation

* add Swift to FDBClient and add async `atLeast` to NotifiedVersion

* fix

* use new atLeast API in master server

* =build fixup link dependencies in swift fdbclient

* clocks

* +clock implement Clock using Flow's notion of time

* [interop] workaround the immortal retain/release issue

* [swift] add script to get latest centos toolchain

* always install swift hooks; not only in "test" mode

* simulator - first thing running WIP

* cleanups

* more cleanup

* working snapshot

* remove sim debug printlns

* added convenience for whenAtLeast

* try Alex's workaround

* annotate nonnull

* cleanup clock a little bit

* fix missing impls after rebase

* Undo the swift_lookup_Map_UID_CommitProxyVersionReplies workaround

No longer needed - the issue was retain/release

* [flow][swift] add Swift version of BUGGIFY

* [swiftication] add CounterValue type to provide value semantics for Counter types on the Swift side

* remove extraneous requestingProxyUID local

* masterserver: initial Swift state prototype

* [interop] make the Swiftied getVersion work

* masterserver - remove the C++ implementation (it can't be supported as state is now missing)

* Remove unnecessary SWIFT_CXX_REF_IMMORTAL annotations from Flow types

* Remove C++ implementation of CommitProxyVersionReplies - it's in Swift now

* [swift interop] remove more SWIFT_CXX_REF_IMMORTAL

* [swift interop] add SWIFT_CXX_IMMORTAL_SINGLETON_TYPE annotation for semantically meaningful immortal uses

* rename SWIFT_CXX_REF_IMMORTAL -> UNSAFE_SWIFT_CXX_IMMORTAL_REF

* Move master server waitForPrev to swift

* =build fix linking swift in all modules

* =build single link option

* =cmake avoid manual math, just get "last" element from list

* implement Streams support (#18)

* [interop] update to new toolchain #6

* [interop] remove C++ vtable linking workarounds

* [interop] make MasterData proper reference counted SWIFT_CXX_REF_MASTERDATA

* [interop] use Swift array to pass UIDs to registerLastCommitProxyVersionReplies

* [interop] expose MasterServer actor to C++ without wrapper struct

* [interop] we no longer need expose on methods 🥳

* [interop] initial prototype of storing CheckedContinuation on the C++ side

* Example of invoking a synchronous swift function from a C++ unit test. (#21)

* move all "tests" we have in Swift, and priority support into real modules (#24)

* Make set continuation functions inline

* Split flow_swift into flow_swift and flow_swift_future to break circular dependency

* rename SwiftContinuationCallbackStruct to FlowCallbackForSwiftContinuation

* Future interop: use a method in a class template for continuation set call

* Revert "Merge pull request #22 from FoundationDB/cpp-continuation" (#30)

* Basic Swift Guide (#29)

Co-authored-by: Alex Lorenz <arphaman@gmail.com>

* Revert "Revert "Merge pull request #22 from FoundationDB/cpp-continuation" (#30)"

This reverts commit c025fe6258.

* Restore the C++ continuation, but it seems waitValue is broken for CInt somehow now

* disable broken tests - waitValue not accessible

* Streams can be async iterated over (#27)

Co-authored-by: Alex Lorenz <arphaman@gmail.com>

* remove work in progress things (#35)

* remove some not used (yet) code

* remove expose func for CInt, it's a primitive so we always have witness info (#37)

* +masterdata implement provideVersions in Swift (#36)

* serveLiveCommittedVersion in Swift (#38)

* Port updateLiveCommittedVersion to swift (#33)

Co-authored-by: Konrad `ktoso` Malawski <konrad_malawski@apple.com>

* Implement updateRecoveryData in Swift (#39)

Co-authored-by: Alex Lorenz <arphaman@gmail.com>

* Simplify flow_swift to avoid multiple targets and generate separate CheckedContinuation header

* Uncomment test which was blocked on extensions not being picked up (#31)

* [interop] Use a separate target for Swift-to-C++ header generation

* reduce boilerplate in future and stream support (#41)

* [interop] require interop v8 - that will fix linker issue (https://github.com/apple/swift/issues/62448)

* [interop] fix swift_stream_support.h Swift include

* [interop] bump up requirement to version 9

* [interop] Generalize the Flow.Optional -> Swift.Optional conversion using generics

* [WIP] masterServer func in Swift (#45)

* [interop] Try conforms_to with a SWIFT_CONFORMS_TO macro for Optional conformance (#49)

* [interop] include FlowOptionalProtocol source file when generating Flow_CheckedContinuation.h

This header generation step depends on the import of the C++ Flow module, which requires the presence of FlowOptionalProtocol

* conform Future to FlowFutureOps

* some notes

* move to value() so we can use discardable result for Flow.Void

* make calling into Swift async funcs nicer by returning Flow Futures

* [interop] hide initial use of FlowCheckedContinuation in flow.h to break dependency cycle

* [fdbserver] fix an EncryptionOpsUtils.h modularization issue (showed up with modularized libc++)

* Pass GCC toolchain using CMAKE_Swift_COMPILE_EXTERNAL_TOOLCHAIN to Swift's clang importer

* [interop] drop the no longer needed libstdc++ include directories

* [cmake] add a configuration check to ensure Swift can import C++ standard library

* [swift] include msgpack from msgpack_DIR

* [interop] make sure the FDB module maps have 'export' directive

* add import 'flow_swift' to swift_fdbserver_cxx_swift_value_conformance.swift

This is needed for CONFORMS_TO to work in imported modules

* make sure the Swift -> C++ manually bridged function signature matches generated signature

* [interop][workaround] force back use of @expose attribute before _Concurrency issue is fixed

* [interop] make getResolutionBalancer return a pointer to allow Swift to use it

We should revert back to a reference once compiler allows references again

* [interop] add a workaround for 'pop' being marked as unsafe in Swift

* masterserver.swift: MasterData returns the Swift actor pointer in an unsafe manner

* Add a 'getCopy' method to AsyncVar to make it more Swift friendly

* [interop] bump up the toolchain requirement

* Revert "[interop][workaround] force back use of @expose attribute before _Concurrency issue is fixed"

This reverts commit b01b271a76.

* [interop] add FIXME comments highlighting new issue workarounds

* [interop] adopt the new C++ interoperability compiler flag

* [interop] generate swift compile commands

* Do not deduplicate Swift compilation commands

* [interop] generate swift compile commands

* Do not deduplicate Swift compilation commands

* flow actorcompiler.h: add a SWIFT_ACTOR empty macro definition

This is needed to make the actor files parsable by clangd

* [cmake] add missing dependencies

* experimental cross compile

* [cmake] fix triple in cross-compiled cmake flags

* [interop] update to interop toolchain version 16

* [x-compile] add flags for cross-compiling boost

* cleanup x-compile cmake changes

* [cmake] fix typo in CMAKE_Swift_COMPILER_EXTERNAL_TOOLCHAIN config variable

* [interop] pass MasterDataActor from Swift to C++ and back to Swift

* [fdbserver] Swift->C++ header generation for FDBServer should use same module cache path

* Update swift_get_latest_toolchain.sh to fetch 5.9 toolchains

* set HAVE_FLAG_SEARCH_PATHS_FIRST for cross compilation

* Resolve conflicts in net2/sim2/actors, can't build yet

* undo SWIFT_ACTOR changes, not necessary for merge

* guard c++ compiler flags with is_cxx_compile

* Update flow/actorcompiler/ActorParser.cs

Co-authored-by: Evan Wilde <etceterawilde@gmail.com>

* update the boost dependency

* Include boost directory from the container for Swift

* conform flow's Optional to FlowOptionalProtocol again

* Guard entire RocksDBLogForwarder.h with SSD_ROCKSDB_EXPERIMENTAL to avoid failing on missing rocksdb APIs

* remove extraneous merge marker

* [swift] update swift_test_streams.swifto to use vars in more places

* Add header guard to flow/include/flow/ThreadSafeQueue.h to fix modularization issue

* Update net and sim impls

* [cmake] use prebuilt libc++ boost only when we're actually using libc++

* [fdbserver] Swift->C++ header generation for FDBServer should use same module cache path

* fixups after merge

* remove CustomStringConvertible conformance that would not be used

* remove self-caused deprecation warnings in future_support

* handle newly added task priority

* reformatting

* future: make value() not mutating

* remove FIXME, not needed anymore

* future: clarify why as functions

* Support TraceEvent in Swift

* Enable TraceEvent using a class wrapper in Swift

* preparing WITH_SWIFT flag

* wip disabled failing Go stuff

* cleanup WITH_SWIFT_FLAG and reenable Go

* wip disabled failing Go stuff

* move setting flag before printing it

* Add SWIFT_IDE_SETUP and cleanup guides and build a bit

* Revert "Wipe packet buffers that held serialized WipedString (#10018)"

This reverts commit e2df6e3302.

* [Swift] Compile workaround in KeyBackedRangeMap; default init is incorrect

* [interop] do not add FlowFutureOps conformance when building flow clang module for Flow checked continuation header pre-generation

* make sure to show  -DUSE_LIBCXX=OFF in readme

* readme updates

* do not print to stderr

* Update Swift and C++ code to build with latest Swift 5.9 toolchain now that we no longer support universal references and bridge the methods that take in a constant reference template parameter correctly

* Fix SERVER_KNOBS and enable using them for masterserver

* Bump to C++20, Swift is now able to handle it as well

* Put waitForPrev behind FLOW_WITH_SWIFT knob

* Forward declare updateLiveCommittedVersion

* Remove unused code

* fix wrong condition set for updateLiveCommittedVersion

* Revert "Revert "Wipe packet buffers that held serialized WipedString (#10018)""

This reverts commit 5ad8dce052.

* Enable go-bindings in cmake

* Revert "Revert "Wipe packet buffers that held serialized WipedString (#10018)""

This reverts commit 5ad8dce052.

* USE_SWIFT flag so we "build without swift" until ready to by default

* uncomment a few tests which were disabled during USE_SWIFT enablement

* the option is WITH_SWIFT, not USE

* formatting

* Fix masterserver compile error

* Fix some build errors.

How did it not merge cleanly? :/

* remove initializer list from constructor

* Expect Swift toolchain only if WITH_SWIFT is enabled

* Don't require Flow_CheckedContinuation when Swift is disabled

* Don't compile FlowCheckedContinuation when WITH_SWIFT=OFF

* No-op Swift macros

* More compile guards

* fix typo

* Run clang-format

* Guard swift/bridging include in fdbrpc

* Remove printf to pass the test

* Remove some more printf to avoid potential issues

TODO: Need to be TraceEvents instead

* Remove __has_feature(nullability) as it's only used in Swift

* Don't use __FILENAME__

* Don't call generate_module_map outside WITH_SWIFT

* Add some more cmake stuff under WITH_SWIFT guard

* Some more guards

* Bring back TLSTest.cpp

* clang-format

* fix comment formatting

* Remove unused command line arg

* fix cmake formatting in some files

* Address some review comments

* fix clang-format error

---------

Co-authored-by: Alex Lorenz <arphaman@gmail.com>
Co-authored-by: Russell Sears <russell_sears@apple.com>
Co-authored-by: Evan Wilde <etceterawilde@gmail.com>
Co-authored-by: Alex Lorenz <aleksei_lorenz@apple.com>
Co-authored-by: Vishesh Yadav <vishesh_yadav@apple.com>
Co-authored-by: Vishesh Yadav <vishesh3y@gmail.com>
2023-06-02 16:09:28 -05:00
Nim Wijetunga 95bf14323f
EKP and KMS Health Check (#10341)
EKP and KMS Health Check
2023-06-01 16:24:04 -07:00
Trevor Clinkenbeard e6fb1e6a47
Merge pull request #10389 from sfc-gh-tclinkenbeard/main-holt-linear-smoother
Add `HoltLinearSmoother` class
2023-06-01 09:22:10 -07:00
sfc-gh-tclinkenbeard 3c6941192e Merge remote-tracking branch 'origin/main' into main-fix-op-cost-bug 2023-05-31 19:05:39 -07:00
sfc-gh-tclinkenbeard f647a1289e Split GLOBAL_TAG_THROTTLING_FOLDING_TIME into several knobs 2023-05-31 17:20:32 -07:00
Ata E Husain Bohra 4f21e0cfcd
EaR: Optimize logging from GetEncryptCipherKey (#10326)
Description

Optimize logging emitted from the GetEncryptCipherKey module,
especially logging that is mainly useful for debugging and not very useful
in production

Testing

SwizzledRollbackSideBand - randomSeed (276500218)
devRunCorrectness - 100k
2023-05-31 16:56:07 -07:00
Trevor Clinkenbeard 7cf901df43
Merge pull request #10354 from sfc-gh-tclinkenbeard/main-tag-counter-optimizations
Improve performance of `TransactionTagCounter`
2023-05-31 16:08:12 -07:00
Jingyu Zhou 36f2f9015e
Merge pull request #10357 from w41ter/main
Fix restore range loss
2023-05-31 15:43:25 -07:00
sfc-gh-tclinkenbeard b227c0ca27 Merge remote-tracking branch 'origin/main' into main-fix-op-cost-bug 2023-05-31 13:57:07 -07:00
He Liu aeb7f2bd4d Merge branch 'main' of https://github.com/apple/foundationdb into disable-physical-shard-move 2023-05-30 15:27:57 -07:00
Jingyu Zhou 43d67d6f98 Should repeat when speedUpSimulation is false 2023-05-30 11:08:48 -07:00
Jingyu Zhou 0674984ab1 Fix a simulation DR stuck issue
When buggify is enabled, it's possible the version map has 5 entries, which is
larger than BACKUP_MAP_KEY_LOWER_LIMIT, causing the range task to be delayed
indefinitely: the BackupRangeTaskFunc::_execute() skips the execution and
schedules the task to be added back in BackupRangeTaskFunc::_finish().

Reproduction:
  Seed: -f ./tests/slow/SharedDefaultBackupCorrectness.toml -s 3202874095 -b on
        -f ./tests/slow/VersionStampBackupToDB.toml -s 1190111003 -b on
  Commit: 6e5773dd5 at release-7.3
  Build: clang
2023-05-30 09:50:24 -07:00
Zhe Wang 61aaca005e
SS Audit Storage Throttling (#10322)
* ss audit storage throttling

* add audit manager to ss

* reduce CONCURRENT_AUDIT_TASK_COUNT_MAX

* revises comments

* fix audit cli

* fix getAuditStates

* remove toStringForCLI
2023-05-29 14:43:47 -07:00
w41ter abd23958c2 Fix restore range loss 2023-05-29 11:39:07 +08:00
sfc-gh-tclinkenbeard 0cfbe4ccc1 Fix get*OperationCost functions for empty mutations/results 2023-05-28 13:17:24 -07:00
sfc-gh-tclinkenbeard a1b0a6b35e Merge remote-tracking branch 'origin/main' into main-tag-counter-optimizations 2023-05-27 13:04:11 -07:00
Jingyu Zhou d0e9a14b73
Merge pull request #10324 from liquid-helium/delete-data-move-checkpoints-by-id
Removed ENABLE_DD_PHYSICAL_SHARD_MOVE
2023-05-27 10:07:02 -07:00
sfc-gh-tclinkenbeard 926a7cbb4d Add AUTO_TAG_THROTTLE_SPRING_BYTES_STORAGE_SERVER knob 2023-05-26 16:10:37 -07:00
sfc-gh-tclinkenbeard 2385dd36f3 Update GLOBAL_TAG_THROTTLING_FOLDING_TIME default to 10.0 2023-05-26 16:10:37 -07:00
Zhe Wang 53675db306
Fix audit storage issue with multiple DDs (#10310)
* init

* add DDAuditContext

* move metadata update before runauditstorage

* revert DDAuditContext and replace ddAuditId with ddId

* cleanup
2023-05-26 15:56:03 -07:00
He Liu caed7ed374 Merge branch 'main' of https://github.com/apple/foundationdb into delete-data-move-checkpoints-by-id 2023-05-26 15:37:21 -07:00
He Liu 08de38120d Merge branch 'main' of https://github.com/apple/foundationdb into disable-physical-shard-move 2023-05-26 11:54:52 -07:00
Vaidas Gasiunas 60753b5b57
Fix a couple thread-safety issues (#10359)
* Make CodeProbeImpl::_hitCount atomic

* Structure access to TraceLog::logTraceEventMetrics so that it is written before a trace log is opened and only read from one thread after it is opened.

* Fix condition in assert

* Rename TraceLog::log to logMetrics and move initialization of trace log metrics into TraceLog::open

---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-26 19:36:02 +02:00
sfc-gh-tclinkenbeard 67c53eb203 Decrease default value for TAG_MEASUREMENT_INTERVAL to 5.0 2023-05-26 08:15:45 -07:00
sfc-gh-tclinkenbeard cf32ba4d8c Increase default value for SS_THROTTLE_TAGS_TRACKED to 5 2023-05-26 08:15:45 -07:00
sfc-gh-tclinkenbeard f741a584c0 Improve performance of TransactionTagCounter 2023-05-26 08:15:41 -07:00
sfc-gh-tclinkenbeard e724c90ffe Remove unnecessary GLOBAL_TAG_THROTTLING_MIN_TPS knob 2023-05-25 16:45:32 -07:00
sfc-gh-tclinkenbeard 71846070d6 Update default tag throttling knob values 2023-05-25 16:45:32 -07:00
He Liu d21d85e4b6 Merge branch 'main' of https://github.com/apple/foundationdb into disable-physical-shard-move 2023-05-25 12:25:44 -07:00
He Liu 1900b63acd Merge branch 'main' of https://github.com/apple/foundationdb into delete-data-move-checkpoints-by-id 2023-05-24 13:41:02 -07:00
Ankita Kejriwal 9373191e0a
Fix two bugs in checkExclusion() and add a trace event for better observability (#10330)
* Fix a division in checkExclusion() to be double and add a trace event

* Update the ssExcludedCount only if the role is storage
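
The class of bug behind the first bullet is easy to reproduce in isolation. The functions below are hypothetical and do not come from checkExclusion(); what exactly that function divides is not stated here, so the "excluded fraction" is only an assumed example.

```cpp
#include <cassert>

// Buggy variant: both operands are int, so the division truncates before the
// result is converted to double.
double excludedFractionBuggy(int excluded, int total) {
    return excluded / total;
}

// Fixed variant: converting one operand first makes the division
// floating-point.
double excludedFraction(int excluded, int total) {
    assert(total > 0);
    return static_cast<double>(excluded) / total;
}
```

With 1 excluded server out of 4, the buggy variant yields 0.0 while the fixed one yields 0.25, so any threshold comparison on the ratio behaves very differently.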
2023-05-24 10:58:03 -07:00
Jingyu Zhou 1712691da5
Merge pull request #10328 from sfc-gh-jslocum/knob_allow_relative_path_blob_container
adding knob to allow relative paths for local backup containers
2023-05-24 10:30:02 -07:00
He Liu 9100507928 Disable physical shard move by default. 2023-05-24 08:51:13 -07:00
Jingyu Zhou 13800ae1a8 Increase BW_RK_SIM_QUIESCE_DELAY to 400s
The blob worker needs more time to catch up, about 388s in the failed simulation
test.

Reproduction:
  seed: -f ./tests/slow/BlobGranuleVerifyLargeClean.toml -s 4068151139 -b on
  commit: 3bdd71cb0 at release-7.3 branch
  build: gcc
2023-05-23 15:54:56 -07:00
Josh Slocum 8f241632af adding knob to allow relative paths for local backup containers 2023-05-23 17:06:49 -05:00
He Liu 5160f91e78 Removed SHARD_ENCODE_LOCATION_METADATA. 2023-05-23 13:39:25 -07:00
Josh Slocum d038154d69
re-enabling change feed coalesce knob (#10317) 2023-05-23 14:43:11 -05:00
He Liu 8ad7ec6fdf
Psm ss (#9817)
* Update NativeAPI getCheckpointForRange().

* Implemented checkpoint in SS.

* clean up.

* Disabled StorageServerCheckpointTest.

* Serialized checkpoint creation and deletion.

Simplified checkpoint GC, via deleting CheckpointMetaData::dir.

* Fixed PhysicalShardMove test, where the fetchCheckpoint target range was set incorrectly.

* Minor improvements on CheckpointMetaData and DataMoveMetaData.

* fmt.

* Optimized PhysicalShardMove test

cleanup.

* Refactored ShardedRocks checkpoint/restore for psm.

* Complete ShardedRocks::restore.

* dismiss operation_obsolete, and throw actor_cancelled.

* Validate checkpoint when !asKeyValues.

* fmt.

* Don't read from uninitialized physical shard.

* Resolved comments.

* cleanup.

* Added verify_checksum_before_restore for ShardedRocks.

* Added ShardedRocksDB checkpoint/restore unit test.

* Populate CheckpointMetaData::dir in RocksDB.

* Rename MovingIn as Adding.

* Added StorageServerUtils.

* Added physical shard move in SS.

* Fix on ApplyMetaData, doFetchFile error handling etc.

* Debugging incorrect shard size.

* Create/delete checkpoints only when Physical shard move is enabled.

* Added back SHARD_ENCODE_LOCATION_METADATA.

* Fixed bytesSample incorrect issue.

Essentially dedicated CheckpointRocksDBCF as key-value based checkpoint, will need to add a new format for the file-based checkpoint.

* Cleanup.

* Cleanup & compile rocksdb with 8.1 branch.

* clean up.

* clean up.

* Allowed request_maybe_delivered error type in FetchShard.

* Added FDBRocksDBVersion.h.

* Fixed stuck fetchShard.

* Don't create checkpoint on TSS.

* Upgrade to RocksDB 8.1.1

* Cleanup.

* Fixed accidentally deleted db_path and name fields.

* Improved trace event.

* Removed redundancies from previous ShardedrocksDB.

* Cleanup.

* cleanup.

* cleanup.

* rename `state`.

* Cleanup.

* Removed excessive TraceEvent.

* * Fixed shardMap race condition on different threads
* Added *Stats, logging data move rates.
* Added `DD_PHYSICAL_SHARD_MOVE_PROBABILITY` to support hybrid data move.

* Resolved comments.

* fmt.

* Use physical shard move in PhysicalShardMoveTest.

* Enforce physical-shard-move for PhysicalShardMoveTest.

* fmt
2023-05-23 11:18:35 -07:00
Xiaoxi Wang 969196d8ba Add read ops shard metrics notify bound 2023-05-23 09:46:34 -07:00
Josh Slocum 629b068145
Bg tenant metadata restarting (#10235)
* making blob metadata optionally deterministic across runs

* Non restarting test passes after refactor

* adding downgrade version test

* formatting
2023-05-23 11:24:13 -05:00
He Liu eaa934dac6
Added more logs about shard management. (#10303) 2023-05-22 18:00:00 -07:00
Yao Xiao bbf15be05f
Knobs to speed up DB open. (#10301) 2023-05-22 16:21:05 -07:00
Vaidas Gasiunas 9bc55f67c3
Fix releasing watches on future cancellation (#10304)
* Test watch cleanup on cancel

* Fix clearing the database in Java integration tests

* Always cancel the futures wrapped by MVC abortable futures

* More tests for watch cleanup

* Fix clearing the database in some Java integration tests
2023-05-22 22:01:27 +02:00
Zhe Wang 6c980862c3
Improve throughput of audit storage (#10245)
* improve audit throughput

* if ssshard fails to do audit due to ssi failure, then global retry is required

* fix a trace event name

* fix budget release in doAudit

* avoid throttling in general simulation tests

* fix doAuditOnStorageServer throw error

* avoid starting a task that has been complete

* when ddaudit ssshard failed, check if ssi is removed, if yes, silently exit

* fix trace detail name of AuditUtilStorageServerRemovedEnd event

* redo schedule in doAuditOnStorageServer

* schedule does not wait doAudit

* remove TESTING_AUDIT_STORAGE_THROTTLING

* ssaudit stops proceeding if ddauditstate is not in running phase

* make tester audit storage only happen in simulation, and randomly set CONCURRENT_AUDIT_TASK_COUNT_MAX
2023-05-22 12:09:08 -07:00
sfc-gh-tclinkenbeard 7ef66ab356 Add OutstandingWatches and WatchMapSize to TransactionMetrics 2023-05-22 12:07:10 -07:00
Ata E Husain Bohra 2b0a08dbe4
BlobMetadata: Move SimBlobMetadata store to SimKmsVault (#10269)
Description

Patch refactors SimKmsConnector to move SimBlobMetadata store to SimKmsVault

Testing

BlobGranuleCorrectness - 100K
/fdbserver/blob/connectionprovider - 100K
devRunCorrectness - 100K
2023-05-22 11:00:59 -07:00
Hui Liu 7ca13d8f9c
support blob restore in fdbrestore (#10248) 2023-05-19 14:45:14 -07:00
Zhe Wu 93ad70db38
Merge pull request #10263 from halfprice/zhewu/gc-generation-using-recoverat
GC earlier TLog generation using each generation's `recover at` version instead of `start version`
2023-05-19 12:07:02 -07:00
Jefferson Zhong 3760522dc2 Make stepSize configurable for preloadApplyMutationsKeyVersionMap 2023-05-19 10:57:30 -07:00
Yao Xiao cef93f7d22
knobs (#10253) 2023-05-18 14:58:09 -07:00
Josh Slocum 2916a11a86
New ConsistencyScan (#10265)
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.

* Initial progress checkpoint on new ConsistencyScan role.

* Updated TODOs, finished most if not all state updates.

* placeholder

* Add more TODOs, documentation and comment improvements.

* Checkpoint round state to avoid advancing progress if commit fails.

* Bug fix, check is supposed to be for overlap, not lack of overlap.

* Added more TODO's and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.

* Update JSON schemas and command help.

* Add comment about lifetime stats reset.

* More TODO comments and some renames for clarity, some bug fixes.

* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail

* removing trailing comma from consistency_scan json schema

* Making CC inconsistency not an error if it's intentional tss corruption

* consistency scan actually reads storage locations

* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck

* made consistency scan properly fetch database size

* refactoring data check to be used in both consistency scan and consistency check

* checking that consistency scan always completes at least one round and doesn't get stuck

* cleanup

* fixing ide build

* consistencyscan fdbcli command wasn't actually changing db state

* consistencyscan fdbcli command always said enabled even when it wasn't

---------

Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
2023-05-18 15:02:41 -05:00
Ata E Husain Bohra e25b9ff686
EaR: REST based Simulated KMS Vault request handler interface (#10240)
* EaR: REST based Simulated KMS Vault request handler interface

Description

  diff-1: Address review comments
             Improve unit test case coverage
  diff-2: Extend RESTKmsConnectorUtil to generate HTTP::Header

EaR simulation testing is currently driven through the SimKmsConnector
interface, which exposes endpoints invoked directly by EKP to fetch
encryption keys. This approach avoids testing the RESTKms communication
path. The FDB codebase was recently extended with an HTTPServer
interface, whose absence had been a gap preventing end-to-end testing of
EaR code.

Patch proposes following changes:
1. Refactor RESTKmsConnector to move common code and definitions
to RESTKmsConnectorUtil namespace
2. Introduce RESTSimKmsVault accepting HTTP format requests and
providing appropriate HTTP response.

Testing

RESTUnit          100K + 5k valgrind
devRunCorrectness 100K

2023-05-17 12:38:09 -07:00
Zhe Wu 0bdfe1889b Add recovered at in CSTATE, and use a knob to guard the use of it 2023-05-16 12:47:00 -07:00
Josh Slocum 185e7d9f30
fixing BlobGranuleRequests to properly bump read version on retry (#10216) 2023-05-16 14:12:00 -05:00
Josh Slocum 3ea16ff579
Blob kms connector ids (#10121)
* blob metadata refactor to use location id and simplify rest api

* buggifying different ordering of locations in blob metadata response
2023-05-16 13:10:11 -05:00
neethuhaneesha 854464a6af
Hex values in TSS logs and rocksdb debuglogs mode knob (#10231) 2023-05-16 10:34:58 -07:00
Zhe Wang 852e012eb2
Adding throttling of audit storage tasks and tracing progress of tasks (#10233)
* when trigger doAuditOnStorageServer, check remainingBudgetForAuditTasks

* add trace event of audit progress

* address comments

* code clean up

* make dispatch and schedule audit be more clear

* make dispatch and schedule audit be more clear 2

* make dispatch and schedule audit be more clear 3

* address comments
2023-05-15 16:19:41 -07:00
Jingyu Zhou 9675f13ba9 Reduce STORAGE_FETCH_KEYS_DELAY to speedup data movement
Buggified value of 100s is too long and causes consistency check failures.
2023-05-15 13:56:08 -07:00
A.J. Beamon 712fefd59f
Merge pull request #10213 from sfc-gh-ajbeamon/tenant-code-probes
Add code probes for tenant and metacluster code
2023-05-15 12:13:00 -07:00
Sam Gwydir 6c16875c34
Add networkoption to disable non-TLS connections (#9984)
* Add networkoption to disable non-TLS connections

* add disable plaintext connection to fdbserver

* python doc

* Formatting

* Add tls disable plaintext connection to client api test

* review

* fix negative test

* formatting

* add TLS support to c client config tests

Adds support for TLS in the client and server separately

* add tests for disable_plaintext_connections

Test TLS and Plaintext Clusters and Clients

* Fix documentation

* Rename option to indicate it is client-only

* clearer formatting

* default to allowing plaintext connections

* add SetTLSDisablePlaintextConnection to go bindings
2023-05-13 00:14:11 +02:00
A.J. Beamon eacf817b2f Add metacluster code probes 2023-05-12 12:32:24 -07:00
Josh Slocum f82ea43198
copying headers into http request (#10227) 2023-05-11 20:18:12 -05:00
A.J. Beamon b15622c492 Fix formatting and unrelated windows build issue 2023-05-11 08:52:20 -07:00
neethuhaneesha 92d1da79a9
RocksDB WAL archive options. (#10211) 2023-05-10 21:36:18 -07:00
A.J. Beamon d8141c049d Add code probes for tenant code 2023-05-10 20:44:39 -07:00
Zhe Wang 8559d4f1a8
Adding cleanup of old audit metadata (#10137)
* clean up old audit metadata

* change comments

* fix audit cleanup rule as PR description claims and reduce timeout of auditStorageCorrectness in tester

* address comment

* clear audit metadata should not throw error

* cleanup progress metadata by type

* control number of AuditStatistic events

* carefully persist new audit state

* add unit tests and fix issues

* cleanup

* allow audit concurrent run for different types and fix some bug in auditutl

* fix ci issue and nits
2023-05-10 19:32:04 -07:00
Yao Xiao 995fba9254
Merge pull request #10152 from yao-xiao-github/main
Cherrypick multiple ShardedRocksDB improvements
2023-05-10 16:14:17 -07:00
Evan Tschannen 3dd86d6c22 move IKeyValueStore.h to the client 2023-05-10 15:41:47 -07:00
Yao Xiao 182d2cafbf Log physical shard size in KVS 2023-05-10 12:54:59 -07:00
Ata E Husain Bohra 18fd2702c4
EaR: Implement SimKmsVault interface, refactor SimKmsConnector (#10194)
Description

Patch implements a SimKmsVault interface allowing unittest/simulation
to satisfy encryption lookup usecases. It also refactors existing
SimKmsConnector to leverage SimKmsVault APIs

Testing

devRunCorrectness - 100K
/simKmsVault - asan & valgrind
EncryptionUnitTest
2023-05-10 12:44:53 -07:00
He Liu 66cd102821
Added `get_audit_status checkmigration` to print out the number of da… (#10188)
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`

* Fixed print format.

* Addressed comments.
2023-05-10 12:26:39 -07:00
Yao Xiao 2d1b5d02e2 Range deletion memory usage improvements (#10048) 2023-05-10 10:23:01 -07:00
Yao Xiao fa101e1e11 Log background error and add knobs for memory tuning. (#9841)
* error logger

* recovery mode
2023-05-10 10:23:01 -07:00
Yao Xiao fa821c0ed6 Cherrypick #9746 2023-05-10 10:23:01 -07:00
Yao Xiao abd45c4486 Cherrypick #9665 2023-05-10 10:23:01 -07:00
Josh Slocum 9a2365daa8
fixing bugs with tenant_mode required on external clients and changin… (#10183)
* fixing bugs with tenant_mode required on external clients and changing test to find them

* Update fdbcli/BlobKeyCommand.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-09 13:41:58 -05:00
Jay Zhuang 801a01bd38
Merge pull request #10159 from sfc-gh-jazhuang/redwood_test
Integrate the random key/value generator to Redwood test
2023-05-09 11:41:47 -07:00
Josh Slocum e69d54fbc0
Block unblobbify (#10182)
* strengthening check for not merging consecutive blob ranges

* implementing expanded unblobbify and changing tests to account
2023-05-09 11:43:11 -05:00
Josh Slocum 6be0c74d5b
Adding explicit blob range mutation log to handle large number of ranges (#10174)
* Adding explicit blob range mutation log to handle large number of ranges

* fixing ide build
2023-05-09 11:30:04 -05:00
Jay Zhuang 1c009bbd11 Update value size and maxCommitSize based on pageSize 2023-05-09 09:11:30 -07:00
Jay Zhuang 9f2f735d53 More random keys if there's fewer stringSet 2023-05-09 09:11:30 -07:00
Jay Zhuang 561db510e0 Add a helper class to contain StringSetGenerator and StringGenerator 2023-05-09 09:11:30 -07:00
Jay Zhuang fd680782b5 Integrate the random key/value generator 2023-05-09 09:11:30 -07:00
Evan Tschannen c8e8505101
buggified max_shards_on_large_teams (#10105)
* buggified max_shards_on_large_teams, and had the consistency scan verify the proper number of shards have been overreplicated

* fix: when restarting the data distributor, do not allow more than max_shards_on_large_teams shards to be marked as healthy
2023-05-08 16:56:42 -07:00
Hui Liu 53e68065e7
Support blob manifest backup for fdbbackup cmdline (#10091) 2023-05-08 16:07:22 -07:00
Ankita Kejriwal 63354f68ad
Update knob values for Storage Quota polling intervals (#10154) 2023-05-08 10:06:29 -07:00
Hui Liu 65ed7775fd
Add manifest encryption (#10081) 2023-05-05 14:33:37 -07:00
Jingyu Zhou b844a92c1e
Merge pull request #10143 from neethuhaneesha/paranoidChecks
Rocksdb paranoid file checks knob.
2023-05-05 10:23:06 -07:00
Josh Slocum a4dffa087a
Adding Simulated HTTP Server and refactoring HTTP code (#10112)
* Adding Simulated HTTP Server and refactoring HTTP code

* fixing formatting

* fixing merge conflicts

* fixing more merge conflicts

* code review feedback

* changing reference counted interface

* more fixes

* fixing ide build i guess
2023-05-05 12:19:17 -05:00
Steve Atherton fb2fc6a260
Merge pull request #10157 from sfc-gh-satherton/systemkey-overlap
Bug fix, check is supposed to be for overlap, not lack of overlap.
2023-05-04 21:24:12 -07:00
Jingyu Zhou 78434517ff Increase buggified STORAGE_METRICS_SHARD_LIMIT value
The previous buggified value 3 can be the same as key location size, thus
causing splitStorageMetrics() to get stuck.
2023-05-04 19:31:43 -07:00
Steve Atherton d52113e7a3 Bug fix, check is supposed to be for overlap, not lack of overlap. 2023-05-04 18:08:37 -07:00
Josh Slocum fb950a9c81
adding blob ranges to backup keys to not lose blobbification on restore (#10059) 2023-05-04 13:55:20 -05:00
neethuhaneesha 8b2f3bcfdc Rocksdb paranoid file checks knob. 2023-05-04 11:49:38 -07:00
A.J. Beamon 9d647f827c
Merge pull request #10129 from sfc-gh-ajbeamon/require-reliable-coordinator-quorum
Do not allow changing the coordinators to a set that is unreliable in simulation
2023-05-04 08:18:29 -07:00
Jay Zhuang d0cb599c7a Fix a gcc build error
```
RandomKeyValueUtils.cpp:64:106: error: call of overloaded 'RandomKeyTupleGenerator(<brace-enclosed initializer list>)' is ambiguous
```
2023-05-03 16:33:04 -07:00
Jay Zhuang a18bb10bcf Merge branch 'main' into random-kv-generator 2023-05-03 15:39:37 -07:00
Zhe Wang d254fba6e5
Adding cleanup of audit progress metadata when audit complete (#10118)
* cleanup audit progress metadata and tester directly issue audit requests to DD instead of CC

* address comments and fix test dd issue request but dd not present
2023-05-03 15:39:22 -07:00
A.J. Beamon ccf61ac2e5 Do not allow changing the coordinators to a set that is unreliable, because otherwise we could delete our coordinated state 2023-05-03 15:03:03 -07:00
Xiaoxi Wang 91de1c880e remove PrepareBlobRestore waiting for inFlight moving 2023-05-03 14:43:23 -07:00
Xiaoxi Wang d7c089fd13 add timeout to blob migrator getReply to tackle recovery during preparation 2023-05-03 14:43:23 -07:00
Josh Slocum 22155c84f4
adding logic to disable splitting within a truncated tuple, and validating it in test (#10106) 2023-05-03 10:23:46 -05:00
Zhe Wu fffdfa5b3d Increase MAX_STORAGE_COMMIT_TIME to be in line with LOW_PRIORITY_DURABILITY_LAG 2023-05-02 11:12:52 -07:00
Josh Slocum d0c412b5e6
fixing incorrect uses of ThreadSafeAsyncVar (#10086) 2023-05-02 07:29:06 -05:00
Xiaoxi Wang 5ea53a797e check storage metadata and storage server interface in the same transaction 2023-05-01 18:08:08 -07:00
Xiaoxi Wang 3a8bdcca3d add metadata check to quiescent consistency check 2023-05-01 18:08:08 -07:00
Xiaoxi Wang 3605d8c74c populate storage metadata for tss 2023-05-01 18:08:08 -07:00
A.J. Beamon 85f5e206a7
Merge pull request #10047 from sfc-gh-ajbeamon/add-metacluster-version
Add a metacluster version to the MetaclusterRegistrationEntry and validate it when loading the entry
2023-05-01 12:32:37 -07:00
A.J. Beamon b258159d3a Change enum capitalization. Improve error reporting if we cannot read metacluster registration when fetching metacluster metrics. Improve timeliness of metacluster metrics updates. 2023-05-01 11:21:42 -07:00