3691 Commits

Author SHA1 Message Date
Zhe Wang 770ffb7419
[Release-7.3] TLS should accept same key with different values (#11763)
* fix tls

* address comment
2024-11-15 00:16:16 -08:00
flowguru 1edb499428
Fix backup dryrun bug
* Fix backup dryrun bug

Currently there is an out-of-scope issue; this change also adds
a knob to control whether to allow dryrun of backup

* fix another bug that misses a wait statement

---------

Co-authored-by: Hao <fdbflowguru@gmail.com>
2024-11-14 09:46:11 -08:00
Syed Paymaan Raza c3979112cb
Urgent consistency checker fixes (cherrypick 7.3) (#11736)
* [fdbserver] Drop duplicate or conflicted requests from urgent consistency checker clients

* Fix edge case in urgent consistency check causing infinite loop

* fixup! Fix edge case in urgent consistency check causing infinite loop

self review
2024-10-26 09:50:51 -07:00
Vishesh Yadav 2b23111720
[release-7.3] Log all incoming connections (#11713)
* Log all incoming connections

* Address review comments

* Update FlowTransport.actor.cpp

* Update FlowTransport.actor.cpp

* Refactor

* Format

* initialize for simulation
2024-10-16 20:32:32 -07:00
Hao Fu 525183012a
Retry with dryrun in the presence of s3 token error(release-7.3) (#11602)
* Retry with dryrun in the presence of s3 token error

The s3 token is read from local disk and might be expired or invalid.
Before this change, backup retried uploading data to s3 indefinitely,
which wasted network bandwidth.

Now, on an s3 token error, retry with a GET request that lists all buckets,
and only retry the upload once the token error disappears.

* Finish testing, set default to false

* Check bucket exist or not, rather than listBucket

* address comments
2024-09-06 11:47:33 -07:00
Xiaoge Su 08554e4f57
Add PeerAddress to all PeerAddr/Peer TraceEvent [release-7.3] (#11521)
* Add PeerAddress to all PeerAddr/Peer TraceEvent

This is to address #4846

* fixup!

* Decorate TLS handshake errors with peerAddr (#10090)

* Use connection debug ID in N2_AcceptHandshakeError

* Decorate TLS handshake errors with peerIP

* only write one value to ostream

* Add PeerAddress to all PeerAddr/Peer TraceEvent

This is to address #4846

---------

Co-authored-by: Sam Gwydir <sam.gwydir@snowflake.com>
2024-07-23 17:31:19 -07:00
Yao Xiao 83ae9ac129
[release-7.3] Wait for TSS during finishMoveShards. (#11485)
* Fix wait

* Fix wait
2024-07-22 23:41:00 -07:00
Zhe Wang 6d93c4dcac
[Release-7.3] Cherrypick Consistency Check Urgent (#11228)
* Consistency Check Urgent (Cherrypick from Release-7.1) (#11217)

* cherry-pick-distributed-consistency-checker

* code cleanup

* refactor code, decouple consistencyCheckerUrgent and consistency checker

* fix workload for consistencycheckurgent

* add new consistencycheckurgent role type

* fix CI

* address comments

* fix-consistencycheckurgent-large-read (#11229)
2024-03-05 10:45:15 -08:00
Jingyu Zhou 9e54cd0b5f
Merge pull request #11198 from johscheuer/abort-fdb-if-non-zero-exit-7.3
[RELEASE-7.3] Abort process when abnormal shutdown is initiated to allow coredumps to be generated
2024-02-14 15:36:37 -08:00
He Liu 11d8a5ef92
Cherry-pick to 7.3: Added checksum in MutationRef (#11181) (#11193) 2024-02-14 14:48:53 -08:00
Johannes M. Scheuermann fc59b9cbf6 Add knob to allow fdbserver to abort under abnormal behaviour 2024-02-14 10:17:32 +01:00
Johannes Scheuermann 2657754a53 Update link to abort function 2024-02-14 10:17:30 +01:00
Johannes M. Scheuermann 4f0361419b Abort process when abnormal shutdown is initiated to allow coredumps to be generated 2024-02-14 10:17:28 +01:00
Johannes M. Scheuermann e194d2bcaf Fix compile issue for ALLOC_INSTRUMENTATION 2024-02-01 14:46:52 +01:00
yaoxiao-github c8d092d067 Add log cleaner for rocksdb logs. 2024-01-17 14:55:55 -08:00
Jingyu Zhou d4eea8f048 Fix picking of IPv4 addresses 2023-10-18 08:24:09 +02:00
Jingyu Zhou 84a68f94ab Add a knob RESOLVE_PREFER_IPV4_ADDR to prefer IPv4 addresses
The default is to prefer IPv6 addresses.
2023-10-18 08:24:01 +02:00
Andrew Noyes 6fc73483a5 Implement Event using std::latch (#10705) 2023-09-25 11:14:27 -07:00
Zhe Wang 57e2f08c50
[Release-7.3] Cherry-pick audit storage optimizations (#10840)
* Multiple improvements to AuditStorages (#10685)

* remove danger DDAudit assert, add AuditRate knob, add progress check when ssshard complete, add progress check for ssshard in fdbcli

* throttle progress check for ssshard

* fix getAuditProgressByServer

* fix trace event for ss audit

* using name -- checkMoveKeysLockForAudit

* new scheduleAuditLocationMetadata

* address comments

* shorten progress summary for ssshard

* simplify getAuditProgressByServer in fdbcli

* Audit storage for specific engine (#10781)

* audit storage for specific engine

* fix getStorageType

* fix budget of skipAuditOnRange

* fix budget in scheduleAuditOnRange

* fix CI error

* improve trace events

* address comments

* Audit location metadata in DD (#10820)

* Audit location metadata in DD

* nits

* Fix auditStorage: Audit task should not retry if the task is issued by an outdated DD (copy from main PR 10844)
2023-08-29 11:09:28 -07:00
Evan Tschannen 2ff2b2cf38 added a blob worker specific page cache size for redwood so that it does not have to be changed manually in fdb.conf for all blob worker processes 2023-08-07 17:20:04 -07:00
Evan Tschannen 6ccc6c9b5b addressed review comments 2023-08-07 17:20:04 -07:00
Nim Wijetunga 7e14bd3389 add kms and ekp status to json 2023-08-07 14:59:58 -07:00
Josh Slocum 6f044dc339 reducing frequency of EKC latency logging to the standard for latencies in fdb 2023-07-19 14:05:00 -07:00
A.J. Beamon 0d9c581bd1 Rename TraceLog::log to logMetrics and move initialization of trace log metrics into TraceLog::open 2023-07-19 14:05:00 -07:00
A.J. Beamon fa70b885fe Fix condition in assert 2023-07-19 14:05:00 -07:00
A.J. Beamon a73227f7c1 Structure access to TraceLog::logTraceEventMetrics so that it is written before a trace log is opened and only read from one thread after it is opened. 2023-07-19 14:05:00 -07:00
A.J. Beamon 20915b749f Make CodeProbeImpl::_hitCount atomic 2023-07-19 14:05:00 -07:00
Zhe Wang cc4781a4ec Detect inconsistency of KeyServers and ServerKeys in real time (#10484)
* add framework

* add audit logic

* refactor audit loc metadata

* address comments

* add realtime audit timeout, add post validation logic

* fix input empty range to compareKeyServersAndServerKeys

* add context for auditKeyServersAndServerKeysInRealTime

* focus on moveShard

* remove space

* address comments

* cleanup

* add audit cleanup

* make validateRangeAssignment simple

* change trace name

* add shardAssigned

* stop DD when inconsistency detected

* fix ci

* small fix

* revert ss and auditUtl and simplify rt audit

* cleanup ss

* tiny change

* address comments and refactor code

* make auditLocationMetadataPreCheck retriable

* handle actor cancel in auditLocationMetadataPreCheck

* rm timeout and add new protection for failure of audit

* fix bugs

* import dataMoveId to validation

* improve trace event

* carefully propagate error and stop DD

* tiny fix

* small change

* remove a state var

* nit

* clean comments

* fmt
2023-07-06 10:00:31 -07:00
Xiaoge Su 32a60e9a29
[release-7.3] Report missing attributes when constructing status (#10524)
* Report missing attribute in StorageServer status

* fixup! reformat source

* fixup! Reformat source

* Retrigger CI
2023-06-21 14:02:43 -07:00
Zhe Wang aa47c0c722
[Release 7.3] Cherry-pick Add audit storage cancellation (#10386) (#10430)
* Add audit storage cancellation (#10386)

* list audits

* cancel audits and corresponding tests

* make audit storage dblock aware

* increase audit retry since we are able to cancel

* fix updateAuditState and fdb github ci

* fmt

* fix fdbcli audit_storage and fix CI issue

* fix fdb cli

* address comments

* fmt

* Fix audit storage actor cancel issue (#10443)

* init

* add testAuditStorageConcurrentRunForSameType test

* init (#10458)
2023-06-09 10:43:54 -07:00
He Liu f408c1bb08
Cherry pick psm and several other PRs (#10405)
* Psm ss (#9817)

* Update NativeAPI getCheckpointForRange().

* Implemented checkpoint in SS.

* clean up.

* Disabled StorageServerCheckpointTest.

* Serialized checkpoint creation and deletion.

Simplified checkpoint GC, via deleting CheckpointMetaData::dir.

* Fixed PhysicalShardMove test, where the fetchCheckpoint target range was set incorrectly.

* Minor improvements on CheckpointMetaData and DataMoveMetaData.

* fmt.

* Optimized PhysicalShardMove test

cleanup.

* Refactored ShardedRocks checkpoint/restore for psm.

* Complete ShardedRocks::restore.

* dismiss operation_obsolete, and throw actor_cancelled.

* Validate checkpoint when !asKeyValues.

* fmt.

* Don't read from uninitialized physical shard.

* Resolved comments.

* cleanup.

* Added verify_checksum_before_restore for ShardedRocks.

* Added ShardedRocksDB checkpoint/restore unit test.

* Populate CheckpointMetaData::dir in RocksDB.

* Rename MovingIn as Adding.

* Added StorageServerUtils.

* Added physical shard move in SS.

* Fix on ApplyMetaData, doFetchFile error handling etc.

* Debugging incorrect shard size.

* Create/delete checkpoints only when Physical shard move is enabled.

* Added back SHARD_ENCODE_LOCATION_METADATA.

* Fixed bytesSample incorrect issue.

Essentially dedicated CheckpointRocksDBCF as key-value based checkpoint, will need to add a new format for the file-based checkpoint.

* Cleanup.

* Cleanup & compile rocksdb with 8.1 branch.

* clean up.

* clean up.

* Allowed request_maybe_delivered error type in FetchShard.

* Added FDBRocksDBVersion.h.

* Fixed stuck fetchShard.

* Don't create checkpoint on TSS.

* Upgrade to RocksDB 8.1.1

* Cleanup.

* Fixed accidentally deleted db_path and name fields.

* Improved trace event.

* Removed redundants from previous ShardedrocksDB.

* Cleanup.

* cleanup.

* cleanup.

* rename `state`.

* Cleanup.

* Removed excessive TraceEvent.

* * Fixed shardMap race condition on different threads
* Added *Stats, logging data move rates.
* Added `DD_PHYSICAL_SHARD_MOVE_PROBABILITY` to support hybrid data move.

* Resolved comments.

* fmt.

* Use physical shard move in PhysicalShardMoveTest.

* Enforce physical-shard-move for PhysicalShardMoveTest.

* fmt

* Reverted unintended changes.

* Added more logs about shard management. #10303

* Removed ENABLE_DD_PHYSICAL_SHARD_MOVE #10324

* Delete a data move if key range is not consistent. #10334
Disable physical shard move by default. #10335
2023-06-06 13:29:24 -07:00
Zhe Wu 22a79b0b7a Update FDB_AV_LATEST_BINDINGS_VERSION to 7.3 2023-05-23 13:17:09 -07:00
Zhe Wu b75da0dda0 Add recovered at in CSTATE, and use a knob to guard the use of it 2023-05-22 09:57:13 -07:00
Josh Slocum ff0c61aaf0 Adding BlobFailureInjection workload 2023-05-19 13:12:57 -07:00
Josh Slocum 8d504f3d2f Adding explicit blob range mutation log to handle large number of ranges 2023-05-16 18:25:09 -07:00
Josh Slocum d9c6ed9bb5 Passes existing tests 2023-05-16 17:28:11 -07:00
Josh Slocum e13fcf3e5e Adding Simulated HTTP Server and refactoring HTTP code 2023-05-16 16:32:34 -07:00
He Liu a8cc5fdde0
Merge branch 'release-7.3' into cherry-pick-audit-storage 2023-05-11 10:41:43 -07:00
Zhe Wang 0d8406dde3 Adding cleanup of old audit metadata (#10137)
* clean up old audit metadata

* change comments

* fix audit cleanup rule as the PR description claims and reduce timeout of auditStorageCorrectness in tester

* address comment

* clear audit metadata should not throw error

* cleanup progress metadata by type

* control number of AuditStatistic events

* carefully persist new audit state

* add unit tests and fix issues

* cleanup

* allow audit concurrent run for different types and fix some bug in auditutl

* fix ci issue and nits
2023-05-10 21:18:21 -07:00
Zhe Wu 2158c985fd Update 7.3 branch to version 7.3 2023-05-10 13:59:36 -07:00
Xiaoge Su 2f70eb12f5 fixup! Cherry pick the change of genericactors.actor.h:store in #9991 2023-05-03 16:29:18 -07:00
Steve Atherton dd653064dc Address review comments. KeyRangeMapSnapshot is now ReferenceCounted and getSnapshot() returns a Reference to discourage copying. Added several comments for clarity. Added FormatUsingTraceable and changed all new formatters to use it except for Standalone<T> which redirects to the formatter for T. 2023-05-03 15:52:35 -07:00
Steve Atherton 5c413e48f3 Added `transaction_option_setter<DB>` to determine if a DB-like thing has a `->setOptions(tr)` method. This method is called in `runTransaction()` templates at the top of the retry loop and in the manual retry loops in KeyBackedTypes. Added `if constexpr(` support to the ActorCompiler to support this. 2023-05-03 15:52:09 -07:00
Steve Atherton 32bf39ef8b DataDistributor will restart if DDConfiguration changes. 2023-05-03 15:29:13 -07:00
Steve Atherton 8d35aa113d Changed KeyBackedTypes to an actor file. Added TypedKeySelectors for Map and Set classes and getRange() keySelector methods. Added debug macro for KeyBackedTypes. Rewrote KeyBackedRangeMap using keyselectors on KeyBackedMap. 2023-05-03 15:27:57 -07:00
Steve Atherton 7b88d4368b DDConfiguration class for modeling user specified key range configuration options. Added KeyBackedRangeMapSnapshot, some other supporting changes to KeyBackedTypes. Added invalidKey to give KeyBackedTypes a safe prefix to avoid accidental userspace modification from uninitialized accessors. 2023-05-03 15:27:57 -07:00
Chaoguang Lin bbc6c6cc8f Fix for correctness failures when issuing duplicate requests
Add comments; Disable failure injection in Snap test
2023-05-03 14:43:50 -07:00
Junhyun Shim 98bd49c676 Apply review suggestions 2023-05-02 17:38:51 -07:00
Junhyun Shim 4b3d8f43da Extend WipedString guarantees to serialized packets 2023-05-02 17:38:51 -07:00
Ata E Husain Bohra 37e0a43106 Address review comments and fix compilation issue
2023-05-01 19:24:04 -07:00