Commit Graph

7112 Commits

Author SHA1 Message Date
sfc-gh-tclinkenbeard 2385dd36f3 Update GLOBAL_TAG_THROTTLING_FOLDING_TIME default to 10.0 2023-05-26 16:10:37 -07:00
Zhe Wang 53675db306
Fix audit storage issue with multiple DDs (#10310)
* init

* add DDAuditContext

* move metadata update before runauditstorage

* revert DDAuditContext and replace ddAuditId with ddId

* cleanup
2023-05-26 15:56:03 -07:00
Vaidas Gasiunas 60753b5b57
Fix a couple thread-safety issues (#10359)
* Make CodeProbeImpl::_hitCount atomic

* Structure access to TraceLog::logTraceEventMetrics so that it is written before a trace log is opened and only read from one thread after it is opened.

* Fix condition in assert

* Rename TraceLog::log to logMetrics and move initialization of trace log metrics into TraceLog::open

---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-26 19:36:02 +02:00
sfc-gh-tclinkenbeard e724c90ffe Remove unnecessary GLOBAL_TAG_THROTTLING_MIN_TPS knob 2023-05-25 16:45:32 -07:00
sfc-gh-tclinkenbeard 71846070d6 Update default tag throttling knob values 2023-05-25 16:45:32 -07:00
Ankita Kejriwal 9373191e0a
Fix two bugs in checkExclusion() and add a trace event for better observability (#10330)
* Fix a division in checkExclusion() to be double and add a trace event

* Update the ssExcludedCount only if the role is storage
2023-05-24 10:58:03 -07:00
Jingyu Zhou 1712691da5
Merge pull request #10328 from sfc-gh-jslocum/knob_allow_relative_path_blob_container
adding knob to allow relative paths for local backup containers
2023-05-24 10:30:02 -07:00
Jingyu Zhou 13800ae1a8 Increase BW_RK_SIM_QUIESCE_DELAY to 400s
The blob worker needs more time to catch up, about 388s in the failed simulation
test.

Reproduction:
  seed: -f ./tests/slow/BlobGranuleVerifyLargeClean.toml -s 4068151139 -b on
  commit: 3bdd71cb0 at release-7.3 branch
  build: gcc
2023-05-23 15:54:56 -07:00
Josh Slocum 8f241632af adding knob to allow relative paths for local backup containers 2023-05-23 17:06:49 -05:00
Josh Slocum d038154d69
re-enabling change feed coalesce knob (#10317) 2023-05-23 14:43:11 -05:00
He Liu 8ad7ec6fdf
Psm ss (#9817)
* Update NativeAPI getCheckpointForRange().

* Implemented checkpoint in SS.

* clean up.

* Disabled StorageServerCheckpointTest.

* Serialized checkpoint creation and deletion.

Simplified checkpoint GC, via deleting CheckpointMetaData::dir.

* Fixed PhysicalShardMove test, where the fetchCheckpoint target range was set incorrectly.

* Minor improvements on CheckpointMetaData and DataMoveMetaData.

* fmt.

* Optimized PhysicalShardMove test

cleanup.

* Refactored ShardedRocks checkpoint/restore for psm.

* Complete ShardedRocks::restore.

* dismiss operation_obsolete, and throw actor_cancelled.

* Validate checkpoint when !asKeyValues.

* fmt.

* Don't read from uninitialized physical shard.

* Resolved comments.

* cleanup.

* Added verify_checksum_before_restore for ShardedRocks.

* Added ShardedRocksDB checkpoint/restore unit test.

* Populate CheckpointMetaData::dir in RocksDB.

* Rename MovingIn as Adding.

* Added StorageServerUtils.

* Added physical shard move in SS.

* Fix on ApplyMetaData, doFetchFile error handling etc.

* Debugging incorrect shard size.

* Create/delete checkpoints only when Physical shard move is enabled.

* Added back SHARD_ENCODE_LOCATION_METADATA.

* Fixed bytesSample incorrect issue.

Essentially dedicated CheckpointRocksDBCF as key-value based checkpoint, will need to add a new format for the file-based checkpoint.

* Cleanup.

* Cleanup & compile rocksdb with 8.1 branch.

* clean up.

* clean up.

* Allowed request_maybe_delivered error type in FetchShard.

* Added FDBRocksDBVersion.h.

* Fixed stuck fetchShard.

* Don't create checkpoint on TSS.

* Upgrade to RocksDB 8.1.1

* Cleanup.

* Fixed accidentally deleted db_path and name fields.

* Improved trace event.

* Removed redundancies from previous ShardedRocksDB.

* Cleanup.

* cleanup.

* cleanup.

* rename `state`.

* Cleanup.

* Removed excessive TraceEvent.

* Fixed shardMap race condition on different threads
* Added *Stats, logging data move rates.
* Added `DD_PHYSICAL_SHARD_MOVE_PROBABILITY` to support hybrid data move.

* Resolved comments.

* fmt.

* Use physical shard move in PhysicalShardMoveTest.

* Enforce physical-shard-move for PhysicalShardMoveTest.

* fmt
2023-05-23 11:18:35 -07:00
Xiaoxi Wang 969196d8ba Add read ops shard metrics notify bound 2023-05-23 09:46:34 -07:00
Josh Slocum 629b068145
Bg tenant metadata restarting (#10235)
* making blob metadata optionally deterministic across runs

* Non restarting test passes after refactor

* adding downgrade version test

* formatting
2023-05-23 11:24:13 -05:00
He Liu eaa934dac6
Added more logs about shard management. (#10303) 2023-05-22 18:00:00 -07:00
Yao Xiao bbf15be05f
Knobs to speed up DB open. (#10301) 2023-05-22 16:21:05 -07:00
Vaidas Gasiunas 9bc55f67c3
Fix releasing watches on future cancellation (#10304)
* Test watch cleanup on cancel

* Fix clearing the database in Java integration tests

* Always cancel the futures wrapped by MVC abortable futures

* More tests for watch cleanup

* Fix clearing the database in some Java integration tests
2023-05-22 22:01:27 +02:00
Zhe Wang 6c980862c3
Improve throughput of audit storage (#10245)
* improve audit throughput

* if ssshard fails to do audit due to ssi failure, then global retry is required

* fix a trace event name

* fix budget release in doAudit

* avoid throttling in general simulation tests

* fix doAuditOnStorageServer throw error

* avoid starting a task that has been completed

* when ddaudit ssshard failed, check if ssi is removed, if yes, silently exit

* fix trace detail name of AuditUtilStorageServerRemovedEnd event

* redo schedule in doAuditOnStorageServer

* schedule does not wait doAudit

* remove TESTING_AUDIT_STORAGE_THROTTLING

* ssaudit stops proceeding if ddauditstate is not in running phase

* make tester audit storage only happen in simulation, and randomly set CONCURRENT_AUDIT_TASK_COUNT_MAX
2023-05-22 12:09:08 -07:00
sfc-gh-tclinkenbeard 7ef66ab356 Add OutstandingWatches and WatchMapSize to TransactionMetrics 2023-05-22 12:07:10 -07:00
Ata E Husain Bohra 2b0a08dbe4
BlobMetadata: Move SimBlobMetadata store to SimKmsVault (#10269)
Description

Patch refactors SimKmsConnector to move SimBlobMetadata store to SimKmsVault

Testing

BlobGranuleCorrectness - 100K
/fdbserver/blob/connectionprovider - 100K
devRunCorrectness - 100K
2023-05-22 11:00:59 -07:00
Hui Liu 7ca13d8f9c
support blob restore in fdbrestore (#10248) 2023-05-19 14:45:14 -07:00
Zhe Wu 93ad70db38
Merge pull request #10263 from halfprice/zhewu/gc-generation-using-recoverat
GC earlier TLog generation using each generation's `recover at` version instead of `start version`
2023-05-19 12:07:02 -07:00
Jefferson Zhong 3760522dc2 Make stepSize configurable for preloadApplyMutationsKeyVersionMap 2023-05-19 10:57:30 -07:00
Yao Xiao cef93f7d22
knobs (#10253) 2023-05-18 14:58:09 -07:00
Josh Slocum 2916a11a86
New ConsistencyScan (#10265)
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.

* Initial progress checkpoint on new ConsistencyScan role.

* Updated TODOs, finished most if not all state updates.

* placeholder

* Add more TODOs, documentation and comment improvements.

* Checkpoint round state to avoid advancing progress if commit fails.

* Bug fix, check is supposed to be for overlap, not lack of overlap.

* Added more TODO's and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.

* Update JSON schemas and command help.

* Add comment about lifetime stats reset.

* More TODO comments and some renames for clarity, some bug fixes.

* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail

* removing trailing comma from consistency_scan json schema

* Making CC inconsistency not an error if it's intentional tss corruption

* consistency scan actually reads storage locations

* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck

* made consistency scan properly fetch database size

* refactoring data check to be used in both consistency scan and consistency check

* checking that consistency scan always completes at least one round and doesn't get stuck

* cleanup

* fixing ide build

* consistencyscan fdbcli command wasn't actually changing db state

* consistencyscan fdbcli command always said enabled even when it wasn't

---------

Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
2023-05-18 15:02:41 -05:00
Ata E Husain Bohra e25b9ff686
EaR: REST based Simulated KMS Vault request handler interface (#10240)
* EaR: REST based Simulated KMS Vault request handler interface

Description

  diff-1: Address review comments
             Improve unit test case coverage
  diff-2: Extend RESTKmsConnectorUtil to generate HTTP::Header

EaR simulation testing is currently driven using the SimKmsConnector
interface, which exposes endpoints directly invoked by EKP to fetch
encryption keys. This approach avoids testing the RESTKms communication
path. The FDB codebase was recently extended with an HTTPServer
interface, whose absence had been a gap prohibiting end-to-end testing
of EaR code.

Patch proposes the following changes:
1. Refactor RESTKmsConnector to move common code and definitions
to RESTKmsConnectorUtil namespace
2. Introduce RESTSimKmsVault accepting HTTP format requests and
providing appropriate HTTP response.

Testing

RESTUnit          100K + 5k valgrind
devRunCorrectness 100K

2023-05-17 12:38:09 -07:00
Zhe Wu 0bdfe1889b Add recovered at in CSTATE, and use a knob to guard the use of it 2023-05-16 12:47:00 -07:00
Josh Slocum 185e7d9f30
fixing BlobGranuleRequests to properly bump read version on retry (#10216) 2023-05-16 14:12:00 -05:00
Josh Slocum 3ea16ff579
Blob kms connector ids (#10121)
* blob metadata refactor to use location id and simplify rest api

* buggifying different ordering of locations in blob metadata response
2023-05-16 13:10:11 -05:00
neethuhaneesha 854464a6af
Hex values in TSS logs and rocksdb debuglogs mode knob (#10231) 2023-05-16 10:34:58 -07:00
Zhe Wang 852e012eb2
Adding throttling of audit storage tasks and tracing progress of tasks (#10233)
* when trigger doAuditOnStorageServer, check remainingBudgetForAuditTasks

* add trace event of audit progress

* address comments

* code clean up

* make dispatch and schedule audit be more clear

* make dispatch and schedule audit be more clear 2

* make dispatch and schedule audit be more clear 3

* address comments
2023-05-15 16:19:41 -07:00
Jingyu Zhou 9675f13ba9 Reduce STORAGE_FETCH_KEYS_DELAY to speed up data movement
The buggified value of 100s is too long and causes consistency check failures.
2023-05-15 13:56:08 -07:00
A.J. Beamon 712fefd59f
Merge pull request #10213 from sfc-gh-ajbeamon/tenant-code-probes
Add code probes for tenant and metacluster code
2023-05-15 12:13:00 -07:00
Sam Gwydir 6c16875c34
Add networkoption to disable non-TLS connections (#9984)
* Add networkoption to disable non-TLS connections

* add disable plaintext connection to fdbserver

* python doc

* Formatting

* Add tls disable plaintext connection to client api test

* review

* fix negative test

* formatting

* add TLS support to c client config tests

Adds support for TLS in the client and server separately

* add tests for disable_plaintext_connections

Test TLS and Plaintext Clusters and Clients

* Fix documentation

* Rename option to indicate it is client-only

* clearer formatting

* default to allowing plaintext connections

* add SetTLSDisablePlaintextConnection to go bindings
2023-05-13 00:14:11 +02:00
A.J. Beamon eacf817b2f Add metacluster code probes 2023-05-12 12:32:24 -07:00
Josh Slocum f82ea43198
copying headers into http request (#10227) 2023-05-11 20:18:12 -05:00
A.J. Beamon b15622c492 Fix formatting and unrelated windows build issue 2023-05-11 08:52:20 -07:00
neethuhaneesha 92d1da79a9
RocksDB WAL archive options. (#10211) 2023-05-10 21:36:18 -07:00
A.J. Beamon d8141c049d Add code probes for tenant code 2023-05-10 20:44:39 -07:00
Zhe Wang 8559d4f1a8
Adding cleanup of old audit metadata (#10137)
* clean up old audit metadata

* change comments

* fix audit cleanup rule as the PR description claims and reduce timeout of auditStorageCorrectness in tester

* address comment

* clear audit metadata should not throw error

* cleanup progress metadata by type

* control number of AuditStatistic events

* carefully persist new audit state

* add unit tests and fix issues

* cleanup

* allow concurrent audit runs for different types and fix some bugs in auditutil

* fix ci issue and nits
2023-05-10 19:32:04 -07:00
Yao Xiao 995fba9254
Merge pull request #10152 from yao-xiao-github/main
Cherrypick multiple ShardedRocksDB improvements
2023-05-10 16:14:17 -07:00
Yao Xiao 182d2cafbf Log physical shard size in KVS 2023-05-10 12:54:59 -07:00
Ata E Husain Bohra 18fd2702c4
EaR: Implement SimKmsVault interface, refactor SimKmsConnector (#10194)
Description

Patch implements a SimKmsVault interface allowing unittest/simulation
to satisfy encryption lookup use cases. It also refactors existing
SimKmsConnector to leverage SimKmsVault APIs

Testing

devRunCorrectness - 100K
/simKmsVault - asan & valgrind
EncryptionUnitTest
2023-05-10 12:44:53 -07:00
He Liu 66cd102821
Added `get_audit_status checkmigration` to print out the number of da… (#10188)
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`

* Fixed print format.

* Addressed comments.
2023-05-10 12:26:39 -07:00
Yao Xiao 2d1b5d02e2 Range deletion memory usage improvements (#10048) 2023-05-10 10:23:01 -07:00
Yao Xiao fa101e1e11 Log background error and add knobs for memory tuning. (#9841)
* error logger

* recovery mode
2023-05-10 10:23:01 -07:00
Yao Xiao fa821c0ed6 Cherrypick #9746 2023-05-10 10:23:01 -07:00
Yao Xiao abd45c4486 Cherrypick #9665 2023-05-10 10:23:01 -07:00
Josh Slocum 9a2365daa8
fixing bugs with tenant_mode required on external clients and changin… (#10183)
* fixing bugs with tenant_mode required on external clients and changing test to find them

* Update fdbcli/BlobKeyCommand.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-09 13:41:58 -05:00
Jay Zhuang 801a01bd38
Merge pull request #10159 from sfc-gh-jazhuang/redwood_test
Integrate the random key/value generator to Redwood test
2023-05-09 11:41:47 -07:00
Josh Slocum e69d54fbc0
Block unblobbify (#10182)
* strengthening check for not merging consecutive blob ranges

* implementing expanded unblobbify and changing tests to account
2023-05-09 11:43:11 -05:00