1423 Commits

Author SHA1 Message Date
Dimitris Apostolou a88114c222
Fix typos 2024-02-07 01:16:00 +02:00
Zhe Wang 970175a8a2
cherrypick storage queue aware getteam (#11154) 2024-01-30 15:15:18 -08:00
Zhe Wang ebb05f54c3
Add storage interface for checksum (#11144)
* add-storage-interface-for-check-sum

* address comments
2024-01-24 14:34:35 -08:00
Yao Xiao 2329e8327a
Add log cleaner for rocksdb logs. (#11134)
Co-authored-by: yaoxiao-github <yaoxiao@Yaos-MacBook-Pro-14.local>
2024-01-17 14:51:15 -08:00
Dan Lambright 0bfc99bf1f
ensure synthetic data is written to existing shards (#11128)
Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-01-16 10:22:07 -05:00
Dan Lambright 54ebcde97b
Fix bug in synthetic data creation (#11115)
Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-01-08 12:56:41 -05:00
Jingyu Zhou 75f7814ad1
Merge pull request #11112 from sfc-gh-jslocum/stuck_watch_fix_main
Stuck watch bug fix
2024-01-05 10:41:10 -08:00
Josh Slocum 611eb00fe1 stuck watch bug fix
* buggify watch version retry and fix multiple watch race after retry

* watch debugging improvements
2024-01-03 16:05:42 -06:00
Dan Lambright 86a2301faa updated per review comments 2024-01-03 12:21:54 -05:00
Dan Lambright 20882507f4 sanity checks, fix knob 2024-01-02 12:09:32 -05:00
Dan Lambright 857e38b80b bug fixes/cleanup 2023-12-21 16:39:22 -05:00
Dan Lambright 05571c59a9 Set tags on apply metadata mutations 2023-12-21 13:20:21 -05:00
Dan Lambright 2b4b4ae512 Synthesize data on SS based off parameters from new system transaction 2023-12-20 11:25:47 -05:00
Dan Lambright 5ebe8b0915 move data to value and parse it 2023-12-18 09:10:06 -05:00
Dan Lambright a20f9d3475 Interfaces to synthesize data 2023-12-13 15:19:17 -05:00
neethuhaneesha 4f167f50be
Adding field length to audit storage trace events. (#11079) 2023-12-01 15:45:17 -08:00
He Liu a8cdb7367c
Throttle low-priority fetchKeys. (#11083) 2023-11-30 21:27:52 -08:00
He Liu 422da7d7c7
Throttle fetch keys (#11060)
* Added ThroughputLimiter class.

* Throttle fetchKeys.

* cleanup.

* Increased fetchKeys rate limit for sim tests.

* Added unit.

* Resolved comments.
2023-11-16 10:30:40 -08:00
Zhe Wang 1e9c5bb390
Propagate data move reason from DD to SS (#11063)
* encode reason to data move id

* address comments

* fix data move id decode bug and add assert for data move decode invariant

* address comments
2023-11-15 13:07:11 -08:00
He Liu b8f1670a0e
Physical shard move tss (#11057)
* Refactored newDataMoveId() and decodeServerKeysValue().

* Enabled physical shard move for tss.

* Added unit test & cleanup.

* clean up test configs.
2023-11-13 11:34:07 -08:00
He Liu 29a10311d9
Speed up physical shard move (#11056)
* Allow applyUpdates in multiple batches.

* Persist lastAppliedVersion together with a batch of updates.

* Fixed repeatedly applying the same fetched mutation.
Apply updates at different versions.

* Fixed race between removeDataShard and isInVersionedData.

* Ignore mutations earlier than `lastAppliedVersion`.

* Buggify the batch limit for applying updates.

* Implemented async load of updates.

* Fixed out-of-order version issue.

* Cleanup.

* Batch commit MoveInUpdates.

* Avoid popping unpersisted updates.

* Increment MoveInUpdates::lastAppliedVersion in MoveInUpdates::next().
Fixed MoveInUpdates::hasNext().

* Fixed loadUpdates start version.

* cleanup.

* Cleanup.

* fmt

* Commit MoveInUpdates regardless of MoveInShard status.

* Enable move restore.

* Disabled move restore and fallBackToAddingShard from ingestion failure.

* Get rid of persisting MoveInShard states between phases.

* Make updateMoveShardMetadata synchronous.

* Recovered code deleted accidentally.

* Cleanup.

* cleanup.
2023-11-13 09:27:58 -08:00
He Liu 967a546e15
Optimize physical shard move (#10962)
* Allow applyUpdates in multiple batches.

* Persist lastAppliedVersion together with a batch of updates.

* Fixed repeatedly applying the same fetched mutation.
Apply updates at different versions.

* Fixed race between removeDataShard and isInVersionedData.

* Ignore mutations earlier than `lastAppliedVersion`.

* Buggify the batch limit for applying updates.

* Implemented async load of updates.

* Fixed out-of-order version issue.

* Cleanup.

* Batch commit MoveInUpdates.

* Avoid popping unpersisted updates.

* Increment MoveInUpdates::lastAppliedVersion in MoveInUpdates::next().
Fixed MoveInUpdates::hasNext().

* Fixed loadUpdates start version.

* cleanup.

* Cleanup.

* fmt

* Commit MoveInUpdates regardless of MoveInShard status.

* Enable move restore.

* Disabled move restore and fallBackToAddingShard from ingestion failure.

* Resolved comments.

* Fixed leaks from other prs.
2023-11-07 14:19:18 -08:00
Dan Lambright 015167c17e
Throttle commits against hot shards (#10970)
* throttle hot shards

* expire throttled shards over time

* add backoff

* Parallelize messaging from RK to CP

* Obtain shards from a single SS

* handle expired transactions

* bump transaction_throttled_hot_shard

* Change SevError to SevWarn for CannotMonitorHotShardForSS

* Add log per request
2023-10-31 12:01:34 -04:00
Zhe Wang b0569f8717
fix corner cases of auditStorageServerShardQ (#10980) 2023-10-13 09:48:24 -07:00
Zhe Wang 5767fed414
AuditStorage check all DC replica (#10955)
* add trace events when update audit metadata

* audit all DCs in replica

* fix corner case of audit replica

* fmt

* address comments
2023-10-06 14:30:21 -07:00
Yao Xiao 45494e3bba
Add knob for fetch keys budget. (#10963) 2023-10-06 13:06:34 -07:00
Zhe Wang 29a2f63f8d
Fix SSShard Audit (#10896)
* fix ssshard

* address comments

* fmt
2023-09-13 21:15:12 -07:00
Hui Liu 4d2a7d507d
Add a new blob restore state to fix a race after data copy (#10854) 2023-09-05 14:04:35 -07:00
Yi Wu 8d7f2e84ed
Merge pull request #10831 from sfc-gh-yiwu/ear_timeout
EaR: Handle KMS timeout in storage server and commit proxy
2023-08-28 20:59:22 -07:00
Zhe Wang 432c077b51
fix dd issue when dd skip audit (#10844) 2023-08-28 16:39:45 -07:00
Yi Wu 3287098b4a EaR: Handle KMS timeout in storage server and commit proxy 2023-08-28 16:17:43 -07:00
Zhe Wang f43b20e15c
Audit location metadata in DD (#10820)
* Audit location metadata in DD

* nits
2023-08-25 17:11:11 -07:00
Zhe Wang f8311ae069
Add more trace event for TSS recruitment (#10809)
* add more trace event for tss

* update StorageServerInitProgress

* add more traces
2023-08-23 09:19:30 -07:00
Zhe Wang 83dc9ff6f7
Trace SS init progress (#10799)
* trace ss init progress

* improve trace events
2023-08-18 18:44:37 -07:00
Hui Liu aea6fa5ca6
Set BLOB_RESTORE_SKIP_EMPTY_RANGES default value to false (#10784) 2023-08-16 10:02:06 -07:00
Zhe Wang f1c17b27fc
Multiple improvements to AuditStorages (#10685)
* remove danger DDAudit assert, add AuditRate knob, add progress check when ssshard complete, add progress check for ssshard in fdbcli

* throttle progress check for ssshard

* fix getAuditProgressByServer

* fix trace event for ss audit

* using name -- checkMoveKeysLockForAudit

* new scheduleAuditLocationMetadata

* address comments

* shorten progress summary for ssshard

* simplify getAuditProgressByServer in fdbcli
2023-08-14 13:13:49 -07:00
He Liu df848005f8
Allow applyUpdates in multiple batches. (#10583) 2023-08-09 14:04:36 -07:00
Zhe Wang d0742c79ac
Improving visibility to debug sharded rocksdb (#10694)
* logging storage commit stats

* add rocks flush and compaction listener

* remove used field in FlushStats and fix CI error

* reduce LOGGING_ROCKSDB_BG_WORK_PROBABILITY

* merge rocks event listeners

* avoid using mutex/spinloop in rocksdb event listener

* code clean

* fix OnCompactionBegin and OnFlushBegin

* add logReason to RecentRocksDBBackgroundWorkStats

* add error listener back
2023-07-31 14:45:26 -07:00
Zhe Wang 3426fc3c1a
Add DD Security Mode (#10646)
* dd-security-mode

* address comments

* cleanup

* revise tr option set in loadAndUpdateAuditMetadataWithNewDDId

* address comments

* reset auditStorageInitStarted before DD init

* decouple audit resume and audit launch

* audit launch new request should wait for resuming existing requests

* address comment/clean up/fix

* fix

* fix initAuditMetadata retry

* fix initAuditMetadata retry should reset tr
2023-07-21 17:06:25 -07:00
Yao Xiao 70a7908fc9 Fix bytesPerCommit histogram. 2023-07-19 15:54:21 -07:00
Hui Liu 7c8c24bc8d
blob restore : Log and skip data copy if we miss data for a certain tenant (#10621) 2023-07-19 09:52:30 -07:00
Zhe Wang 63d387eb0b
Add complete check for location metadata by audit storage (#10636)
* cleanup traceevent

* add complete check

* fix and cleanup

* nit

* code cleanup

* code cleanup

* increase audit retry count

* revise comments and no code changes
2023-07-19 09:40:58 -07:00
Zhe Wang 522c9d4f0f
Add new implementation of audit storage for user data (#10613)
* remainingBudgetForAuditTasks should be managed within audit

* fix CI

* add audit storage test for various ranges

* clean DD

* new auditStorageUserDataQ

* fix assert fail in startTrackShardAssignment

* fix assert fail in ssaudit

* address comments

* replace assert with audit_cancel in ss audits

* add audit check progress tool

* add observability to audit progress and fix audit bugs

* fix audit progress issues and add sim test for audit progress and add trace event for the audit progress and add fdbcli to track the audit progress

* remove old audit storage on SS

* check audit progress when auditCore completes
2023-07-16 09:56:26 -07:00
Nim Wijetunga 7f2260bbd2
Add Encryption Related Latency Metrics (#10596)
* add ss and cp latency metrics

* make changes
2023-07-14 11:30:16 -07:00
Hui Liu 66a7acd960
Fix blob restore stuck issue (#10574) 2023-06-28 10:23:11 -07:00
He Liu 6337125712
Several minor improvements for ShardedRocksDB (#10520)
* Terminate DD if SHARD_ENCODE_LOCATION_METADATA is not enabled and storage_engine_type is ShardedRocksDB.

* Fixed Error in non-main thread.

* Minor improvements.
2023-06-24 16:07:14 -07:00
Zhe Wang 37689af3f2
Detect inconsistency of KeyServers and ServerKeys in real time (#10484)
* add framework

* add audit logic

* refactor audit loc metadata

* address comments

* add realtime audit timeout, add post validation logic

* fix input empty range to compareKeyServersAndServerKeys

* add context for auditKeyServersAndServerKeysInRealTime

* focus on moveShard

* remove space

* address comments

* cleanup

* add audit cleanup

* make validateRangeAssignment simple

* change trace name

* add shardAssigned

* stop DD when inconsistency detected

* fix ci

* small fix

* revert ss and auditUtl and simplify rt audit

* cleanup ss

* tiny change

* address comments and refactor code

* make auditLocationMetadataPreCheck retriable

* handle actor cancel in auditLocationMetadataPreCheck

* rm timeout and add new protection for failure of audit

* fix bugs

* import dataMoveId to validation

* improve trace event

* carefully propagate error and stop DD

* tiny fix

* small change

* remove a state var

* nit

* clean comments

* fmt
2023-06-23 17:40:21 -07:00
Evan Tschannen ef682d304e fix IKeyValueStore include 2023-06-16 13:28:40 -07:00
Josh Slocum 10c16dec41
fix retransmits in corruption check (#10491) 2023-06-14 10:41:06 -05:00
Evan Tschannen 359e178dcd Merge branch 'main' into feature-durable-change-feed
# Conflicts:
#	fdbclient/ClientKnobs.cpp
#	fdbserver/BlobManager.actor.cpp
#	fdbserver/worker.actor.cpp
2023-06-11 13:58:35 -07:00