26439 Commits

Author SHA1 Message Date
Hui Liu 788f3420c8
Fix tenant map update race when applying mutations (#10557) 2023-06-27 10:31:58 -07:00
Jingyu Zhou 321b77d809 Add 7.3 protocol version to transaction_profiling_analyzer.py
Manually tested that the tool works.
2023-06-27 10:19:10 -07:00
Yi Wu 92fc089eb0
FDBCORE-5211: Fix worker server handleIOErrors heap-use-after-free (#380) (#10555) 2023-06-26 16:35:33 -07:00
Jingyu Zhou e5728fcb84
Fix the version for deprecated keys (#10553) 2023-06-26 13:49:41 -07:00
A.J. Beamon da382372f0
Merge pull request #10550 from sfc-gh-ajbeamon/speedup-backup-correctness-clean
Speed up BackupCorrectnessClean
2023-06-26 09:48:18 -07:00
A.J. Beamon ecf8a5b3f6
Merge pull request #10542 from sfc-gh-ajbeamon/reduce-mutation-log-reader-correctness-duration
Decrease the number of records in MutationLogReaderCorrectness
2023-06-26 08:55:34 -07:00
He Liu 6337125712
Several minor improvements for ShardedRocksDB (#10520)
* Terminate DD if SHARD_ENCODE_LOCATION_METADATA is not enabled and storage_engine_type is ShardedRocksDB.

* Fixed Error in non-main thread.

* Minor improvements.
2023-06-24 16:07:14 -07:00
Zhe Wang 37689af3f2
Detect inconsistency of KeyServers and ServerKeys in real time (#10484)
* add framework

* add audit logic

* refactor audit loc metadata

* address comments

* add realtime audit timeout, add post validation logic

* fix input empty range to compareKeyServersAndServerKeys

* add context for auditKeyServersAndServerKeysInRealTime

* focus on moveShard

* remove space

* address comments

* cleanup

* add audit cleanup

* make validateRangeAssignment simple

* change trace name

* add shardAssigned

* stop DD when inconsistency detected

* fix ci

* small fix

* revert ss and auditUtl and simplify rt audit

* cleanup ss

* tiny change

* address comments and refactor code

* make auditLocationMetadataPreCheck retriable

* handle actor cancel in auditLocationMetadataPreCheck

* rm timeout and add new protection for failure of audit

* fix bugs

* import dataMoveId to validation

* improve trace event

* carefully propagate error and stop DD

* tiny fix

* small change

* remove a state var

* nit

* clean comments

* fmt
2023-06-23 17:40:21 -07:00
A.J. Beamon 155c03f6fe Decrease transaction rate of backup correctness clean to speed it up 2023-06-23 16:14:03 -07:00
Lukas Joswiak 8b8da372d5 Fix issue with inconsistent coordinator disk queue
An inconsistent coordinator disk queue could cause repeated reboots of
the coordinator process. In simulation, this could cause stuck runs if
another role was running on the coordinator process, as the other role
would also repeatedly reboot (specifically stateful roles like tlogs).

See the comment included in this commit for much more detail.
2023-06-23 15:37:31 -07:00
A.J. Beamon 09a0a60c07
Merge pull request #10546 from sfc-gh-ajbeamon/speedup-dr-upgrade
Speed up DR upgrade tests
2023-06-23 12:00:42 -07:00
A.J. Beamon 1af1346cde Change a buggify and decrease the number of transactions per second so that the dr upgrade test runs faster 2023-06-23 09:03:34 -07:00
A.J. Beamon daf20abeee Decrease the number of records in MutationLogReaderCorrectness to decrease its duration 2023-06-22 13:37:35 -07:00
Xiaoxi Wang 7d3cc86860
Check serverList before updating storage metadata (#10540) 2023-06-22 12:02:09 -07:00
Josh Slocum 356d0030a4
fixed consistency scan corruption check when worker is hard-killed in the middle, and improved debugging for stuck consistency scan (#10516) 2023-06-22 11:54:08 -05:00
A.J. Beamon b5497bdc3e
Merge pull request #10529 from sfc-gh-ajbeamon/test-metacluster-use-after-restore
Test use of the metacluster after a restore
2023-06-21 18:50:18 -07:00
Ata E Husain Bohra 370e39fbdb
Update SimKmsVault unit test assert checking max encryption keys (#10535)
Description

When the SimKmsVault unit test runs as part of the simulation Random test,
SimKmsVaultKeyCtx may, depending on the test order, already have been
initialized by some other test (FlowSingleton).
Update the test to handle this scenario.

Testing

devRunCorrectness - 100K
2023-06-21 17:17:21 -07:00
Zhe Wu 8c796e6ff9
Merge pull request #10534 from halfprice/zhewu/gray-failure-recovery-cc
Cluster controller remove recovered peer in gray failure
2023-06-21 14:54:26 -07:00
Zhe Wu b89a50a37d Cluster controller remove recovered peer in gray failure 2023-06-21 13:02:59 -07:00
A.J. Beamon a2302e45ec
Merge pull request #10522 from sfc-gh-ajbeamon/fix-mapped-range-assert
Fix check in getExactRange that determines whether we can return early
2023-06-21 10:24:52 -07:00
Zhe Wu d1bafa7514
Merge pull request #10528 from halfprice/zhewu/nit-fixing
Nit fixing
2023-06-21 10:21:47 -07:00
A.J. Beamon 728b847bd6
Merge pull request #10526 from sfc-gh-yajin/mark-cp-rare-1
SNOW-XXXXXX Mark a rare code probe as rare
2023-06-21 09:41:29 -07:00
Zhe Wu 36befca2ba Nit fixing 2023-06-21 09:25:51 -07:00
A.J. Beamon cc68320333 Add testing that a metacluster can be used after a restore. Fix some bugs found by this related to the restore ID and tenant tombstones. 2023-06-21 08:59:22 -07:00
Zhe Wu aac6d95182
Merge pull request #10525 from halfprice/zhewu/turn-on-batch-empty-commit-by-default
Turn on batching empty peek by default
2023-06-21 07:57:17 -07:00
Zhe Wu f3e95f6698
Merge pull request #10521 from halfprice/zhewu/gray-failure-recovery
Worker health monitor tracks recovered peers
2023-06-21 07:56:55 -07:00
Yanqin Jin 75ab773819 SNOW-XXXXXX Mark a rare code probe as rare (#415) 2023-06-20 21:56:54 -07:00
Zhe Wu 514561e167 Turn on batching empty peek by default 2023-06-20 20:24:20 -07:00
Zhe Wu 8eb526684a
Merge pull request #10437 from halfprice/zhewu/one-shard-per-physical-shard
Add an option to limit the number of shards per team
2023-06-20 19:43:31 -07:00
Xiaoge Su c5f09b0921 fixup! Reformat source 2023-06-20 18:07:05 -07:00
Xiaoge Su de7db7ba14 fixup! reformat source 2023-06-20 18:07:05 -07:00
Xiaoge Su 63dde188dd Report missing attribute in StorageServer status 2023-06-20 18:07:05 -07:00
A.J. Beamon 75ec56bffb When redoing a key location request, wait until after we've checked whether we've satisfied our min rows 2023-06-20 16:02:12 -07:00
Zhe Wu 31de83ccc1 cleanup gray failure recovery documentation 2023-06-20 13:30:03 -07:00
Zhe Wu 4741d7032f Revert logic to test testing logic 2023-06-20 12:40:22 -07:00
Zhe Wu ef638cd6a6 Worker health monitor keeps track of recovered peer 2023-06-20 12:32:29 -07:00
Evan Tschannen 902d87a72c
Merge pull request #10515 from sfc-gh-etschannen/fix-ikeyvaluestore
fix IKeyValueStore include
2023-06-20 06:28:23 -07:00
Hui Liu af20493ad0
Move lastFlushTs to BlobGranuleBackupConfig (#10505) 2023-06-16 16:12:10 -07:00
Jefferson Zhong 13853c9f89 Move stepSize knob from ClientKnobs to ServerKnobs 2023-06-16 14:48:11 -07:00
Evan Tschannen 60cbc7ec8d fix formatting 2023-06-16 13:43:25 -07:00
Evan Tschannen ef682d304e fix IKeyValueStore include 2023-06-16 13:28:40 -07:00
A.J. Beamon ac30b3c84f
Merge pull request #10507 from sfc-gh-yajin/reset-tenant-state
Support restoring a cluster with a tenant in the error state
2023-06-16 09:35:03 -07:00
Jingyu Zhou 459172499a
Merge pull request #10508 from sfc-gh-jslocum/fix_empty_bg_flush
fix flushing empty range
2023-06-16 08:36:15 -07:00
Josh Slocum a7ba6fde3c fix flushing empty range 2023-06-16 09:26:08 -05:00
Yanqin Jin 626a8a1a5f SNOW-804199 Support restoring a cluster with a tenant in the error state (#357)
If we restore a cluster and a previously created tenant was not included in the backup, then the tenant will be marked in an error state on the management cluster. It is then up to the operator to resolve the error, generally by deleting the tenant and recreating it if needed.

There is, however, the possibility that we restored a backup that was older than we wanted, and a newer backup would have the tenant. If we tried to restore the newer backup, it would not leave the previously missing tenant in a fully usable state.

We need to have a way to deal with this case. One option is to allow us to clear the error state of a tenant, and that can be performed before (or maybe even after) the second restore.

Test plan:
Joshua test
100K ensemble: 20230613-225414-yajin-439d13ef3c6b3afd fail=0
2023-06-15 22:23:46 -07:00
Jay Zhuang 5130fe99d3
Fix RandomStringSetGenerator possibly not generating enough unique keys (#10503)
For small key sets, RandomStringSetGenerator may not try enough random
keys to generate a unique key set.
Increase the retry count for smaller key sets.
2023-06-15 17:23:13 -07:00
Josh Slocum 6fb225c68b
Reset connection idle time when restarting connection monitor (#10495) 2023-06-15 13:37:00 -05:00
Josh Slocum 367088b831
made blob metadata load lazily from EKP (#10463) 2023-06-15 13:36:52 -05:00
Josh Slocum 01ce1ab24d
fix consistency scan ubsan issue and replication factor calculation (#10492) 2023-06-15 08:10:49 -05:00
Zhe Wu 5c8a163c72
Update main branch to 7.4 (#10459)
* Update main branch to 7.4

* Update API version to 740

* Make fdb_c_client_config_tests.py pass after the API version update

* Remove from_7.3.0_until_7.4.0 and add from_7.3.0

* Update tests in fdb_c_client_config_tests.py
2023-06-15 10:19:39 +02:00