Commit Graph

13168 Commits

Author SHA1 Message Date
Sreenath Bodagala d7eb028b2a
Enable replica consistency check on data movement (#11415)
* - Enable replica consistency check on data movement (and, randomly, on
all reads)

* - Address PR review comments
2024-06-17 17:07:32 -04:00
Xiaoge Su 3e3eee98fc fixup! Reformat source 2024-06-17 11:41:06 -07:00
Xiaoge Su afc04366fb Rewrite BUGGIFY related code
This is a rewrite of the BUGGIFY function and macros. Performance during
simulation seems to have improved a lot, e.g.

fdbserver -r simulation -b on -f ../CycleTest.toml -s 99438

Without this patch:

Unseed: 54646
Elapsed: 494.091327 simsec, 14.586831 real seconds

With this patch:

Unseed: 54646
Elapsed: 494.091327 simsec, 12.580612 real seconds

I expected the improvement but did not expect a ~13% improvement.
2024-06-17 11:41:06 -07:00
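
The timings quoted in the BUGGIFY rewrite commit above (afc04366fb) can be checked with a little arithmetic. The sketch below is an annotation, not part of the commit; the variable names are illustrative, and each figure comes from a single run, so the result is indicative only.

```python
# Real-time figures copied from commit afc04366fb (one run each, seed 99438).
before_s = 14.586831  # real seconds without the patch
after_s = 12.580612   # real seconds with the patch

saved_s = before_s - after_s
reduction_pct = 100 * saved_s / before_s  # share of wall-clock time removed
speedup = before_s / after_s              # ratio of old to new run time

print(f"saved {saved_s:.3f} s, {reduction_pct:.1f}% less real time, {speedup:.2f}x speedup")
```

The reduction works out to a little under 14%, in line with the "~13%" in the message. The Unseed and simulated-time values are identical in both runs, which suggests the patch changed only wall-clock cost and not simulated behavior.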
Xiaoge Su c21c6c6ac3 fixup! Another ASSERT with side effect 2024-06-17 11:41:06 -07:00
Xiaoge Su 0b96d52eec fixup! Remove a stateful ASSERT statement 2024-06-17 11:41:06 -07:00
Andrew Noyes 7dc1281319 Remove two ptree searches when processing a clear 2024-05-30 11:50:17 -07:00
Zhe Wang b87e3003ac fix-restart-test-dmid 2024-05-22 19:22:35 -07:00
neethuhaneesha 8de8dd4281 RKUpdate metrics changes. 2024-05-21 11:57:58 -07:00
Dan Lambright 4bda00ab9c
use unique pointer (#11408)
Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-05-20 18:43:18 -04:00
Jingyu Zhou fecffc93e4 Fix a segfault when tlog encounters platform_error
During destruction, the rejoinClusterController actor should be cancelled to avoid
accessing the TLogData object.
2024-05-17 11:40:22 -07:00
Jingyu Zhou 5bcbb6fa1c
Merge pull request #11401 from hfu94/fix_main
Fix globalconfig refresh hang issue
2024-05-16 10:12:08 -07:00
Yao Xiao 1791d07be1
Improvements (#11363) 2024-05-15 09:04:50 -07:00
hao fu 6b782c10f6 Fix globalconfig refresh hang issue
CC sets a version to int_max in ClientDBInfo to indicate a refresh; however, the
proxy server rejects this version with a future_version error.

This change fixes the issue by not sending int_max, instead maintaining a
lastKnown in memory and sending it to grvproxy to get the latest globalconfig.

This change also fixes some Java tests that were used to test the fix.
2024-05-14 15:40:03 -07:00
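
The failure mode described in commit 6b782c10f6 above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not FoundationDB code: `ToyProxy`, `ToyClient`, the `INT_MAX` value, and the rule for updating `last_known` are all hypothetical stand-ins chosen to mirror the wording of the commit message.

```python
INT_MAX = 2**63 - 1  # hypothetical sentinel; the commit does not give the real type or value


class FutureVersion(Exception):
    """Toy counterpart of the future_version error named in the commit."""


class ToyProxy:
    def __init__(self, committed_version, config):
        self.committed_version = committed_version
        self.config = config

    def get_global_config(self, version):
        # A request for a version the proxy has not reached yet is rejected.
        if version > self.committed_version:
            raise FutureVersion(version)
        return self.config


class ToyClient:
    def __init__(self, proxy):
        self.proxy = proxy
        self.last_known = 0  # last version seen, kept in memory (assumed start value)

    def refresh_with_sentinel(self):
        # Old behaviour as described: the sentinel itself is sent and gets rejected.
        return self.proxy.get_global_config(INT_MAX)

    def refresh_with_last_known(self):
        # New behaviour as described: send a version the proxy can accept.
        config = self.proxy.get_global_config(self.last_known)
        self.last_known = self.proxy.committed_version  # assumption: remember what was served
        return config


proxy = ToyProxy(committed_version=100, config={"knob": "value"})
client = ToyClient(proxy)
try:
    client.refresh_with_sentinel()
except FutureVersion:
    print("sentinel request rejected")
print(client.refresh_with_last_known())
```

In this model the sentinel request can never succeed, because no proxy version ever reaches the sentinel, while a request carrying the last known version is always answerable.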
Jingyu Zhou 0453b01622 Fix an assertion failure when waiting for recovery
CC's checkBetterSingletons() calls getUsedIds() that asserts proxy interfaces
are present. However, when a GRV/commit proxy fails, before CC starts a new
recovery, the proxy's processId becomes empty, thus triggering the failure.

The fix is to cancel the caller while waiting for recovery.

To reproduce with a clang build of 7.1 commit 725a08a3ff:

./fdbserver.6.0.15 -r simulation -f ./tests/restarting/from_5.2.0_until_6.3.0/ClientTransactionProfilingCorrectness-1.txt -s 900000399 -b on
-f ./tests/restarting/from_5.2.0_until_6.3.0/ClientTransactionProfilingCorrectness-2.txt --restarting -s 900000400 -b on
2024-05-14 12:47:41 -07:00
Dan Lambright cc6948cec7
Only emit version vector counters when enabled. (#11385) 2024-05-14 08:47:53 -07:00
neethuhaneesha fa15b9df49
RocksDB memtable max range deletions knob update. (#11386) 2024-05-13 15:54:43 -07:00
Sreenath Bodagala 033df029a5
- Support for doing replica consistency check on data movement (#11373) 2024-05-10 14:15:17 -04:00
Zhe Wang aaabbedcc4
fix ss queue rebalance (#11375) 2024-05-08 09:35:42 -07:00
Dan Lambright 5d6333fab9
Add logging to LogRouter's waitForVersion function (#11359)
* Add logging to waitForVersion

* Respond to review comments.

---------

Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-05-03 17:04:51 -04:00
Zhe Wang bf53218556
Improve distributed consistency checker (#11346)
* ConsistencyCheckerUrgent repeated run

* address comments

* avoid trace SevError for TesterRecruitmentTimeout unless it keeps failing for over 1 day

* address comments

* address comments
2024-04-30 14:45:32 -07:00
Yao Xiao 67a588380e
shard size log (#11342) 2024-04-29 13:42:19 -07:00
Yao Xiao 99910100a5 version upgrade 2024-04-26 14:53:01 -05:00
Zhe Wang 848b9c5b13 fix dcc assert false 2024-04-23 12:47:48 -07:00
Yao Xiao 9789c7f4ff
async io (#11325) 2024-04-22 14:20:11 -07:00
Zhe Wang 314f4c41c7
Fix ACS mutation bug and improve accumulative checksum (#11319)
* enable acs by default

* code clean

* improve ACS code

* nits

* nits

* fix data corruption issue triggered by acs mutation
2024-04-20 01:31:55 -07:00
Jingyu Zhou 55bbf4687e
Merge pull request #11311 from jzhou77/release-notes
Throw errors in getConsistentReadVersion
2024-04-19 16:42:22 -07:00
Yao Xiao 81b342fccd
Don't remove team when total team count is within threshold (#11295) 2024-04-19 15:40:42 -07:00
neethuhaneesha adf0e8fa18
Rocksdb metrics in status json (#11321) 2024-04-18 22:00:58 -07:00
Jingyu Zhou 04128834a4
Merge pull request #11313 from jzhou77/fix
Increase CommitProxyTerminated severity for failed_to_progress errors.
2024-04-17 22:06:42 -07:00
Jingyu Zhou e0674ced7c Increase CommitProxyTerminated severity for failed_to_progress errors.
For better visibility.
2024-04-17 14:49:48 -07:00
Jingyu Zhou 17bb1f4278 Fix BlobRestoreWorkload errors 2024-04-17 10:18:51 -07:00
neethuhaneesha c89074ab04
Revert "Added perpetualStorageWiggleSpeed check to pick perpetualStoreType (#…" (#11305)
This reverts commit 3f5b60f711.
2024-04-17 10:09:38 -07:00
neethuhaneesha ed7a275231
Rocksdb caching knob options. (#11282) 2024-04-17 10:09:14 -07:00
Jingyu Zhou 84d6b86715
Merge pull request #11309 from apple/sevwarn
Raise visibility of gray failure actions
2024-04-17 09:08:40 -07:00
Dan Lambright 3dc6c49791
respond to review comments
Make DegradedServerDetectedAndTriggerRecovery SevWarnAlways
2024-04-17 09:26:08 -04:00
Yao Xiao be3dcbde62
Sharded RocksDB knob changes. (#11291) 2024-04-16 11:15:08 -07:00
Dan Lambright 9cd5090965 Raise visibility of gray failure actions 2024-04-16 12:23:16 -04:00
Zhe Wang 832972e2da
Validate Mutation Version in Accumulative Checksum Framework (#11293)
* validate-mutation-version-in-acs-framework

* turn off knob

* randomly enable feature
2024-04-12 10:15:46 -07:00
Zhe Wang 33eecd0775
Real-time corruption detection with accumulative checksum (#11255)
* acs framework

* code refactor and fix bugs

* add ss crash loop protector

* use sharedptr instead of raw pointer

* fixed critical bugs and add private mutation acs to the framework

* enable ACS for all mutations except for clear serverTag mutation and fix bugs

* fix restarting tests

* refactor code and fix bugs

* fix AccumulativeChecksumState toString

* fix bugs

* allow all mutations in acs and fixed bugs

* fix bugs and code cleanup

* code clean up for adding recovery support

* simplify code and support recovery

* clear acs state at ss

* fix bug

* terminate validator if ss will be removed in the current batch

* simplify code

* add trace

* address comments

* optimize code

* deep copy when adding mutation to acs validator

* wrap encode and decode persist acs key

* make acstable private

* remove useless func

* remove useless func

* remove epoch in ACS validator

* add acs mutation counter in SS metrics

* code cleanup and make knob check better

* make mutation buffer global

* simplify code

* add comments

* make knob randomly set

* address comments

* ss reboot after acs mismatch found
2024-04-04 15:03:44 -07:00
Hao Fu 2a774d39a5
Suppress ChosenMachine to fix simulation error (#11277) 2024-04-03 17:26:22 -04:00
Dan Lambright f3b2bca2c2
Fix detection of private mutations in version vector (#11268)
* Fix detection of private mutations in version vector

* add assertion that all tlogs receive changes to txn state in version vector

* Re-suppress version_vector upgrade tests

---------

Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-04-01 15:20:32 -04:00
neethuhaneesha 3f5b60f711
Added perpetualStorageWiggleSpeed check to pick perpetualStoreType (#11272) 2024-04-01 11:30:18 -07:00
neethuhaneesha c96dcc74a7 Add rocksdb direct_io knobs. 2024-03-27 10:34:00 -07:00
Dan Lambright 50f8eabfa3
Remove assertion equality in tcpvmap size on resolver return. (#11262) 2024-03-21 13:42:04 -04:00
Yao Xiao de5cc85c28
block cache usage (#11251) 2024-03-14 16:15:38 -07:00
neethuhaneesha 77ff238874
Fix setting perpetual_storage_wiggle_engine being considered as wrongly configured (#11250) 2024-03-14 11:52:29 -07:00
Jingyu Zhou 1ed42ee658
Merge pull request #11236 from apple/fanout
Fix bug in which private mutations were detected incorrectly.
2024-03-13 10:59:32 -07:00
neethuhaneesha e26981a7a9
Added max range deletions before flush knob and some knob changes. (#11242) 2024-03-12 14:46:16 -07:00
Jingyu Zhou 5885e70a33
Merge pull request #11241 from jzhou77/main
Fix a DR corruption bug
2024-03-12 13:57:00 -07:00
Zhe Wang b10c7107bb
Enable Accumulative Checksum in MutationRef (#11225)
* code clean up and add accumulative checksum bits to mutation ref

* address comments and fix issues

* address comments

* propagate acs index from commit proxy to storage server

* address comments

* address comments

* address comments

* address comments
2024-03-11 09:51:31 -07:00