Commit Graph

26196 Commits

Author SHA1 Message Date
He Liu eaa934dac6
Added more logs about shard management. (#10303) 2023-05-22 18:00:00 -07:00
Yao Xiao bbf15be05f
Knobs to speed up DB open. (#10301) 2023-05-22 16:21:05 -07:00
Vaidas Gasiunas 9bc55f67c3
Fix releasing watches on future cancellation (#10304)
* Test watch cleanup on cancel

* Fix clearing the database in Java integration tests

* Always cancel the futures wrapped by MVC abortable futures

* More tests for watch cleanup

* Fix clearing the database in some Java integration tests
2023-05-22 22:01:27 +02:00
Zhe Wang 6c980862c3
Improve throughput of audit storage (#10245)
* improve audit throughput

* if ssshard fails to do audit due to ssi failure, then global retry is required

* fix a trace event name

* fix budget release in doAudit

* avoid throttling in general simulation tests

* fix doAuditOnStorageServer throw error

* avoid starting a task that has been completed

* when ddaudit ssshard failed, check if ssi is removed, if yes, silently exit

* fix trace detail name of AuditUtilStorageServerRemovedEnd event

* redo schedule in doAuditOnStorageServer

* schedule does not wait doAudit

* remove TESTING_AUDIT_STORAGE_THROTTLING

* ssaudit stops proceeding if ddauditstate is not in running phase

* make tester audit storage only happen in simulation, and randomly set CONCURRENT_AUDIT_TASK_COUNT_MAX
2023-05-22 12:09:08 -07:00
sfc-gh-tclinkenbeard 7ef66ab356 Add OutstandingWatches and WatchMapSize to TransactionMetrics 2023-05-22 12:07:10 -07:00
Ata E Husain Bohra 2b0a08dbe4
BlobMetadata: Move SimBlobMetadata store to SimKmsVault (#10269)
Description

Patch refactors SimKmsConnector to move the SimBlobMetadata store to SimKmsVault

Testing

BlobGranuleCorrectness - 100K
/fdbserver/blob/connectionprovider - 100K
devRunCorrectness - 100K
2023-05-22 11:00:59 -07:00
Jingyu Zhou f820e92878
Merge pull request #10292 from sfc-gh-yajin/fix-heap-use-after-free 2023-05-19 20:48:33 -07:00
Jingyu Zhou e1a7335150
Merge pull request #10291 from jzhou77/main 2023-05-19 20:43:17 -07:00
Jingyu Zhou 8878de8c8f
Merge pull request #10288 from sfc-gh-yajin/update-test-pattern-1
Fix a test pattern so that simulator tests do not run non-sim tests
2023-05-19 16:34:01 -07:00
Jingyu Zhou 28dadd916a Track customReplicas failure times 2023-05-19 15:49:51 -07:00
Yanqin Jin 2951fe7033 Fix 2023-05-19 15:23:37 -07:00
Jingyu Zhou 2b3d60b960 Track failure time of different team sizes separately
In some rare cases, using the same failure time may prevent a certain size from
getting a team, thus causing data movement to become stuck.

Reproduction:
  Seed: -f ./tests/rare/DcLag.toml -s 924031356 -b on
  commit: 0975874b9 on release-7.3
  build: gcc
2023-05-19 14:55:32 -07:00
Hui Liu 7ca13d8f9c
support blob restore in fdbrestore (#10248) 2023-05-19 14:45:14 -07:00
Zhe Wu 93ad70db38
Merge pull request #10263 from halfprice/zhewu/gc-generation-using-recoverat
GC earlier TLog generation using each generation's `recover at` version instead of `start version`
2023-05-19 12:07:02 -07:00
Jefferson Zhong 3760522dc2 Make stepSize configurable for preloadApplyMutationsKeyVersionMap 2023-05-19 10:57:30 -07:00
Yanqin Jin 92873f2e1f Fix a test pattern so that simulator tests do not run non-sim tests (#256) 2023-05-18 21:48:47 -07:00
Jingyu Zhou 6ff024839b
Merge pull request #10283 from atn34/atn34/remove-duplicate-declaration
Remove unnecessary duplicate declaration
2023-05-18 21:04:27 -07:00
Jingyu Zhou 7b07c3d2c9
Merge pull request #10271 from jzhou77/main
Fix ConsistencyCheck_DataInconsistent failure in simulation
2023-05-18 21:03:01 -07:00
Jingyu Zhou 9b6491ccbd
Merge pull request #10282 from jzhou77/fix-head 2023-05-18 20:38:52 -07:00
Jingyu Zhou 58999e493f Also check storage servers are in different DCs 2023-05-18 16:57:58 -07:00
Andrew Noyes 4d2f038b60 Remove unnecessary duplicate declaration of fdb_tenant_list_blobbified_ranges 2023-05-18 16:20:18 -07:00
Jingyu Zhou 4641f17808 Ignore noSim/PerfShardedRocksDBTest.toml for Valgrind build 2023-05-18 16:09:41 -07:00
Yanqin Jin 686bd52953
Update a test to check data cluster's version only when registered (#10275)
This PR fixes an uninitialized-value error reported by valgrind.

Steps to reproduce:
Ensemble: 20230516-050214-nightly_valgrind_main_x86_64_apple-b77b4ac44a15911e
Profile: team valgrind
Commit hash: 0ce1ab3162 on apple/main
Command: devRetryCorrectnessTest valgrind bin/fdbserver -r simulation -f tests/slow/MetaclusterManagement.toml -s 1695330198 -b off --crash --trace_format json
2023-05-18 15:28:31 -07:00
Jingyu Zhou c2e30d3829 format code and remove an unused variable 2023-05-18 15:00:18 -07:00
Yao Xiao cef93f7d22
knobs (#10253) 2023-05-18 14:58:09 -07:00
He Liu a5f639f859
Fix psm test (#10273) 2023-05-18 14:54:26 -07:00
Jingyu Zhou b1d0e9c578 Merge branch 'main' of https://github.com/apple/foundationdb 2023-05-18 14:42:35 -07:00
Josh Slocum 2916a11a86
New ConsistencyScan (#10265)
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.

* Initial progress checkpoint on new ConsistencyScan role.

* Updated TODOs, finished most if not all state updates.

* placeholder

* Add more TODOs, documentation and comment improvements.

* Checkpoint round state to avoid advancing progress if commit fails.

* Bug fix, check is supposed to be for overlap, not lack of overlap.

* Added more TODOs and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.

* Update JSON schemas and command help.

* Add comment about lifetime stats reset.

* More TODO comments and some renames for clarity, some bug fixes.

* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail

* removing trailing comma from consistency_scan json schema

* Making CC inconsistency not an error if it's intentional tss corruption

* consistency scan actually reads storage locations

* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck

* made consistency scan properly fetch database size

* refactoring data check to be used in both consistency scan and consistency check

* checking that consistency scan always completes at least one round and doesn't get stuck

* cleanup

* fixing ide build

* consistencyscan fdbcli command wasn't actually changing db state

* consistencyscan fdbcli command always said enabled even when it wasn't

---------

Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
2023-05-18 15:02:41 -05:00
Jingyu Zhou 1a8f3445ff Remove unused state variables 2023-05-18 12:23:22 -07:00
Jingyu Zhou 8194997308
Merge pull request #10266 from xis19/main
Use fmt::print instead of printf
2023-05-18 11:45:14 -07:00
Jingyu Zhou b6bf512377 Merge branch 'main' of https://github.com/apple/foundationdb 2023-05-18 11:28:34 -07:00
Zhe Wu 2f3ecab125
Merge pull request #10267 from halfprice/zhewu/updateFDB_AV_LATEST_BINDINGS_VERSION
Update FDB_AV_LATEST_BINDINGS_VERSION to 730
2023-05-18 11:00:07 -07:00
Ata E Husain Bohra 3506ab5cfa
SimKmsVault: Fix invalidHeader test (#10272)
Description

Fix invalid header test to throw appropriate error

Testing

RESTSimKmsVault unittest - 100K
2023-05-18 10:53:45 -07:00
Jingyu Zhou eff338f7bd Fix ConsistencyCheck_DataInconsistent failure in simulation
In the KillRegion workload, a force recovery can intentionally kill a region
and recover from another region with data loss. It's possible that when the
ConsistencyScan (CS) role is checking a shard from two storage servers, the
one from the failed region returned first for a higher version, then the SS
was killed. Later, CS got data from the second SS from a different region,
where forced recovery rollback happened and a lower version value was returned.
In this case, even though the data is inconsistent, we shouldn't fail the test.

Reproduction:
  commit: 8e0c9eab8 on release-7.3 branch
  seed: -f ./tests/fast/KillRegionCycle.toml -s 2713020850 -b on
  build: gcc
2023-05-17 22:52:11 -07:00
Zhe Wu 46d89e988d Update FDB_AV_LATEST_BINDINGS_VERSION to 730 2023-05-17 16:28:21 -07:00
Xiaoge Su 32a4376813 Reorder the comments 2023-05-17 14:18:20 -07:00
Xiaoge Su 96001af790 Use fmt::print instead of printf in HTTP.actor.cpp
This avoids the mismatch between printf mini-language and variable type.
2023-05-17 14:18:20 -07:00
Ata E Husain Bohra e25b9ff686
EaR: REST based Simulated KMS Vault request handler interface (#10240)
* EaR: REST based Simulated KMS Vault request handler interface

Description

  diff-1: Address review comments
             Improve unit test case coverage
  diff-2: Extend RESTKmsConnectorUtil to generate HTTP::Header

EaR simulation testing is currently driven by the SimKmsConnector
interface, which exposes endpoints invoked directly by EKP to fetch
encryption keys. This approach avoids testing the RESTKms communication
path. The FDB codebase was recently extended with an HTTPServer
interface, whose absence had been a gap preventing end-to-end testing
of EaR code.

The patch proposes the following changes:
1. Refactor RESTKmsConnector to move common code and definitions
to the RESTKmsConnectorUtil namespace
2. Introduce RESTSimKmsVault, which accepts HTTP format requests and
provides appropriate HTTP responses.

Testing

RESTUnit          100K + 5k valgrind
devRunCorrectness 100K
2023-05-17 12:38:09 -07:00
Zhe Wu 3b651697c5 Update documents 2023-05-16 21:11:48 -07:00
Jefferson Zhong db4591eb2d
Stop BGCC while blob restore is in progress (#10236)
* Stop BGCC while blob restore is in progress

* Explicitly add back knob BG_CONSISTENCY_CHECK_ENABLED for possible later use

* Add back the knobs to test files to avoid race condition
2023-05-16 15:47:06 -07:00
Zhe Wu 1c290d3bc8 Make TLog server handle empty oldGenerationRecoverAtVersions 2023-05-16 15:16:42 -07:00
Zhe Wu 1eae833ae2 test record_recover_at_in_cstate and track_tlog_recovery in restart test 2023-05-16 13:37:42 -07:00
Zhe Wu b2197e0062 Do not track TLog recovery if old generations have an invalid `recover at` version. This can happen when tracking of TLog recovery has just been turned on. 2023-05-16 13:20:00 -07:00
Zhe Wu a956979c32 Replace oldestGenerationStartVersion with oldestGenerationRecoverAtVersion 2023-05-16 13:09:34 -07:00
Zhe Wu 0bdfe1889b Add recovered at in CSTATE, and use a knob to guard the use of it 2023-05-16 12:47:00 -07:00
Josh Slocum 185e7d9f30
fixing BlobGranuleRequests to properly bump read version on retry (#10216) 2023-05-16 14:12:00 -05:00
Josh Slocum 3ea16ff579
Blob kms connector ids (#10121)
* blob metadata refactor to use location id and simplify rest api

* buggifying different ordering of locations in blob metadata response
2023-05-16 13:10:11 -05:00
neethuhaneesha 854464a6af
Hex values in TSS logs and rocksdb debuglogs mode knob (#10231) 2023-05-16 10:34:58 -07:00
Hui Liu c59e418d0f
Add TLS command line options to fdbbackup modify (#10239) 2023-05-16 08:50:50 -07:00
Zhongxing Zhang 0ce1ab3162 Add urllib3<2 to requirements.txt for Sphinx (Cherry-Pick #10141 to snowflake/release-71.2) (#10147) 2023-05-15 19:26:50 -07:00