Commit Graph

156 Commits

Author SHA1 Message Date
Josh Slocum 7c155f4521
Granule force purging (#7846)
* Granule purge cannot delete history entry for fully deleting granule until all children are completely done splitting

* Several purging fixes related to granule history

* Fixed typo in refactor

* fixing memory model for purgeRange

* formatting

* weakening granule purge test for now

* cleanup

* First version of force purging granules

* fixing issue in BW range assignment reporting

* Fixing incorrect assert with force purging

* Error handling when checking force purged state

* fixed force purging and recover/reassign range races and check

* Handling force purge + boundary change race

* more places to check for force purged status

* fixed manager restart in the middle of force purge bug

* fixing same-BM purge and assignment races in all cases

* weakening orphaned granule history check a bit because of difficult to solve races

* fixing txn options on retry

* loading force purged ranges at start to avoid resuming a merge that is being force purged

* cleanup

* Enabling purging in granule tests, and adding check for leaked change feeds in force purge

* formatting

* missed parameter in merge conflicts

* Fixing leaked change feed race with merge and force purge

* adding change feed cleanup when new blob manager recovers in-progress merge that raced with force purge

* added forcepurge fdbcli command
2022-08-11 15:22:32 -07:00
A.J. Beamon 1569f033c8 Reduce the number of extra databases to prevent using too many files 2022-08-08 12:47:35 -07:00
Josh Slocum b2835921ba
Using knownBlobRanges for blob granule ranges whether tenants are enabled or not (#7788)
* Using knownBlobRanges for blob granule ranges whether tenants are enabled or not

* Effectively disabled blob granule tests when tenants enabled to fix ctest
2022-08-05 11:46:09 -05:00
A.J. Beamon ff23d5994e
Merge pull request #7729 from sfc-gh-ajbeamon/feature-metacluster
Metacluster
2022-08-04 15:29:44 -07:00
A.J. Beamon fbe1a4a69a Use multiple databases in the metacluster management test. Fix a test bug as well as some issues with setting up multiple extra databases. 2022-08-03 19:10:34 -07:00
Dennis Zhou b34a54fa7f
blob: allow for alignment of granules to tuple boundaries (#7746)
* blob: read TenantMap during recovery

Future functionality in the blob subsystem will rely on the tenant data
being loaded. This fixes this issue by loading the tenant data before
completing recovery such that continued actions on existing blob
granules will have access to the tenant data.

Example scenario with failover, splits are restarted before loading the
tenant data:
BM - BlobManager
epoch 3:                        epoch 4:
  BM record intent to split.
  Epoch fails.
                                BM recovery begins.
  BM fails to persist split.
                                BM recovery finishes.
                                BM.checkBlobWorkerList()
                                  maybeSplitRange().
                                BM.monitorClientRanges().
                                  loads tenant data.

bin/fdbserver -r simulation -f tests/slow/BlobGranuleCorrectness.toml \
    -s 223570924 -b on  --crash --trace_format json

* blob: add tuple key truncation for blob granule alignment

FDB has a backup system available using the blob manager and blob
granule subsystem. If we want to audit the data in the blobs, it's a lot
easier if we can align them to something meaningful.

When a blob granule is being split, we ask the storage metrics system
for split points as it holds approximate data distribution metrics.
These keys are then processed to determine if they are a tuple and
should be truncated according to the new knob,
BG_KEY_TUPLE_TRUNCATE_OFFSET.

Here we keep all aligned keys together in the same granule even if it is
larger than the allowed granule size. The following commit will address
this by adding merge boundaries.
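
The truncation step described above can be sketched roughly as follows. This is an illustration only, not the FoundationDB implementation: keys are modelled as already-decoded Python tuples rather than tuple-encoded bytes, `truncate_offset` stands in for BG_KEY_TUPLE_TRUNCATE_OFFSET, and the assumed semantics (keep the first N tuple elements) and both function names are hypothetical.

```python
from typing import List, Tuple


def truncate_split_point(key: Tuple, truncate_offset: int) -> Tuple:
    """Keep only the first `truncate_offset` elements of a tuple key.

    Assumption: an offset <= 0 disables truncation, and keys that are
    already short enough are returned unchanged.
    """
    if truncate_offset <= 0 or len(key) <= truncate_offset:
        return key
    return key[:truncate_offset]


def align_split_points(split_points: List[Tuple], truncate_offset: int) -> List[Tuple]:
    """Truncate sorted split points and drop adjacent duplicates.

    Several candidate split points can collapse onto the same truncated
    key; only one boundary survives, so all keys sharing that prefix
    stay in a single granule regardless of its size.
    """
    aligned: List[Tuple] = []
    for key in split_points:
        truncated = truncate_split_point(key, truncate_offset)
        if not aligned or aligned[-1] != truncated:
            aligned.append(truncated)
    return aligned
```

For example, with an offset of 1, the candidates `("a", 1)`, `("a", 2)` and `("b", 1)` collapse to the two boundaries `("a",)` and `("b",)`, which is why a granule holding every key under `("a",)` may exceed the target size.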

* blob: minor clean ups in merging code

1. Rename mergeNow -> seen. This is more in line with clocksweep naming
   and removes the confusion between mergeNow and canMergeNow.
2. Make clearMergeCandidate() reset to MergeCandidateCannotMerge to make
   a clear distinction what we're accomplishing.
3. Rename canMergeNow() -> mergeEligble().

* blob: add explicit (hard) boundaries

Blob ranges can be specified either through explicit ranges or at the
tenant level. Right now this is managed implicitly. This commit aims to
make it a little more explicit.

Blobification begins in monitorClientRanges() which parses either the
explicit blob ranges or the tenant map. As we do this and add new
ranges, let's explicitly track what is a hard boundary and what isn't.

When blob merging occurs, we respect this boundary. When a hard boundary
is encountered, we submit the found eligible ranges and start looking
for a new range beginning with this hard boundary.
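
A minimal sketch of the merge scan described above, under stated assumptions: granules are given as a sorted, contiguous list of `(begin, end, eligible)` triples, hard boundaries are a set of begin keys, and a merge needs at least two granules. The function name and data shapes are hypothetical and do not mirror the blob manager code.

```python
from typing import List, Set, Tuple

Granule = Tuple[str, str, bool]  # (begin key, end key, merge-eligible)


def plan_merges(granules: List[Granule], hard_boundaries: Set[str]) -> List[Tuple[str, str]]:
    """Group runs of adjacent eligible granules into merge ranges.

    A run is submitted and a new one started whenever a granule begins
    at a hard boundary, so no merged range ever spans one.
    """
    plans: List[Tuple[str, str]] = []
    current: List[Tuple[str, str]] = []

    def flush() -> None:
        # Only a run of two or more granules is worth merging.
        if len(current) >= 2:
            plans.append((current[0][0], current[-1][1]))
        current.clear()

    for begin, end, eligible in granules:
        if begin in hard_boundaries:
            flush()  # hard boundary: submit what we have, start fresh
        if eligible:
            current.append((begin, end))
        else:
            flush()  # an ineligible granule also ends the run
    flush()
    return plans
```

With four eligible granules covering `a` to `e` and a hard boundary at `c`, the scan yields two merges, `a` to `c` and `c` to `e`, instead of one merge across the boundary.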

* blob: create BlobGranuleSplitPoints struct

This is a setup for the following commit. Our goal here is to provide a
structure for split points to be passed around. The need is for us to be
able to carry uncommitted state until it is committed and we can apply
these mutations to the in-memory data structures.

* blob: implement soft boundaries

An earlier commit establishes the need to create data boundaries within
a tenant. The reality is we may encounter a set of keys that degenerate
to the same key prefix. We'll need to be able to split those across
granules, but we want to ensure we merge the split granules together
before merging with other granules.

This adds new BlobGranuleMergeBoundary items to the
BlobGranuleSplitPoints state. Each BlobGranuleMergeBoundary records
whether it is a left or right boundary. As with hard boundaries, this
information is used to force like granules to merge first.

We read the BlobGranuleMergeBoundary map into memory at recovery.
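
The idea of carrying split points and their merge boundaries as uncommitted state, and applying them to memory only after the commit succeeds, can be sketched as below. All field names, the `is_left` flag, and the `persist` callable (a stand-in for the transaction commit) are assumptions made for illustration; the real BlobGranuleSplitPoints and BlobGranuleMergeBoundary layouts may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MergeBoundary:
    is_left: bool  # hypothetical flag: True = left boundary, False = right


@dataclass
class SplitPoints:
    keys: List[bytes] = field(default_factory=list)
    boundaries: Dict[bytes, MergeBoundary] = field(default_factory=dict)


class InMemoryState:
    """Stand-in for the manager's in-memory granule structures."""

    def __init__(self) -> None:
        self.granule_keys: List[bytes] = []
        self.merge_boundaries: Dict[bytes, MergeBoundary] = {}

    def apply(self, sp: SplitPoints) -> None:
        self.granule_keys = sorted(set(self.granule_keys) | set(sp.keys))
        self.merge_boundaries.update(sp.boundaries)


def commit_split(state: InMemoryState, sp: SplitPoints,
                 persist: Callable[[SplitPoints], None]) -> None:
    """Persist first; touch in-memory state only if persisting succeeded."""
    persist(sp)      # may raise, in which case `state` is left untouched
    state.apply(sp)
```

Keeping the boundaries inside the same structure as the split keys means a failed commit discards both together, so memory never holds a boundary whose split was not persisted.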
2022-08-02 16:06:25 -05:00
A.J. Beamon 9ded40b6e1 The metacluster management test should disallow creating any tenants in general workload setup. 2022-07-30 13:09:30 -07:00
A.J. Beamon 8b7b6d1d4c Various cleanup; change some test parameters; add a test for metacluster management operations 2022-07-29 09:24:06 -07:00
A.J. Beamon 7c6b3fb0b8 Merge branch 'main' into feature-metacluster 2022-07-27 08:55:10 -07:00
A.J. Beamon dec6dbfbfb
Merge pull request #7549 from sfc-gh-ajbeamon/feature-tenant-groups
Add support for tenant groups
2022-07-27 07:56:27 -07:00
Josh Slocum 77956dc7ae
Merge pull request #7639 from sfc-gh-jslocum/cf_metadata_rewrite
Change Feed Metadata Rewrite and adding targeted fault injection
2022-07-26 18:10:37 -05:00
A.J. Beamon a64693518a Add support for tenant groups 2022-07-26 09:04:29 -07:00
Yao Xiao 98bff116a4
Disabled unsupported tests. (#7693) 2022-07-25 21:57:47 -07:00
Josh Slocum 78d4d85f3b Adding non-tss delay injection to SS as well 2022-07-19 09:59:14 -05:00
Josh Slocum 0d9bb9f4a5 Added targeted storage server restarts at critical metadata points 2022-07-19 08:33:43 -05:00
A.J. Beamon 9f3819752f Change the command to create a metacluster from using 'configure tenant_mode=management' to 'metacluster create <NAME>'. Distribute this name to all processes in a metacluster. Eliminate the tenant mode entirely from metacluster clusters, instead relying on a metacluster registration key. 2022-06-22 12:15:43 -07:00
A.J. Beamon 986dd67278 Add some basic support for running multiple extra clusters in simulation. Use this to simulate a metacluster in some tests. 2022-06-10 10:08:18 -07:00
A.J. Beamon eabd43c0fd Add a workload that creates and deletes tenants simultaneously. 2022-06-07 13:48:12 -07:00
Josh Slocum ffa4255c65 Added blob metadata concept as new secret type, and verified blob workers can load it 2022-05-27 15:15:56 -05:00
Josh Slocum 85af0a25b2 Enabling BM to understand tenant boundaries, and changing BlobGranuleCorrectness to use tenants 2022-05-25 17:16:56 -05:00
Josh Slocum 49b50ae8b0 Disabling default tenant in blob granule tests 2022-05-25 17:16:56 -05:00
A.J. Beamon 19d78cf2a3 When clearing the database between tests, check that clearing the tenant left the entire normal key-space empty. Update the configuration of some tests. Disable a special key-space test that is invoking broken behavior. 2022-04-14 11:39:02 -07:00
Chaoguang Lin 7d365bd1bb
Remote ikvs debugging (#6465)
* initial structure for remote IKVS server

* moved struct to .h file, added new files to CMakeList

* happy path implementation, connection error when testing

* saved minor local change

* changed tracing to debug

* fixed onClosed and getError being called before init is finished

* fix spawn process bug, now use absolute path

* added server knob to set ikvs process port number

* added server knob for remote/local kv store

* implement simulator remote process spawning

* fixed bug for simulator timeout

* commit all changes

* removed print lines in trace

* added FlowProcess implementation by Markus

* initial debug of FlowProcess, stuck at parent sending OpenKVStoreRequest to child

* temporary fix for process factory throwing segfault on create

* specify public address in command

* change remote kv store knob to false for jenkins build

* made port 0 open random unused port

* change remote store knob to true for benchmark

* set listening port to randomly opened port

* added print lines for jenkins run open kv store timeout debug

* removed most tracing and print lines

* removed tutorial changes

* update handleIOErrors error handling to handle remote-ikvs cases

* Push all debugging changes

* A version where worker bug exists

* A version where restarting tests fail

* Use both the name and the port to determine the child process

* Remove unnecessary update on local address

* Disable remote-kvs for DiskFailureCycle test

* A version where restarting stuck

* A version where most restarting tests green

* Reset connection with child process explicitly

* Remove change on unnecessary files

* Unify flags from _ to -

* fix merging unexpected changes

* fix trac.error to .errorUnsuppressed

* Add license header

* Remove unnecessary header in FlowProcess.actor.cpp

* Fix Windows build

* Fix Windows build, add missing ;

* Fix a stupid bug caused by code dropped by code merging

* Disable remote kvs by default

* Pass the conn_file path to the flow process, though not needed, but the buildNetwork is difficult to tune

* serialization change on readrange

* Update traces

* Refactor the RemoteIKVS interface

* Format files

* Update sim2 interface to not clog connections between parent and child processes in simulation

* Update comments; remove debugging symbols; Add error handling for remote_kvs_cancelled

* Add comments, format files

* Change method name from isBuggifyDisabled to isStableConnection; Decrease(0.1x) latency for stable connections

* Commit the IConnection interface change, forgot in previous commit

* Fix the issue that onClosed request is cancelled by ActorCollection

* Enable the remote kv store knob

* Remove FlowProcess.actor.cpp and move functions to RemoteIKeyValueStore.actor.cpp; Add remote kv store delay to avoid race; Bind the child process to die with parent process

* Fix the bug where one process starts storage server more than once

* Add a please_reboot_remote_kv_store error to restart the storage server worker if remote kvs died abnormally

* Remove unreachable code path and add comments

* Clang format the code

* Fix a simple wait error

* Clang format after merging the main branch

* Testing mixed mode in simulation if remote_kvs knob is enabled, setting the default to false

* Disable remote kvs for PhysicalShardMove which is for RocksDB

* Cleanup #include orders, remove debugging traces

* Revert the reorder in fdbserver.actor.cpp, which fails the gcc build

Co-authored-by: Lincoln <lincoln.xiao@snowflake.com>
2022-03-31 17:08:59 -07:00
Josh Slocum fd6c9544e2 Disabling rocks in blob granule tests 2022-03-24 20:40:16 -05:00
Josh Slocum 37e7c80f26 Merge branch 'main' into blob_integration 2022-03-17 18:45:42 -05:00
A.J. Beamon 74487310fa Fix a couple test specification errors 2022-03-17 12:10:19 -07:00
A.J. Beamon d0dc756c6d Allow disabling tenant mode in simulation. Fix a few bugs. 2022-03-17 12:10:18 -07:00
A.J. Beamon 81e8c7c362 Various test fixes to work with tenants. 2022-03-17 12:10:18 -07:00
A.J. Beamon 05495908b8 Implement some tenant tests 2022-03-17 12:10:18 -07:00
Josh Slocum cebe367037 Shortening BGVerifyLarge to match duration of other slow/BG* tests 2022-03-01 08:59:58 -06:00
Josh Slocum bc7cc984b0 Fixing BGVerifyBalance test killing issues 2022-02-25 11:30:21 -06:00
Josh Slocum 38a75a8b89 Merge branch 'main' into blob_integration 2022-02-17 17:47:38 -06:00
Josh Slocum 14cc0a8b02 Got BlobGranuleCorrectnessWorkload passing a single test 2022-01-25 15:46:29 -06:00
Josh Slocum 672b7ab89d Added new test for blob granules and had more consistent naming 2022-01-24 15:15:27 -06:00
Suraj Gupta 968a4f9f50 Don't rely on database config to be updated. 2021-12-10 14:00:34 -06:00
A.J. Beamon 88ae9fd1a8 Modify parameters of ApiCorrectnessSwitchover test to avoid out of memory errors. 2021-12-03 17:42:59 -08:00
Neethu Haneesha Bingi fa4ed67f70 Excluding DiskFailureCycle test for rocksdb storage engine. 2021-12-02 10:27:48 -08:00
Steve Atherton 80031e4ae6 Make DiskFailureCycle test do less work so valgrind tests finish faster. 2021-11-16 21:15:42 -08:00
negoyal 1e7338b6c3 Merge branch 'master' into bit-flipping-workload 2021-10-28 14:24:49 -07:00
Josh Slocum 5f0ec0612a Merge branch 'feature-range-feed' into blob_full 2021-10-13 15:44:35 -05:00
Suraj Gupta cfb8368da6 Address PR comments. 2021-10-13 14:56:17 -04:00
negoyal f913dfed97 Merge branch 'master' into bit-flipping-workload 2021-10-11 16:34:57 -07:00
Josh Slocum 77b0458f9f reducing volume for large workload since there are multiple testers 2021-09-25 11:25:12 -05:00
Xiaoxi Wang 1730d75f73 change configure test
add store type check
add test file
2021-09-21 18:11:04 -07:00
Josh Slocum c780b8ae69 adding granule tests to test suite 2021-09-17 10:17:05 -05:00
negoyal a48148fdb2 Tweak the chaos toml file. 2021-09-08 22:53:52 -07:00
negoyal 7729a282ce Misc fixes and updated test toml file. 2021-09-08 14:31:09 -07:00
Josh Slocum eb76343dfb Added blob granule reassignment and splitting 2021-09-08 14:09:14 -05:00
negoyal a8baeb75d0 Misc fixes. 2021-09-03 15:03:12 -07:00
Josh Slocum 8d49c98a41 Added simulation workload for blob granules and fixed some bugs 2021-08-26 13:48:05 -05:00