1021 Commits

Author SHA1 Message Date
William Dowling 0f752473be
Merge branch 'main' into radixtree-production 2023-09-25 09:52:20 +02:00
Zhe Wang 29a2f63f8d
Fix SSShard Audit (#10896)
* fix ssshard

* address comments

* fmt
2023-09-13 21:15:12 -07:00
Zhe Wu 9e5488dd3d Make sure that storage and tlog are always set to a valid type 2023-09-06 14:58:42 -07:00
Hui Liu 4d2a7d507d
Add a new blob restore state to fix a race after data copy (#10854) 2023-09-05 14:04:35 -07:00
Lukas Joswiak bfb1c51299 Add `clearknob` fdbcli command
The `clearknob` command clears the value that a knob has been set to in
the configuration database. Note that this does not mean the knob value
itself gets cleared - only the value in the configuration database is
cleared. The value of the knob will revert to whatever is hardcoded in
the corresponding `*Knobs.cpp` file.

Sample `fdbcli` session:

```
Welcome to the fdbcli. For help, type `help'.
fdb> getknob min_trace_severity
`min_trace_severity' is not found
fdb> setknob min_trace_severity 20
Please set a description for the change. Description must be non-empty
description: test
Committed (2)
fdb> getknob min_trace_severity
`min_trace_severity' is `20'
fdb> clearknob min_trace_severity
Please set a description for the change. Description must be non-empty
description: clear
Committed (4)
fdb> getknob min_trace_severity
`min_trace_severity' is not found
```

Transactions are also supported with the new `clearknob` command:

```
Welcome to the fdbcli. For help, type `help'.
fdb> begin
Transaction started
fdb> setknob min_trace_severity 20
fdb> clearknob min_trace_severity
fdb> commit
Please set a description for the change. Description must be non-empty.
description: test
Committed (16)
fdb> getknob min_trace_severity
`min_trace_severity' is not found
```
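
The override semantics described above (clearing removes only the configuration-database entry, after which the hardcoded default applies again) can be sketched as a small in-memory model. This is an illustration only, not FoundationDB code: `KnobConfig`, its methods, and the default value `5` are hypothetical placeholders.

```python
class KnobConfig:
    """Toy model of knob overrides layered on hardcoded defaults (hypothetical)."""

    def __init__(self, defaults):
        # Stand-in for the values hardcoded in the `*Knobs.cpp` files.
        self.defaults = dict(defaults)
        # Stand-in for the overrides stored in the configuration database.
        self.overrides = {}

    def setknob(self, name, value):
        self.overrides[name] = value

    def clearknob(self, name):
        # Removes only the override; the hardcoded default is untouched.
        self.overrides.pop(name, None)

    def getknob(self, name):
        # Like the session above, reports only what the configuration
        # database holds; None models the "is not found" reply.
        return self.overrides.get(name)

    def effective_value(self, name):
        # The value the process would actually run with.
        return self.overrides.get(name, self.defaults[name])


# Placeholder default, not the real default of min_trace_severity.
cfg = KnobConfig({"min_trace_severity": 5})
cfg.setknob("min_trace_severity", 20)
cfg.clearknob("min_trace_severity")
```

After the final `clearknob`, `getknob` finds nothing while `effective_value` falls back to the placeholder default.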
2023-08-31 17:36:05 -07:00
Zhe Wang 7e8f326277
Audit storage for specific engine (#10781)
* audit storage for specific engine

* fix getStorageType

* fix budget of skipAuditOnRange

* fix budget in scheduleAuditOnRange

* fix CI error

* improve trace events

* address comments
2023-08-23 10:51:24 -07:00
Zhe Wang f1c17b27fc
Multiple improvements to AuditStorages (#10685)
* remove dangerous DDAudit assert, add AuditRate knob, add progress check when ssshard completes, add progress check for ssshard in fdbcli

* throttle progress check for ssshard

* fix getAuditProgressByServer

* fix trace event for ss audit

* using name -- checkMoveKeysLockForAudit

* new scheduleAuditLocationMetadata

* address comments

* shorten progress summary for ssshard

* simplify getAuditProgressByServer in fdbcli
2023-08-14 13:13:49 -07:00
Zhe Wu eb6f0c613d Add documentation for perpetual_storage_wiggle_engine config 2023-08-10 09:35:57 -07:00
Zhe Wu ab4ae712e8 Add PerpetualWiggleStorageMigrationWorkload documentation. 2023-08-10 09:35:57 -07:00
Zhe Wu 863038a44c Add improvement for initializing storage server using new perpetual_wiggle_storage_engine config 2023-08-10 09:35:57 -07:00
Jingyu Zhou 22a3ea803c
Add "checkall" debug command for fdbcli (#10687)
* Add "checkall" for checking \xff\x02/blog/ keys

* Avoid GRV calls for getlocation

* Update comments

* Add non-stopping checking and remove verbose output

* Update checkall command to accept customized range

* Fix format

* Fix a compiling issue and output
2023-07-26 17:19:16 -07:00
Zhe Wang 522c9d4f0f
Add new implementation of audit storage for user data (#10613)
* remainingBudgetForAuditTasks should be managed within audit

* fix CI

* add audit storage test for various ranges

* clean DD

* new auditStorageUserDataQ

* fix assert fail in startTrackShardAssignment

* fix assert fail in ssaudit

* address comments

* replace assert with audit_cancel in ss audits

* add audit check progress tool

* add observability to audit progress and fix audit bugs

* fix audit progress issues and add sim test for audit progress and add trace event for the audit progress and add fdbcli to track the audit progress

* remove old audit storage on SS

* check audit progress when auditCore completes
2023-07-16 09:56:26 -07:00
William Dowling 3ea1ba1648 Remove beta status from RadixTree storage engine 2023-07-05 17:54:54 +02:00
Yanqin Jin 626a8a1a5f SNOW-804199 Support restoring a cluster with a tenant in the error state (#357)
If we restore a cluster and a previously created tenant was not included in the backup, then the tenant will be marked in an error state on the management cluster. It is then up to the operator to resolve the error, generally by deleting the tenant and recreating it if needed.

There is, however, the possibility that we restored a backup that was older than we wanted, and a newer backup would have the tenant. If we tried to restore the newer backup, it would not leave the previously missing tenant in a fully usable state.

We need to have a way to deal with this case. One option is to allow us to clear the error state of a tenant, and that can be performed before (or maybe even after) the second restore.

Test plan:
Joshua test
100K ensemble: 20230613-225414-yajin-439d13ef3c6b3afd fail=0
2023-06-15 22:23:46 -07:00
Josh Slocum 31e4610b56
misc operational and documentation improvements (#10465)
* misc operational and documentation improvements

* fixing doc build
2023-06-12 15:14:01 -05:00
Jon Fu b4e2aef58b
add tenant_id_prefix to metacluster status (#10455) 2023-06-09 15:03:49 -04:00
Jingyu Zhou 66b0699774 Fix IDE build 2023-06-08 16:59:17 -07:00
Jingyu Zhou b8c0087ca6 Fix compiling errors 2023-06-07 15:10:00 -07:00
Jingyu Zhou 614686f737 Add getlocation and getall fdbcli debug commands
getlocation: returns the SS list for a key
getall: returns both the SS list and values on the SS for a key
2023-06-07 14:36:16 -07:00
He Liu ea2b611061
Print server IP address. (#10423) 2023-06-07 13:22:25 -07:00
Josh Slocum 220b7d1a37
Consistency scan test improvements (#10402)
* adding consistency scan clear stats and testing in simulation

* Adding test that intentionally injects corruption in consistency scan requests and ensures the scan finds it

* cleanup

* adding assert false to disabled code
2023-06-07 07:21:47 -05:00
Zhe Wang f8f8f72c4e
Add audit storage cancellation (#10386)
* list audits

* cancel audits and corresponding tests

* make audit storage dblock aware

* increase audit retry since we are able to cancel

* fix updateAuditState and fdb github ci

* fmt

* fix fdbcli audit_storage and fix CI issue

* fix fdb cli

* address comments

* fmt
2023-06-06 14:29:53 -07:00
He Liu fc8543125c
Added location_metadata fdbcli to query shard locations, assignments… (#10395)
* Added location_metadata fdbcli to query shard locations, assignments, numbers etc.

* Added `listshards` to get some random physical/non-physical shards.

* Resolved comments.
2023-06-06 10:33:48 -07:00
Zhe Wang 61aaca005e
SS Audit Storage Throttling (#10322)
* ss audit storage throttling

* add audit manager to ss

* reduce CONCURRENT_AUDIT_TASK_COUNT_MAX

* revises comments

* fix audit cli

* fix getAuditStates

* remove toStringForCLI
2023-05-29 14:43:47 -07:00
Hui Liu 7ca13d8f9c
support blob restore in fdbrestore (#10248) 2023-05-19 14:45:14 -07:00
Josh Slocum 2916a11a86
New ConsistencyScan (#10265)
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.

* Initial progress checkpoint on new ConsistencyScan role.

* Updated TODOs, finished most if not all state updates.

* placeholder

* Add more TODOs, documentation and comment improvements.

* Checkpoint round state to avoid advancing progress if commit fails.

* Bug fix, check is supposed to be for overlap, not lack of overlap.

* Added more TODO's and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.

* Update JSON schemas and command help.

* Add comment about lifetime stats reset.

* More TODO comments and some renames for clarity, some bug fixes.

* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail

* removing trailing comma from consistency_scan json schema

* Making CC inconsistency not an error if it's intentional tss corruption

* consistency scan actually reads storage locations

* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck

* made consistency scan properly fetch database size

* refactoring data check to be used in both consistency scan and consistency check

* checking that consistency scan always completes at least one round and doesn't get stuck

* cleanup

* fixing ide build

* consistencyscan fdbcli command wasn't actually changing db state

* consistencyscan fdbcli command always said enabled even when it wasn't

---------

Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
2023-05-18 15:02:41 -05:00
Sam Gwydir 6c16875c34
Add networkoption to disable non-TLS connections (#9984)
* Add networkoption to disable non-TLS connections

* add disable plaintext connection to fdbserver

* python doc

* Formatting

* Add tls disable plaintext connection to client api test

* review

* fix negative test

* formatting

* add TLS support to c client config tests

Adds support for TLS in the client and server separately

* add tests for disable_plaintext_connections

Test TLS and Plaintext Clusters and Clients

* Fix documentation

* Rename option to indicate it is client-only

* clearer formatting

* default to allowing plaintext connections

* add SetTLSDisablePlaintextConnection to go bindings
2023-05-13 00:14:11 +02:00
Zhe Wang 8559d4f1a8
Adding cleanup of old audit metadata (#10137)
* clean up old audit metadata

* change comments

* fix audit cleanup rule as PR description claims and reduce timeout of auditStorageCorrectness in tester

* address comment

* clear audit metadata should not throw error

* cleanup progress metadata by type

* control number of AuditStatistic events

* carefully persist new audit state

* add unit tests and fix issues

* cleanup

* allow audit concurrent run for different types and fix some bug in auditutl

* fix ci issue and nits
2023-05-10 19:32:04 -07:00
Yanqin Jin 01fddb7799
Add `ignore_capacity_limit` to `tenant create` (#10173)
Similar to `tenant configure`, this PR adds `ignore_capacity_limit` as an optional argument to `tenant create`.
This allows the user of fdbcli to create a new tenant on an **assigned** cluster, ignoring the tenant group capacity
on that specific cluster.
When creating a tenant with `ignore_capacity_limit`:
- If the user does not specify `assigned_cluster`, this is an error.
- If the user specifies `assigned_cluster`,
  - and does not specify `tenant_group`, then the new tenant will be an ungrouped tenant on the `assigned_cluster`, ignoring the capacity limit.
  - and specifies `tenant_group`,
    - if `tenant_group` does not exist, then the new tenant will be created on the assigned cluster and the tenant group will be implicitly created.
    - if `tenant_group` already exists, then an additional check makes sure the tenant group's cluster matches what the user specifies.
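
The rules above amount to a small decision tree, sketched below. The function name, the `existing_groups` mapping, and the returned strings are hypothetical and not part of fdbcli; treating a cluster mismatch as an error is also an assumption, since the description only says the match is checked.

```python
from typing import Dict, Optional


def plan_tenant_create(
    assigned_cluster: Optional[str],
    tenant_group: Optional[str],
    existing_groups: Dict[str, str],
) -> str:
    """Decide the outcome of `tenant create` with ignore_capacity_limit.

    existing_groups maps a tenant group name to the cluster it lives on
    (hypothetical representation for this sketch).
    """
    if assigned_cluster is None:
        # The description states that this combination is an error.
        raise ValueError("ignore_capacity_limit requires assigned_cluster")
    if tenant_group is None:
        return "ungrouped tenant on assigned cluster"
    if tenant_group not in existing_groups:
        return "tenant created, tenant group implicitly created"
    if existing_groups[tenant_group] != assigned_cluster:
        # Assumption: a failed match check is reported as an error.
        raise ValueError("tenant group is on a different cluster")
    return "tenant created in existing tenant group"
```

For example, `plan_tenant_create("c1", "g1", {"g1": "c1"})` takes the last branch, while passing `None` as the cluster raises immediately.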

Test plan:
Simulation and metacluster_fdbcli_tests.py
---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-10 13:19:32 -07:00
He Liu 66cd102821
Added `get_audit_status checkmigration` to print out the number of da… (#10188)
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`

* Fixed print format.

* Addressed comments.
2023-05-10 12:26:39 -07:00
Josh Slocum 9a2365daa8
fixing bugs with tenant_mode required on external clients and changin… (#10183)
* fixing bugs with tenant_mode required on external clients and changing test to find them

* Update fdbcli/BlobKeyCommand.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-09 13:41:58 -05:00
Jon Fu 7c5de05cdb
Separate failed and excluded servers on fdbcli output (#10089)
* separate failed and excluded servers on fdbcli output

* change formatting
2023-05-02 14:22:17 -04:00
A.J. Beamon 0035d9c519
Merge pull request #10074 from sfc-gh-ajbeamon/apply-black-format
Apply black format to most Python files
2023-05-02 08:20:47 -07:00
Yanqin Jin 8b1fe728be
Add configuration option `auto_tenant_assignment` to data clusters (#10058)
This PR adds auto_tenant_assignment option to register/configure data clusters.
Setting auto_tenant_assignment to disabled means the data cluster is a dedicated one and won't be
used for auto tenant assignment. This option is enabled by default (allowing auto tenant assignment).

Test plan:
simulation tests and metacluster_fdbcli_tests.py
---------

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-05-01 21:58:49 -07:00
Zhe Wang d6e7b5f736
Audit storage: validate consistency of replica and shard location metadata (#9628)
* Implemented AuditUtils.actor.cpp

Moved AuditUtils to fdbserver/

* Persist AuditStorageState.

* Passed persisted AuditStorageState test.

* Added audit_storage_error to indicate a corruption is caught.

Throw/Send audit_storage_error when there is a data corruption.

Added doAuditStorage() for resuming Audit.

* Load and resume AuditStorage when DD restarts.

* Generate audit id monotonically.

* Fixed minor issue AuditId/Type was not set.

* Adding getLatestAuditStates.

* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.

* Added `audit_storage` fdbcli command.

* fmt.

* Fixed null shared_ptr issue.

* Improve audit data.

* Change DDAuditFailed to SevWarn.

* Sev.

* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.

* Moved AuditUtils* to fdbclient/.

* Added getAuditStatus fdbcli command.

* Refactor audit storage fdb cli commands.

* Added auditStorage in sim.

* Cleanup.

* Resolved comments.

* Resolved comments.

* Added SystemData for metadata audit.

Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.

* Improvements.

* Persisted Failed state after too many retries.

* Added retryCount for resumeAuditStorage().

* resolving conflict.

* Resolved conflicts.

* allow-merged-to-run

* add timeout to audit client

* fmt

* validate replica

* add audit serverKey

* address comments and fmt

* fix audit_storage_exceeded_request_limit

* fix segfault in getLatestAuditStatesImpl

* fix bugs

* remove timeout from workload

* fix bugs

* audit local view of shard assignment

* fmt

* fix-stuck-issue-and-make-dd-audit-storage-self-retry

* fix timeout

* fix timeout

* fix bugs and cleanup

* fix nit

* change name state to coreState for audit metadata

* address comments

* code clean

* fmt

* setup debug

* cleanup

* clean up

* code cleanup

* code clean

* remove tmp file

* fmt

* trace portion of shards that of anonymous physical shard

* remove unnecessary actor cleanup

* do not give up when tr is too old

* address commits

* refactor

* clean

* fmt

* fix-command-help-text

* fix-auditstate-restore-and-enable-restore-to-metadata-audit

* address comments

* fmt

* debug and improve efficiency of resume audit

* small change

* fix audit cli

* bypass completed audit when dd restart

* fix auditStorageCommandActor

* make mismatch key range more visible

* address comments

* make local shard metadata check able to make progress through retries

* address comments

* address comments

* partition location metadata validation by range and server

* unset MIN_TRACE_SEVERITY

* address comments and SS auto proceed until failed then notify dd

* persistNewAuditState should checkMoveKeysLock

* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure

* code cleanup

* fix error message in metadata validation

* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation

* add comments to code and add guard to make sure the SS audit does not proceed automatically many times without being notified by DD --- to support audit cancellation later

* fix coalesceRangeList

* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo

* simplify shard assignment history

* shardAssignmentRecordRequests should be unordered_map

* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS

* only run validate shard info once at a time, other audit types do not have this limitation

---------

Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>
2023-05-01 10:35:52 -07:00
A.J. Beamon 182dc93ebd Apply black format to most Python files, excluding a few cases where we have Python 2 files and a few files written externally. Add external files as exclusions to the precommit checks. 2023-04-28 11:46:41 -07:00
Steve Atherton 69d6e43354 Added explanation of \u support to fdbcli token parsing. Small tweak to rangeconfig hints. Reformatted rangeconfig help to not be indented because the help printer does its own line wrapping which makes it look very messy. 2023-04-25 13:04:38 -07:00
Steve Atherton ab7b4c490e Add inline command line help for rangeconfig. 2023-04-25 12:16:04 -07:00
Steve Atherton b70ff34a66 Move custom shard test setup to a separate function. Add JSON utf-8 escaped bytes to fdbcli token parsing. 2023-04-25 10:48:54 -07:00
Steve Atherton 858b51a69b Address review comments. KeyRangeMapSnapshot is now ReferenceCounted and getSnapshot() returns a Reference to discourage copying. Added several comments for clarity. Added FormatUsingTraceable and changed all new formatters to use it except for Standalone<T> which redirects to the formatter for T. 2023-04-24 19:01:05 -07:00
Steve Atherton c57ed25987 Renamed SystemDBLockWriteNow() to SystemDBWriteLockedNow() and changed definition to be more direct / clear. 2023-04-22 13:17:41 -07:00
Steve Atherton 639d4d05ef Removed SYSTEM_PRIORITY_IMMEDIATE from KeyBackedTypes and all options from KeyBackedRangeMap database functions. Added SystemTransactionGenerator<> for wrapping Database types and generating transactions with selected system level options. 2023-04-21 19:00:29 -07:00
Steve Atherton 46cde666a5 Merge commit '9639192a88001043a104aeef0c394e99ca5d6a6e' into keybackedrangemap 2023-04-21 13:27:15 -07:00
Steve Atherton 948e2dd781 Bug fix in KeyBackedRangeMap::updateRange() where the range after the modified region could be set wrong. Added Database version of updateRange(). 2023-04-20 20:44:24 -07:00
Jon Fu a7cf82adb2
Update fdbcli tenant list function to take tenant group filter, support JSON, and report tenant IDs (#9967)
* fix metacluster get segfault

* update fdbcli tenant list function to take tenant group filter, support JSON, and report tenant IDs

* code review changes

* code formatting

* additional code review changes

* account for empty tenant groups

* reformat error catching in fdbcli command

* refactor json output and address code review comments

* add back mistakenly removed hint

* keep hints after 4th token

* add to tenant management workload

* fix compile error

* fix test range

* add more asserts to metacluster case

* nest test condition inside if block

* adjust tenant test layout

* refactor some test files

* reorganize test workload logic
2023-04-20 16:22:47 -04:00
Steve Atherton 2553aed118 KeyBackedRangeMap::updateRange() now coalesces adjacent matching ranges caused by the update, and supports replacing a range's config with a new explicit value. Added update command to rangeconfig cli. 2023-04-20 13:02:04 -07:00
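
The coalescing behavior named in the `updateRange()` commit above can be illustrated with a minimal in-memory range map. This is a sketch under stated assumptions, not the `KeyBackedRangeMap` implementation, which stores its boundaries in database keys: `RangeMap`, `update_range`, and the list-based layout are hypothetical.

```python
import bisect


class RangeMap:
    """Maps half-open key ranges to values using sorted boundary keys."""

    def __init__(self, default=None):
        self.keys = [""]       # "" marks the start of the keyspace
        self.vals = [default]  # vals[i] applies to [keys[i], keys[i+1])

    def get(self, key):
        # Find the last boundary at or before `key`.
        return self.vals[bisect.bisect_right(self.keys, key) - 1]

    def update_range(self, begin, end, value):
        assert begin < end
        end_val = self.get(end)  # value that must resume at `end`
        lo = bisect.bisect_left(self.keys, begin)
        hi = bisect.bisect_right(self.keys, end)
        # Drop every boundary inside [begin, end], then re-add both endpoints.
        self.keys[lo:hi] = [begin, end]
        self.vals[lo:hi] = [value, end_val]
        self._coalesce()

    def _coalesce(self):
        # Remove boundaries whose value equals the preceding range's value,
        # so adjacent matching ranges merge into one.
        keys, vals = [], []
        for k, v in zip(self.keys, self.vals):
            if vals and vals[-1] == v:
                continue
            keys.append(k)
            vals.append(v)
        self.keys, self.vals = keys, vals


m = RangeMap()
m.update_range("b", "d", "x")
m.update_range("d", "f", "x")  # adjacent range with the same value merges
```

After the two updates, the map holds a single range `["b", "f")` with value `"x"` instead of two adjacent ranges. The full rescan in `_coalesce` keeps the sketch short; only the two boundaries around the update actually need checking.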
Steve Atherton a164f8fa9d Add rangeconfig CLI. 2023-04-19 22:19:55 -07:00
Steve Atherton 53ee26d758 Changed KeyBackedTypes to an actor file. Added TypedKeySelectors for Map and Set classes and getRange() keySelector methods. Added debug macro for KeyBackedTypes. Rewrote KeyBackedRangeMap using keyselectors on KeyBackedMap. 2023-04-18 22:21:19 -07:00
Yanqin Jin 2959d07797
Add test coverage for metacluster operations via fdbcli (#9802)
Add test coverage for metacluster operations via fdbcli

Test plan:

```bash
mkdir build && cd build && cmake -G Ninja ..
ninja fdbcli fdbserver fdbmonitor
ctest -R metacluster_fdbcli_tests
```
2023-04-14 07:42:55 -07:00
Chaoguang Lin b9935ef6b4 Add wait at the end of versionepoch which triggers recovery; add start&end logging of each command test 2023-04-04 17:05:47 -07:00