From 2676f1185633297f6be1435d16acf3d0caf8001d Mon Sep 17 00:00:00 2001
From: Zhe Wang
Date: Mon, 8 Apr 2024 15:50:55 -0700
Subject: [PATCH] address comments

---
 .../source/consistency-check-urgent.rst | 26 ++++++++++---------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/documentation/sphinx/source/consistency-check-urgent.rst b/documentation/sphinx/source/consistency-check-urgent.rst
index 372bfbac26..eb754ccecb 100644
--- a/documentation/sphinx/source/consistency-check-urgent.rst
+++ b/documentation/sphinx/source/consistency-check-urgent.rst
@@ -6,15 +6,15 @@ Consistency Checker Urgent
 | Reviewer: Jingyu Zhou
 | Audience: FDB developers, SREs and expert users.

-In a FoundationDB (FDB) key-value cluster, every key-value pair is copied across multiple storage servers.
-The Consistency Checker Urgent tool is used to validate the consistency of all replica for each key-value pair.
+In a FoundationDB (FDB) key-value cluster, every key-value pair is replicated across multiple storage servers.
+The Consistency Checker Urgent tool can be used to validate the consistency of all replicas for each key-value pair.
 If any data inconsistency is detected, the tool generates ConsistencyCheck_DataInconsistent trace events for the corresponding shard.
 There are two types of data inconsistencies:

 1. Value mismatch, where the value of a key on one server differs from that on another server,
 2. Unique key, where a key exists on one server but not on another.

-The ConsistencyCheck_DataInconsistent trace event helps differentiate between these two types of corruption.
+The ConsistencyCheck_DataInconsistent trace event differentiates between these two types of corruption.

 Key features
 ============
@@ -23,17 +23,19 @@ The Consistency Checker Urgent tool is designed to ensure safe, fast, and compre

 * End-to-end completeness check --- The checker continues until all ranges are marked as complete.
 * Scalability --- Adding more testers results in nearly linear speedup with the number of testers.
-* Progress monitoring --- A single trace event indicates the remaining number of shards to check.
-* Independence from the existing cluster --- The tool does not store any data in the FDB system key space.
+* Progress monitoring --- A single trace event (i.e. ConsistencyCheckUrgent_GotShardsToCheck) indicates the remaining number of shards to check.
+* Independence from the existing cluster --- The tool does not store any data in the cluster being checked.
 * Fault tolerance --- Tester failures do not impact the checker process. Shard checking failures are automatically retried.
 * Workload throttling --- Each tester can handle one task at a time, with a maximum read rate of 50MB/s (though the actual value should be much smaller).
+* Custom input ranges --- Users can specify at most 4 custom ranges in knobs. By default, the knob is set to check the entire key space (i.e., " " ~ "\xff\xff").

 How to use?
 ===========

 To run the ConsistencyCheckerUrgent, you need 1 checker and N testers. The process is as follows:

-* Start N testers.
-* Start the checker, which initiates the consistency checking automatically.
+* If you want to check consistency within specific ranges, set ranges via knobs: CONSISTENCY_CHECK_URGENT_RANGE_BEGIN_* and CONSISTENCY_CHECK_URGENT_RANGE_END_*. The custom range's start and end points must be represented in hexadecimal ASCII format, strictly adhering to the "\\x" prefix. By default, the knob is set to " " ~ "\\xff\\xff" to check the entire key space (i.e., " " ~ "\xff\xff").
+* Start N testers (i.e. fdbserver --class test).
+* Start the checker (i.e. fdbserver -r consistencycheckurgent --num-testers={num-testers}), which initiates the consistency checking automatically.
 * Once the checking is complete, the checker exits automatically, leaving the testers alive but idle.
 * If you need to rerun the checking, simply restart the checker process.
@@ -67,19 +69,19 @@ Workflow
 The checker operates in the following steps:

-1. Initially, the checker loads the range to check from a knob (which defaults to " " ~ "\xff\xff").
+1. Initially, the checker loads ranges to check from knobs (which default to " " ~ "\xff\xff"). See: CONSISTENCY_CHECK_URGENT_RANGE_BEGIN_* and CONSISTENCY_CHECK_URGENT_RANGE_END_*.
 2. The checker contacts the cluster controller (CC) to obtain the tester interfaces.
-3. The checker continues to contact the CC until it receives a sufficient number of testers (as set by the knob).
+3. The checker continues to contact the CC until it receives a sufficient number of testers (as specified by the --num-testers option when starting the consistencycheckurgent process).
 4. After collecting enough testers, the checker partitions the range to be checked according to the shard boundary (shard information is retrieved from the FDB system metadata).
-5. The checker assigns shards to the collected testers. Each tester is assigned 10 shards at a time, and these 10 shards are expected to be completed in approximately 1 hour.
+5. The checker distributes shards among the selected testers. Each tester is allocated a fixed number of shards at a time (as specified by the knob CONSISTENCY_CHECK_URGENT_BATCH_SHARD_COUNT). Currently, the knob is set to 10 by default and testers are expected to complete these shards in approximately 1 hour.
 6. The checker sends assignments to the testers, and the testers begin processing their assigned ranges. This is done in batches; the checker sends assignments to all testers simultaneously and waits for replies from all testers. The checker proceeds only after receiving replies from all testers or if a tester fails or crashes.
 7. When a tester completes its assigned shards, it reports whether it has completed all assigned shards. If so, the checker marks the assigned shards as complete.
 Otherwise, the checker marks all the assigned shards to that tester as failed and retries them later.
 8. The checker collects all unfinished shards from memory and returns to step 2.
-9. When the entire key space is marked as complete, the checker terminates.
+9. When the entire key space is marked as complete, the checker terminates and emits a single trace event, ConsistencyCheckUrgent_Complete.

 The tester operates in the following steps:

-1. The tester receives ranges to check from the checker, handling 10 shards at a time.
+1. The tester receives a batch of ranges to check from the checker, with the batch size specified by CONSISTENCY_CHECK_URGENT_BATCH_SHARD_COUNT.
 2. For each shard, the tester obtains the storage server (SS) interfaces of all data centers.
 3. The tester issues a read range request to each SS interface, ensuring they are at the same version.
 4. Key by key, the tester compares the values and records any inconsistencies, populating ConsistencyCheck_DataInconsistent in the presence of shard inconsistency.
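
As an illustration of the key-by-key comparison in tester step 4, the two inconsistency types described in the document (value mismatch and unique key) can be told apart as sketched below. This is a minimal Python sketch under stated assumptions, not the FDB implementation: `compare_replicas` and the in-memory dictionaries are hypothetical stand-ins for range reads taken from two storage servers at the same version.

```python
from typing import Dict, List, Tuple


def compare_replicas(
    a: Dict[bytes, bytes], b: Dict[bytes, bytes]
) -> List[Tuple[str, bytes]]:
    """Classify differences between two replicas of the same shard.

    Hypothetical helper: each dict stands in for the key-value pairs
    returned by one storage server for the same range and version.
    """
    findings: List[Tuple[str, bytes]] = []
    # Walk the union of keys in sorted order, mirroring a key-by-key scan.
    for key in sorted(a.keys() | b.keys()):
        if key not in a or key not in b:
            # The key exists on one server but not on the other.
            findings.append(("unique_key", key))
        elif a[key] != b[key]:
            # The key exists on both servers with different values.
            findings.append(("value_mismatch", key))
    return findings


replica_1 = {b"apple": b"1", b"banana": b"2", b"cherry": b"3"}
replica_2 = {b"apple": b"1", b"banana": b"9"}
print(compare_replicas(replica_1, replica_2))
# prints [('value_mismatch', b'banana'), ('unique_key', b'cherry')]
```

In the real tool, each finding would correspond to a ConsistencyCheck_DataInconsistent trace event for the shard rather than an entry in a returned list.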