sync with upstream main

Fuheng Zhao 2022-08-31 15:46:39 -07:00
parent 620c119e9a
commit 0aa096dc17
367 changed files with 27212 additions and 4712 deletions

View File

@ -20,7 +20,7 @@ If you have questions, we encourage you to engage in discussion on the [communit
## Before you get started
### Community Guidelines
We want the FoundationDB community to be as welcoming and inclusive as possible, and have adopted a [Code of Conduct](CODE_OF_CONDUCT.md) that we ask all community members to read and observe.
We want the FoundationDB community to be as welcoming and inclusive as possible, and have adopted a [Code of Conduct](CODE_OF_CONDUCT.md) that we ask all community members to read and abide by.
### Project Licensing
By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the Apache 2.0 license.
@ -34,7 +34,7 @@ Members of the Apple FoundationDB team are part of the core committers helping r
## Contributing
### Opening a Pull Request
We love pull requests! For minor changes, feel free to open up a PR directly. For larger feature development and any changes that may require community discussion, we ask that you discuss your ideas on the [community forums](https://forums.foundationdb.org) prior to opening a PR, and then reference that thread within your PR comment. Please refer to [FoundationDB Commit Process](https://github.com/apple/foundationdb/wiki/FoundationDB-Commit-Process) for more detailed guidelines.
We love pull requests! For minor changes, feel free to open up a PR directly. For larger feature development and any changes that may require community discussion, we ask that you discuss your ideas on the [community forums](https://forums.foundationdb.org) prior to opening a PR, and then reference that thread within your PR comment. Please refer to the [FoundationDB Commit Process](https://github.com/apple/foundationdb/wiki/FoundationDB-Commit-Process) for more detailed guidelines.
CI will be run automatically for core committers, and for community PRs it will be initiated by the request of a core committer. Tests can also be run locally via `ctest`, and core committers can run additional validation on pull requests prior to merging them.
@ -46,10 +46,10 @@ To report a security issue, please **DO NOT** start by filing a public issue or
## Project Communication
### Community Forums
We encourage your participation asking questions and helping improve the FoundationDB project. Check out the [FoundationDB community forums](https://forums.foundationdb.org), which serve a similar function as mailing lists in many open source projects. The forums are organized into three sections:
We encourage your participation asking questions and helping improve the FoundationDB project. Check out the [FoundationDB community forums](https://forums.foundationdb.org), which serve a similar function as mailing lists in many open source projects. The forums are organized into three categories:
* [Development](https://forums.foundationdb.org/c/development): For discussing the internals and development of the FoundationDB core, as well as layers.
* [Using FoundationDB](https://forums.foundationdb.org/c/using-foundationdb): For discussing user-facing topics. Getting started and have a question? This is the place for you.
* [Using FoundationDB](https://forums.foundationdb.org/c/using-foundationdb): For discussing user-facing topics. Getting started and have a question? This is the category for you.
* [Site Feedback](https://forums.foundationdb.org/c/site-feedback): A category for discussing the forums and the OSS project, its organization, how it works, and how we can improve it.
### Using GitHub Issues and Community Forums
@ -63,4 +63,4 @@ GitHub Issues should be used for tracking tasks. If you know the specific code t
* Implementing an agreed upon feature: *GitHub Issues*
### Project and Development Updates
Stay connected to the project and the community! For project and community updates, follow the [FoundationDB project blog](https://www.foundationdb.org/blog/). Development announcements will be made via the community forums' [dev-announce](https://forums.foundationdb.org/c/development/dev-announce) section.
Stay connected to the project and the community! For project and community updates, follow the [FoundationDB project blog](https://www.foundationdb.org/blog/). Development announcements will be made via the community forums' [dev-announce](https://forums.foundationdb.org/c/development/dev-announce) category.

View File

@ -139,8 +139,12 @@ if(NOT WIN32)
test/apitester/TesterTestSpec.cpp
test/apitester/TesterTestSpec.h
test/apitester/TesterBlobGranuleCorrectnessWorkload.cpp
test/apitester/TesterBlobGranuleErrorsWorkload.cpp
test/apitester/TesterBlobGranuleUtil.cpp
test/apitester/TesterBlobGranuleUtil.h
test/apitester/TesterCancelTransactionWorkload.cpp
test/apitester/TesterCorrectnessWorkload.cpp
test/apitester/TesterExampleWorkload.cpp
test/apitester/TesterKeyValueStore.cpp
test/apitester/TesterKeyValueStore.h
test/apitester/TesterOptions.h
@ -332,6 +336,24 @@ if(NOT WIN32)
@SERVER_CA_FILE@
)
add_test(NAME fdb_c_upgrade_to_future_version
COMMAND ${CMAKE_SOURCE_DIR}/tests/TestRunner/upgrade_test.py
--build-dir ${CMAKE_BINARY_DIR}
--test-file ${CMAKE_SOURCE_DIR}/bindings/c/test/apitester/tests/upgrade/MixedApiWorkloadMultiThr.toml
--upgrade-path "7.2.0" "7.3.0" "7.2.0"
--process-number 3
)
set_tests_properties("fdb_c_upgrade_to_future_version" PROPERTIES ENVIRONMENT "${SANITIZER_OPTIONS}")
add_test(NAME fdb_c_upgrade_to_future_version_blob_granules
COMMAND ${CMAKE_SOURCE_DIR}/tests/TestRunner/upgrade_test.py
--build-dir ${CMAKE_BINARY_DIR}
--test-file ${CMAKE_SOURCE_DIR}/bindings/c/test/apitester/tests/upgrade/ApiBlobGranulesCorrectness.toml
--upgrade-path "7.2.0" "7.3.0" "7.2.0"
--blob-granules-enabled
--process-number 3
)
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64" AND NOT USE_SANITIZER)
add_test(NAME fdb_c_upgrade_single_threaded_630api
COMMAND ${CMAKE_SOURCE_DIR}/tests/TestRunner/upgrade_test.py
@ -439,7 +461,7 @@ if (OPEN_FOR_IDE)
target_link_libraries(fdb_c_shim_lib_tester PRIVATE fdb_c_shim SimpleOpt fdb_cpp Threads::Threads)
target_include_directories(fdb_c_shim_lib_tester PUBLIC ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR}/foundationdb/ ${CMAKE_SOURCE_DIR}/flow/include)
elseif(NOT WIN32 AND NOT APPLE AND NOT USE_UBSAN) # Linux Only, non-ubsan only
elseif(NOT WIN32 AND NOT APPLE AND NOT USE_SANITIZER) # Linux Only, non-sanitizer only
set(SHIM_LIB_OUTPUT_DIR ${CMAKE_CURRENT_BINARY_DIR})
@ -465,7 +487,7 @@ elseif(NOT WIN32 AND NOT APPLE AND NOT USE_UBSAN) # Linux Only, non-ubsan only
DEPENDS ${IMPLIBSO_SRC}
COMMENT "Generating source code for C shim library")
add_library(fdb_c_shim SHARED ${SHIM_LIB_GEN_SRC} foundationdb/fdb_c_shim.h fdb_c_shim.cpp)
add_library(fdb_c_shim STATIC ${SHIM_LIB_GEN_SRC} foundationdb/fdb_c_shim.h fdb_c_shim.cpp)
target_link_options(fdb_c_shim PRIVATE "LINKER:--version-script=${CMAKE_CURRENT_SOURCE_DIR}/fdb_c.map,-z,nodelete,-z,noexecstack")
target_link_libraries(fdb_c_shim PUBLIC dl)
target_include_directories(fdb_c_shim PUBLIC
@ -492,7 +514,7 @@ elseif(NOT WIN32 AND NOT APPLE AND NOT USE_UBSAN) # Linux Only, non-ubsan only
--api-test-dir ${CMAKE_SOURCE_DIR}/bindings/c/test/apitester/tests
)
endif() # End Linux only, non-ubsan only
endif() # End Linux only, non-sanitizer only
# TODO: re-enable once the old vcxproj-based build system is removed.
#generate_export_header(fdb_c EXPORT_MACRO_NAME "DLLEXPORT"
@ -537,7 +559,7 @@ fdb_install(
DESTINATION_SUFFIX "/cmake/${targets_export_name}"
COMPONENT clients)
if(NOT WIN32 AND NOT APPLE AND NOT USE_UBSAN) # Linux Only, non-ubsan only
if(NOT WIN32 AND NOT APPLE AND NOT USE_SANITIZER) # Linux Only, non-sanitizer only
fdb_install(
FILES foundationdb/fdb_c_shim.h

View File

@ -79,9 +79,10 @@ extern "C" DLLEXPORT fdb_bool_t fdb_error_predicate(int predicate_test, fdb_erro
if (predicate_test == FDBErrorPredicates::RETRYABLE_NOT_COMMITTED) {
return code == error_code_not_committed || code == error_code_transaction_too_old ||
code == error_code_future_version || code == error_code_database_locked ||
code == error_code_proxy_memory_limit_exceeded || code == error_code_batch_transaction_throttled ||
code == error_code_process_behind || code == error_code_tag_throttled ||
code == error_code_unknown_tenant;
code == error_code_grv_proxy_memory_limit_exceeded ||
code == error_code_commit_proxy_memory_limit_exceeded ||
code == error_code_batch_transaction_throttled || code == error_code_process_behind ||
code == error_code_tag_throttled || code == error_code_unknown_tenant;
}
return false;
}
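For context, this predicate is the client-facing way to decide whether a failed, possibly non-idempotent transaction can be retried without risking a double commit. Below is a minimal caller-side sketch, not part of this commit; it assumes API version 720 and the generated C constant `FDB_ERROR_PREDICATE_RETRYABLE_NOT_COMMITTED`, and the helper name is illustrative.

```c
#define FDB_API_VERSION 720 /* assumed API version for this sketch */
#include <foundationdb/fdb_c.h>

/* Returns non-zero when the error guarantees the commit was NOT applied,
 * so even a non-idempotent transaction can safely be retried. The list now
 * includes the split grv/commit proxy memory-limit errors added above. */
int safe_to_retry_non_idempotent(fdb_error_t err) {
    return fdb_error_predicate(FDB_ERROR_PREDICATE_RETRYABLE_NOT_COMMITTED, err);
}
```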
@ -238,6 +239,10 @@ fdb_error_t fdb_future_get_version_v619(FDBFuture* f, int64_t* out_version) {
CATCH_AND_RETURN(*out_version = TSAV(Version, f)->get(););
}
extern "C" DLLEXPORT fdb_error_t fdb_future_get_bool(FDBFuture* f, fdb_bool_t* out_value) {
CATCH_AND_RETURN(*out_value = TSAV(bool, f)->get(););
}
extern "C" DLLEXPORT fdb_error_t fdb_future_get_int64(FDBFuture* f, int64_t* out_value) {
CATCH_AND_RETURN(*out_value = TSAV(int64_t, f)->get(););
}
@ -493,6 +498,54 @@ extern "C" DLLEXPORT FDBFuture* fdb_database_wait_purge_granules_complete(FDBDat
FDBFuture*)(DB(db)->waitPurgeGranulesComplete(StringRef(purge_key_name, purge_key_name_length)).extractPtr());
}
extern "C" DLLEXPORT FDBFuture* fdb_database_blobbify_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length) {
return (FDBFuture*)(DB(db)
->blobbifyRange(KeyRangeRef(StringRef(begin_key_name, begin_key_name_length),
StringRef(end_key_name, end_key_name_length)))
.extractPtr());
}
extern "C" DLLEXPORT FDBFuture* fdb_database_unblobbify_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length) {
return (FDBFuture*)(DB(db)
->unblobbifyRange(KeyRangeRef(StringRef(begin_key_name, begin_key_name_length),
StringRef(end_key_name, end_key_name_length)))
.extractPtr());
}
extern "C" DLLEXPORT FDBFuture* fdb_database_list_blobbified_ranges(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int rangeLimit) {
return (FDBFuture*)(DB(db)
->listBlobbifiedRanges(KeyRangeRef(StringRef(begin_key_name, begin_key_name_length),
StringRef(end_key_name, end_key_name_length)),
rangeLimit)
.extractPtr());
}
extern "C" DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_verify_blob_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t version) {
return (FDBFuture*)(DB(db)
->verifyBlobRange(KeyRangeRef(StringRef(begin_key_name, begin_key_name_length),
StringRef(end_key_name, end_key_name_length)),
version)
.extractPtr());
}
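A hedged usage sketch for the new database-level range-management calls (not part of this commit, assuming API version 720): it shows only `fdb_database_blobbify_range`, whose future resolves to a boolean readable through the new `fdb_future_get_bool` accessor; the helper name is illustrative.

```c
#define FDB_API_VERSION 720 /* assumed API version for this sketch */
#include <foundationdb/fdb_c.h>
#include <string.h>

/* Marks [begin, end) for blobbification and reports whether the range was accepted. */
fdb_error_t blobbify_range(FDBDatabase* db, const char* begin, const char* end, fdb_bool_t* accepted) {
    FDBFuture* f = fdb_database_blobbify_range(db,
                                               (const uint8_t*)begin, (int)strlen(begin),
                                               (const uint8_t*)end, (int)strlen(end));
    fdb_error_t err = fdb_future_block_until_ready(f);
    if (!err)
        err = fdb_future_get_bool(f, accepted); /* also surfaces the future's error, if any */
    fdb_future_destroy(f);
    return err;
}
```

The other calls follow the same pattern: `fdb_database_unblobbify_range` also yields a boolean, `fdb_database_list_blobbified_ranges` is read with `fdb_future_get_keyrange_array`, and `fdb_database_verify_blob_range` with `fdb_future_get_int64`.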
extern "C" DLLEXPORT fdb_error_t fdb_tenant_create_transaction(FDBTenant* tenant, FDBTransaction** out_transaction) {
CATCH_AND_RETURN(*out_transaction = (FDBTransaction*)TENANT(tenant)->createTransaction().extractPtr(););
}
@ -855,11 +908,12 @@ extern "C" DLLEXPORT FDBFuture* fdb_transaction_get_blob_granule_ranges(FDBTrans
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length) {
int end_key_name_length,
int rangeLimit) {
RETURN_FUTURE_ON_ERROR(
Standalone<VectorRef<KeyRangeRef>>,
KeyRangeRef range(KeyRef(begin_key_name, begin_key_name_length), KeyRef(end_key_name, end_key_name_length));
return (FDBFuture*)(TXN(tr)->getBlobGranuleRanges(range).extractPtr()););
return (FDBFuture*)(TXN(tr)->getBlobGranuleRanges(range, rangeLimit).extractPtr()););
}
extern "C" DLLEXPORT FDBResult* fdb_transaction_read_blob_granules(FDBTransaction* tr,
@ -889,6 +943,57 @@ extern "C" DLLEXPORT FDBResult* fdb_transaction_read_blob_granules(FDBTransactio
return (FDBResult*)(TXN(tr)->readBlobGranules(range, beginVersion, rv, context).extractPtr()););
}
extern "C" DLLEXPORT FDBFuture* fdb_transaction_read_blob_granules_start(FDBTransaction* tr,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t beginVersion,
int64_t readVersion,
int64_t* readVersionOut) {
Optional<Version> rv;
if (readVersion != latestVersion) {
rv = readVersion;
}
return (FDBFuture*)(TXN(tr)
->readBlobGranulesStart(KeyRangeRef(KeyRef(begin_key_name, begin_key_name_length),
KeyRef(end_key_name, end_key_name_length)),
beginVersion,
rv,
readVersionOut)
.extractPtr());
}
extern "C" DLLEXPORT FDBResult* fdb_transaction_read_blob_granules_finish(FDBTransaction* tr,
FDBFuture* f,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t beginVersion,
int64_t readVersion,
FDBReadBlobGranuleContext* granule_context) {
// FIXME: better way to convert?
ReadBlobGranuleContext context;
context.userContext = granule_context->userContext;
context.start_load_f = granule_context->start_load_f;
context.get_load_f = granule_context->get_load_f;
context.free_load_f = granule_context->free_load_f;
context.debugNoMaterialize = granule_context->debugNoMaterialize;
context.granuleParallelism = granule_context->granuleParallelism;
ThreadFuture<Standalone<VectorRef<BlobGranuleChunkRef>>> startFuture(
TSAV(Standalone<VectorRef<BlobGranuleChunkRef>>, f));
return (FDBResult*)(TXN(tr)
->readBlobGranulesFinish(startFuture,
KeyRangeRef(KeyRef(begin_key_name, begin_key_name_length),
KeyRef(end_key_name, end_key_name_length)),
beginVersion,
readVersion,
context)
.extractPtr());
}
#include "fdb_c_function_pointers.g.h"
#define FDB_API_CHANGED(func, ver) \
@ -964,6 +1069,10 @@ extern "C" DLLEXPORT const char* fdb_get_client_version() {
return API->getClientVersion();
}
extern "C" DLLEXPORT void fdb_use_future_protocol_version() {
API->useFutureProtocolVersion();
}
#if defined(__APPLE__)
#include <dlfcn.h>
__attribute__((constructor)) static void initialize() {

View File

@ -227,6 +227,8 @@ DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_set_callback(FDBFuture* f,
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_get_error(FDBFuture* f);
#endif
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_get_bool(FDBFuture* f, fdb_bool_t* out);
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_get_int64(FDBFuture* f, int64_t* out);
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_get_uint64(FDBFuture* f, uint64_t* out);
@ -321,6 +323,32 @@ DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_wait_purge_granules_complet
uint8_t const* purge_key_name,
int purge_key_name_length);
DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_blobbify_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length);
DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_unblobbify_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length);
DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_list_blobbified_ranges(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int rangeLimit);
DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_database_verify_blob_range(FDBDatabase* db,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t version);
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_tenant_create_transaction(FDBTenant* tenant,
FDBTransaction** out_transaction);
@ -479,7 +507,8 @@ DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_transaction_get_blob_granule_ranges(
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length);
int end_key_name_length,
int rangeLimit);
/* LatestVersion (-2) for readVersion means get read version from transaction
Separated out as optional because BG reads can support longer-lived reads than normal FDB transactions */
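As a rough illustration of the new `rangeLimit` parameter (not part of this commit, assuming API version 720 and the existing `fdb_future_get_keyrange_array` accessor), a caller might enumerate granule boundaries like this:

```c
#define FDB_API_VERSION 720 /* assumed API version for this sketch */
#include <foundationdb/fdb_c.h>
#include <stdio.h>

/* Prints up to 1000 granule ranges covering [begin, end), mirroring the limit the testers use. */
fdb_error_t print_granule_ranges(FDBTransaction* tr,
                                 const uint8_t* begin, int begin_len,
                                 const uint8_t* end, int end_len) {
    FDBFuture* f = fdb_transaction_get_blob_granule_ranges(tr, begin, begin_len, end, end_len, 1000);
    fdb_error_t err = fdb_future_block_until_ready(f);
    if (!err) {
        const FDBKeyRange* ranges;
        int count;
        err = fdb_future_get_keyrange_array(f, &ranges, &count);
        for (int i = 0; !err && i < count; i++) {
            printf("[%.*s, %.*s)\n",
                   ranges[i].begin_key_length, (const char*)ranges[i].begin_key,
                   ranges[i].end_key_length, (const char*)ranges[i].end_key);
        }
    }
    fdb_future_destroy(f);
    return err;
}
```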

View File

@ -49,6 +49,29 @@ DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_future_get_shared_state(FDBFuture*
DLLEXPORT WARN_UNUSED_RESULT fdb_error_t fdb_create_database_from_connection_string(const char* connection_string,
FDBDatabase** out_database);
DLLEXPORT void fdb_use_future_protocol_version();
// the logical read_blob_granules is broken out (at different points depending on the client type) into the asynchronous
// start() that happens on the fdb network thread, and synchronous finish() that happens off it
DLLEXPORT FDBFuture* fdb_transaction_read_blob_granules_start(FDBTransaction* tr,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t beginVersion,
int64_t readVersion,
int64_t* readVersionOut);
DLLEXPORT FDBResult* fdb_transaction_read_blob_granules_finish(FDBTransaction* tr,
FDBFuture* f,
uint8_t const* begin_key_name,
int begin_key_name_length,
uint8_t const* end_key_name,
int end_key_name_length,
int64_t beginVersion,
int64_t readVersion,
FDBReadBlobGranuleContext* granuleContext);
#ifdef __cplusplus
}
#endif
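To make the start()/finish() split described in the comment above concrete, here is a rough caller-side sketch, not part of this commit. It assumes API version 720, that this internal header is reachable as `foundationdb/fdb_c_internal.h`, and that granule files are readable under a local base path, as in `TesterBlobGranuleUtil.cpp` above; the helper and callback names are illustrative and error handling is minimal.

```c
#define FDB_API_VERSION 720 /* assumed API version for this sketch */
#include <foundationdb/fdb_c.h>
#include <foundationdb/fdb_c_internal.h> /* declares the start()/finish() split above */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy in-progress load table keyed by load id; a real client would track loads per request. */
#define MAX_LOADS 64
static uint8_t* g_loads[MAX_LOADS];
static int64_t g_next_id = 0;

static int64_t start_load(const char* name, int name_len, int64_t off, int64_t len, int64_t full_len, void* ctx) {
    char path[1024];
    snprintf(path, sizeof(path), "%s%.*s", (const char*)ctx, name_len, name);
    int64_t id = g_next_id++;
    uint8_t* buf = (uint8_t*)malloc((size_t)len);
    FILE* fp = fopen(path, "rb");
    if (fp) {
        fseek(fp, (long)off, SEEK_SET);
        size_t n = fread(buf, 1, (size_t)len, fp);
        (void)n;
        fclose(fp);
    }
    (void)full_len;
    g_loads[id % MAX_LOADS] = buf;
    return id;
}
static uint8_t* get_load(int64_t id, void* ctx) { (void)ctx; return g_loads[id % MAX_LOADS]; }
static void free_load(int64_t id, void* ctx) { (void)ctx; free(g_loads[id % MAX_LOADS]); g_loads[id % MAX_LOADS] = NULL; }

/* The asynchronous start() runs on the fdb network thread; once its future is ready,
 * the synchronous finish() materializes the result off that thread using the callbacks. */
FDBResult* read_granules_split(FDBTransaction* tr,
                               const uint8_t* begin, int begin_len,
                               const uint8_t* end, int end_len,
                               const char* base_path) {
    int64_t read_version_out = -1;
    FDBFuture* start = fdb_transaction_read_blob_granules_start(
        tr, begin, begin_len, end, end_len, 0 /* beginVersion */, -2 /* latestVersion */, &read_version_out);
    if (fdb_future_block_until_ready(start) != 0) {
        fdb_future_destroy(start);
        return NULL;
    }

    FDBReadBlobGranuleContext ctx;
    ctx.userContext = (void*)base_path;
    ctx.start_load_f = &start_load;
    ctx.get_load_f = &get_load;
    ctx.free_load_f = &free_load;
    ctx.debugNoMaterialize = 0;
    ctx.granuleParallelism = 1;

    /* Cleanup of the start future and full error handling are elided for brevity. */
    return fdb_transaction_read_blob_granules_finish(
        tr, start, begin, begin_len, end, end_len, 0, read_version_out, &ctx);
}
```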

View File

@ -18,61 +18,13 @@
* limitations under the License.
*/
#include "TesterApiWorkload.h"
#include "TesterBlobGranuleUtil.h"
#include "TesterUtil.h"
#include <memory>
#include <fmt/format.h>
namespace FdbApiTester {
class TesterGranuleContext {
public:
std::unordered_map<int64_t, uint8_t*> loadsInProgress;
int64_t nextId = 0;
std::string basePath;
~TesterGranuleContext() {
// if there was an error or not all loads finished, delete data
for (auto& it : loadsInProgress) {
uint8_t* dataToFree = it.second;
delete[] dataToFree;
}
}
};
static int64_t granule_start_load(const char* filename,
int filenameLength,
int64_t offset,
int64_t length,
int64_t fullFileLength,
void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
int64_t loadId = ctx->nextId++;
uint8_t* buffer = new uint8_t[length];
std::ifstream fin(ctx->basePath + std::string(filename, filenameLength), std::ios::in | std::ios::binary);
fin.seekg(offset);
fin.read((char*)buffer, length);
ctx->loadsInProgress.insert({ loadId, buffer });
return loadId;
}
static uint8_t* granule_get_load(int64_t loadId, void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
return ctx->loadsInProgress.at(loadId);
}
static void granule_free_load(int64_t loadId, void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
auto it = ctx->loadsInProgress.find(loadId);
uint8_t* dataToFree = it->second;
delete[] dataToFree;
ctx->loadsInProgress.erase(it);
}
class ApiBlobGranuleCorrectnessWorkload : public ApiWorkload {
public:
ApiBlobGranuleCorrectnessWorkload(const WorkloadConfig& config) : ApiWorkload(config) {
@ -80,9 +32,12 @@ public:
if (Random::get().randomInt(0, 1) == 0) {
excludedOpTypes.push_back(OP_CLEAR_RANGE);
}
// FIXME: remove! this bug is fixed in another PR
excludedOpTypes.push_back(OP_GET_RANGES);
}
private:
// FIXME: use other new blob granule apis!
enum OpType { OP_INSERT, OP_CLEAR, OP_CLEAR_RANGE, OP_READ, OP_GET_RANGES, OP_LAST = OP_GET_RANGES };
std::vector<OpType> excludedOpTypes;
@ -101,16 +56,8 @@ private:
execTransaction(
[this, begin, end, results, tooOld](auto ctx) {
ctx->tx().setOption(FDB_TR_OPTION_READ_YOUR_WRITES_DISABLE);
TesterGranuleContext testerContext;
testerContext.basePath = ctx->getBGBasePath();
fdb::native::FDBReadBlobGranuleContext granuleContext;
granuleContext.userContext = &testerContext;
granuleContext.debugNoMaterialize = false;
granuleContext.granuleParallelism = 1;
granuleContext.start_load_f = &granule_start_load;
granuleContext.get_load_f = &granule_get_load;
granuleContext.free_load_f = &granule_free_load;
TesterGranuleContext testerContext(ctx->getBGBasePath());
fdb::native::FDBReadBlobGranuleContext granuleContext = createGranuleContext(&testerContext);
fdb::Result res = ctx->tx().readBlobGranules(
begin, end, 0 /* beginVersion */, -2 /* latest read version */, granuleContext);
@ -124,8 +71,10 @@ private:
} else if (err.code() != error_code_success) {
ctx->onError(err);
} else {
auto& [out_kv, out_count, out_more] = out;
auto resCopy = copyKeyValueArray(out);
auto& [resVector, out_more] = resCopy;
ASSERT(!out_more);
results.get()->assign(resVector.begin(), resVector.end());
if (!seenReadSuccess) {
info("BlobGranuleCorrectness::randomReadOp first success\n");
}
@ -178,7 +127,7 @@ private:
}
execTransaction(
[begin, end, results](auto ctx) {
fdb::Future f = ctx->tx().getBlobGranuleRanges(begin, end).eraseType();
fdb::Future f = ctx->tx().getBlobGranuleRanges(begin, end, 1000).eraseType();
ctx->continueAfter(
f,
[ctx, f, results]() {
@ -196,11 +145,25 @@ private:
for (int i = 0; i < results->size(); i++) {
// no empty or inverted ranges
if ((*results)[i].beginKey >= (*results)[i].endKey) {
error(fmt::format("Empty/inverted range [{0} - {1}) for getBlobGranuleRanges({2} - {3})",
fdb::toCharsRef((*results)[i].beginKey),
fdb::toCharsRef((*results)[i].endKey),
fdb::toCharsRef(begin),
fdb::toCharsRef(end)));
}
ASSERT((*results)[i].beginKey < (*results)[i].endKey);
}
for (int i = 1; i < results->size(); i++) {
// ranges contain entire requested key range
if ((*results)[i].beginKey != (*results)[i - 1].endKey) {
error(fmt::format("Non-contiguous range [{0} - {1}) for getBlobGranuleRanges({2} - {3})",
fdb::toCharsRef((*results)[i].beginKey),
fdb::toCharsRef((*results)[i].endKey),
fdb::toCharsRef(begin),
fdb::toCharsRef(end)));
}
ASSERT((*results)[i].beginKey == (*results)[i - 1].endKey);
}

View File

@ -0,0 +1,145 @@
/*
* TesterBlobGranuleErrorsWorkload.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "TesterApiWorkload.h"
#include "TesterBlobGranuleUtil.h"
#include "TesterUtil.h"
#include <memory>
#include <fmt/format.h>
namespace FdbApiTester {
class BlobGranuleErrorsWorkload : public ApiWorkload {
public:
BlobGranuleErrorsWorkload(const WorkloadConfig& config) : ApiWorkload(config) {}
private:
enum OpType {
OP_READ_NO_MATERIALIZE,
OP_READ_FILE_LOAD_ERROR,
OP_READ_TOO_OLD,
OP_CANCEL_RANGES,
OP_LAST = OP_CANCEL_RANGES
};
// Allow reads at the start to get blob_granule_transaction_too_old if BG data isn't initialized yet
// FIXME: should still guarantee a read succeeds eventually somehow
bool seenReadSuccess = false;
void doErrorOp(TTaskFct cont,
std::string basePathAddition,
bool doMaterialize,
int64_t readVersion,
fdb::native::fdb_error_t expectedError) {
fdb::Key begin = randomKeyName();
fdb::Key end = begin;
// [K - K) empty range will succeed read because there is trivially nothing to do, so don't do it
while (end == begin) {
end = randomKeyName();
}
if (begin > end) {
std::swap(begin, end);
}
execTransaction(
[this, begin, end, basePathAddition, doMaterialize, readVersion, expectedError](auto ctx) {
ctx->tx().setOption(FDB_TR_OPTION_READ_YOUR_WRITES_DISABLE);
TesterGranuleContext testerContext(ctx->getBGBasePath() + basePathAddition);
fdb::native::FDBReadBlobGranuleContext granuleContext = createGranuleContext(&testerContext);
granuleContext.debugNoMaterialize = !doMaterialize;
fdb::Result res =
ctx->tx().readBlobGranules(begin, end, 0 /* beginVersion */, readVersion, granuleContext);
auto out = fdb::Result::KeyValueRefArray{};
fdb::Error err = res.getKeyValueArrayNothrow(out);
if (err.code() == error_code_success) {
error(fmt::format("Operation succeeded in error test!"));
}
ASSERT(err.code() != error_code_success);
if (err.code() != error_code_blob_granule_transaction_too_old) {
seenReadSuccess = true;
}
if (err.code() != expectedError) {
info(fmt::format("incorrect error. Expected {}, Got {}", err.code(), expectedError));
if (err.code() == error_code_blob_granule_transaction_too_old) {
ASSERT(!seenReadSuccess);
ctx->done();
} else {
ctx->onError(err);
}
} else {
ctx->done();
}
},
[this, cont]() { schedule(cont); });
}
void randomOpReadNoMaterialize(TTaskFct cont) {
// ensure setting noMaterialize flag produces blob_granule_not_materialized
doErrorOp(cont, "", false, -2 /*latest read version */, error_code_blob_granule_not_materialized);
}
void randomOpReadFileLoadError(TTaskFct cont) {
// point to a file path that doesn't exist by adding an extra suffix
doErrorOp(cont, "extrapath/", true, -2 /*latest read version */, error_code_blob_granule_file_load_error);
}
void randomOpReadTooOld(TTaskFct cont) {
// read at a version (1) that should predate granule data
doErrorOp(cont, "", true, 1, error_code_blob_granule_transaction_too_old);
}
void randomCancelGetRangesOp(TTaskFct cont) {
fdb::Key begin = randomKeyName();
fdb::Key end = randomKeyName();
if (begin > end) {
std::swap(begin, end);
}
execTransaction(
[begin, end](auto ctx) {
fdb::Future f = ctx->tx().getBlobGranuleRanges(begin, end, 1000).eraseType();
ctx->done();
},
[this, cont]() { schedule(cont); });
}
void randomOperation(TTaskFct cont) override {
OpType txType = (OpType)Random::get().randomInt(0, OP_LAST);
switch (txType) {
case OP_READ_NO_MATERIALIZE:
randomOpReadNoMaterialize(cont);
break;
case OP_READ_FILE_LOAD_ERROR:
randomOpReadFileLoadError(cont);
break;
case OP_READ_TOO_OLD:
randomOpReadTooOld(cont);
break;
case OP_CANCEL_RANGES:
randomCancelGetRangesOp(cont);
break;
}
}
};
WorkloadFactory<BlobGranuleErrorsWorkload> BlobGranuleErrorsWorkloadFactory("BlobGranuleErrors");
} // namespace FdbApiTester

View File

@ -0,0 +1,80 @@
/*
* TesterBlobGranuleUtil.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "TesterBlobGranuleUtil.h"
#include "TesterUtil.h"
#include <fstream>
namespace FdbApiTester {
// FIXME: avoid duplicating this between files!
static int64_t granule_start_load(const char* filename,
int filenameLength,
int64_t offset,
int64_t length,
int64_t fullFileLength,
void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
int64_t loadId = ctx->nextId++;
uint8_t* buffer = new uint8_t[length];
std::ifstream fin(ctx->basePath + std::string(filename, filenameLength), std::ios::in | std::ios::binary);
if (fin.fail()) {
delete[] buffer;
buffer = nullptr;
} else {
fin.seekg(offset);
fin.read((char*)buffer, length);
}
ctx->loadsInProgress.insert({ loadId, buffer });
return loadId;
}
static uint8_t* granule_get_load(int64_t loadId, void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
return ctx->loadsInProgress.at(loadId);
}
static void granule_free_load(int64_t loadId, void* context) {
TesterGranuleContext* ctx = (TesterGranuleContext*)context;
auto it = ctx->loadsInProgress.find(loadId);
uint8_t* dataToFree = it->second;
delete[] dataToFree;
ctx->loadsInProgress.erase(it);
}
fdb::native::FDBReadBlobGranuleContext createGranuleContext(const TesterGranuleContext* testerContext) {
fdb::native::FDBReadBlobGranuleContext granuleContext;
granuleContext.userContext = (void*)testerContext;
granuleContext.debugNoMaterialize = false;
granuleContext.granuleParallelism = 1 + Random::get().randomInt(0, 3);
granuleContext.start_load_f = &granule_start_load;
granuleContext.get_load_f = &granule_get_load;
granuleContext.free_load_f = &granule_free_load;
return granuleContext;
}
} // namespace FdbApiTester

View File

@ -0,0 +1,49 @@
/*
* TesterBlobGranuleUtil.h
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#ifndef APITESTER_BLOBGRANULE_UTIL_H
#define APITESTER_BLOBGRANULE_UTIL_H
#include "TesterUtil.h"
#include "test/fdb_api.hpp"
#include <unordered_map>
namespace FdbApiTester {
class TesterGranuleContext {
public:
std::unordered_map<int64_t, uint8_t*> loadsInProgress;
std::string basePath;
int64_t nextId;
TesterGranuleContext(const std::string& basePath) : basePath(basePath), nextId(0) {}
~TesterGranuleContext() {
// this should now never happen with proper memory management
ASSERT(loadsInProgress.empty());
}
};
fdb::native::FDBReadBlobGranuleContext createGranuleContext(const TesterGranuleContext* testerContext);
} // namespace FdbApiTester
#endif

View File

@ -0,0 +1,65 @@
/*
* TesterExampleWorkload.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "TesterWorkload.h"
#include "TesterUtil.h"
namespace FdbApiTester {
class SetAndGetWorkload : public WorkloadBase {
public:
fdb::Key keyPrefix;
Random random;
SetAndGetWorkload(const WorkloadConfig& config) : WorkloadBase(config) {
keyPrefix = fdb::toBytesRef(fmt::format("{}/", workloadId));
}
void start() override { setAndGet(NO_OP_TASK); }
void setAndGet(TTaskFct cont) {
fdb::Key key = keyPrefix + random.randomStringLowerCase(10, 100);
fdb::Value value = random.randomStringLowerCase(10, 1000);
execTransaction(
[key, value](auto ctx) {
ctx->tx().set(key, value);
ctx->commit();
},
[this, key, value, cont]() {
execTransaction(
[this, key, value](auto ctx) {
auto future = ctx->tx().get(key, false);
ctx->continueAfter(future, [this, ctx, future, value]() {
std::optional<fdb::Value> res = copyValueRef(future.get());
if (res != value) {
error(fmt::format(
"expected: {} actual: {}", fdb::toCharsRef(value), fdb::toCharsRef(res.value())));
}
ctx->done();
});
},
cont);
});
}
};
WorkloadFactory<SetAndGetWorkload> SetAndGetWorkloadFactory("SetAndGet");
} // namespace FdbApiTester

View File

@ -38,6 +38,7 @@ public:
std::string logGroup;
std::string externalClientLibrary;
std::string externalClientDir;
std::string futureVersionClientLibrary;
std::string tmpDir;
bool disableLocalClient = false;
std::string testFile;

View File

@ -165,8 +165,11 @@ void WorkloadManager::add(std::shared_ptr<IWorkload> workload, TTaskFct cont) {
void WorkloadManager::run() {
std::vector<std::shared_ptr<IWorkload>> initialWorkloads;
for (auto iter : workloads) {
initialWorkloads.push_back(iter.second.ref);
{
std::unique_lock<std::mutex> lock(mutex);
for (auto iter : workloads) {
initialWorkloads.push_back(iter.second.ref);
}
}
for (auto iter : initialWorkloads) {
iter->init(this);
@ -324,4 +327,4 @@ std::unordered_map<std::string, IWorkloadFactory*>& IWorkloadFactory::factories(
return theFactories;
}
} // namespace FdbApiTester
} // namespace FdbApiTester

View File

@ -0,0 +1,22 @@
[[test]]
title = 'Blob Granule Errors Multi Threaded'
multiThreaded = true
buggify = true
minFdbThreads = 2
maxFdbThreads = 8
minDatabases = 2
maxDatabases = 8
minClientThreads = 2
maxClientThreads = 8
minClients = 2
maxClients = 8
[[test.workload]]
name = 'BlobGranuleErrors'
minKeyLength = 1
maxKeyLength = 64
minValueLength = 1
maxValueLength = 1000
maxKeysPerTransaction = 50
initialSize = 100
numRandomOperations = 100

View File

@ -0,0 +1,22 @@
[[test]]
title = 'Blob Granule Errors Multi Threaded'
multiThreaded = true
buggify = true
minFdbThreads = 2
maxFdbThreads = 8
minDatabases = 2
maxDatabases = 8
minClientThreads = 2
maxClientThreads = 8
minClients = 2
maxClients = 8
[[test.workload]]
name = 'BlobGranuleErrors'
minKeyLength = 1
maxKeyLength = 64
minValueLength = 1
maxValueLength = 1000
maxKeysPerTransaction = 50
initialSize = 100
numRandomOperations = 100

View File

@ -0,0 +1,15 @@
[[test]]
title = 'Blob Granule Errors Single Threaded'
minClients = 1
maxClients = 3
multiThreaded = false
[[test.workload]]
name = 'BlobGranuleErrors'
minKeyLength = 1
maxKeyLength = 64
minValueLength = 1
maxValueLength = 1000
maxKeysPerTransaction = 50
initialSize = 100
numRandomOperations = 100

View File

@ -46,6 +46,7 @@ enum TesterOptionId {
OPT_KNOB,
OPT_EXTERNAL_CLIENT_LIBRARY,
OPT_EXTERNAL_CLIENT_DIRECTORY,
OPT_FUTURE_VERSION_CLIENT_LIBRARY,
OPT_TMP_DIR,
OPT_DISABLE_LOCAL_CLIENT,
OPT_TEST_FILE,
@ -72,6 +73,7 @@ CSimpleOpt::SOption TesterOptionDefs[] = //
{ OPT_KNOB, "--knob-", SO_REQ_SEP },
{ OPT_EXTERNAL_CLIENT_LIBRARY, "--external-client-library", SO_REQ_SEP },
{ OPT_EXTERNAL_CLIENT_DIRECTORY, "--external-client-dir", SO_REQ_SEP },
{ OPT_FUTURE_VERSION_CLIENT_LIBRARY, "--future-version-client-library", SO_REQ_SEP },
{ OPT_TMP_DIR, "--tmp-dir", SO_REQ_SEP },
{ OPT_DISABLE_LOCAL_CLIENT, "--disable-local-client", SO_NONE },
{ OPT_TEST_FILE, "-f", SO_REQ_SEP },
@ -110,6 +112,8 @@ void printProgramUsage(const char* execName) {
" Path to the external client library.\n"
" --external-client-dir DIR\n"
" Directory containing external client libraries.\n"
" --future-version-client-library FILE\n"
" Path to a client library to be used with a future protocol version.\n"
" --tmp-dir DIR\n"
" Directory for temporary files of the client.\n"
" --disable-local-client DIR\n"
@ -204,6 +208,9 @@ bool processArg(TesterOptions& options, const CSimpleOpt& args) {
case OPT_EXTERNAL_CLIENT_DIRECTORY:
options.externalClientDir = args.OptionArg();
break;
case OPT_FUTURE_VERSION_CLIENT_LIBRARY:
options.futureVersionClientLibrary = args.OptionArg();
break;
case OPT_TMP_DIR:
options.tmpDir = args.OptionArg();
break;
@ -296,6 +303,11 @@ void applyNetworkOptions(TesterOptions& options) {
}
}
if (!options.futureVersionClientLibrary.empty()) {
fdb::network::setOption(FDBNetworkOption::FDB_NET_OPTION_FUTURE_VERSION_CLIENT_LIBRARY,
options.futureVersionClientLibrary);
}
if (options.testSpec.multiThreaded) {
fdb::network::setOption(FDBNetworkOption::FDB_NET_OPTION_CLIENT_THREADS_PER_VERSION, options.numFdbThreads);
}
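The same option can also be set directly through the C API before the network is started; a minimal sketch (not part of this commit, with the API version and library path as placeholders) follows:

```c
#define FDB_API_VERSION 720 /* assumed API version for this sketch */
#include <foundationdb/fdb_c.h>
#include <string.h>

/* Must run after fdb_select_api_version() and before fdb_setup_network(). */
fdb_error_t enable_future_version_client(const char* lib_path) {
    return fdb_network_set_option(FDB_NET_OPTION_FUTURE_VERSION_CLIENT_LIBRARY,
                                  (const uint8_t*)lib_path, (int)strlen(lib_path));
}
```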

View File

@ -0,0 +1,23 @@
[[test]]
title = 'Mixed Workload for Upgrade Tests with a Multi-Threaded Client'
multiThreaded = true
buggify = true
databasePerTransaction = false
minFdbThreads = 2
maxFdbThreads = 8
minDatabases = 2
maxDatabases = 8
minClientThreads = 2
maxClientThreads = 8
minClients = 2
maxClients = 8
[[test.workload]]
name = 'ApiBlobGranuleCorrectness'
minKeyLength = 1
maxKeyLength = 64
minValueLength = 1
maxValueLength = 1000
maxKeysPerTransaction = 50
initialSize = 100
runUntilStop = true

View File

@ -32,4 +32,14 @@ maxClients = 8
maxKeysPerTransaction = 50
initialSize = 100
runUntilStop = true
readExistingKeysRatio = 0.9
readExistingKeysRatio = 0.9
[[test.workload]]
name = 'AtomicOpsCorrectness'
initialSize = 0
runUntilStop = true
[[test.workload]]
name = 'WatchAndWait'
initialSize = 0
runUntilStop = true

View File

@ -30,4 +30,14 @@ maxClients = 8
maxKeysPerTransaction = 50
initialSize = 100
runUntilStop = true
readExistingKeysRatio = 0.9
readExistingKeysRatio = 0.9
[[test.workload]]
name = 'AtomicOpsCorrectness'
initialSize = 0
runUntilStop = true
[[test.workload]]
name = 'WatchAndWait'
initialSize = 0
runUntilStop = true

View File

@ -559,9 +559,9 @@ public:
reverse);
}
TypedFuture<future_var::KeyRangeRefArray> getBlobGranuleRanges(KeyRef begin, KeyRef end) {
TypedFuture<future_var::KeyRangeRefArray> getBlobGranuleRanges(KeyRef begin, KeyRef end, int rangeLimit) {
return native::fdb_transaction_get_blob_granule_ranges(
tr.get(), begin.data(), intSize(begin), end.data(), intSize(end));
tr.get(), begin.data(), intSize(begin), end.data(), intSize(end), rangeLimit);
}
Result readBlobGranules(KeyRef begin,

View File

@ -26,6 +26,9 @@
extern thread_local mako::Logger logr;
// FIXME: use the same implementation as the api tester! this implementation was from back when mako was written in C
// and is inferior.
namespace mako::blob_granules::local_file {
int64_t startLoad(const char* filename,

View File

@ -356,9 +356,15 @@ fdb_error_t Transaction::add_conflict_range(std::string_view begin_key,
tr_, (const uint8_t*)begin_key.data(), begin_key.size(), (const uint8_t*)end_key.data(), end_key.size(), type);
}
KeyRangeArrayFuture Transaction::get_blob_granule_ranges(std::string_view begin_key, std::string_view end_key) {
return KeyRangeArrayFuture(fdb_transaction_get_blob_granule_ranges(
tr_, (const uint8_t*)begin_key.data(), begin_key.size(), (const uint8_t*)end_key.data(), end_key.size()));
KeyRangeArrayFuture Transaction::get_blob_granule_ranges(std::string_view begin_key,
std::string_view end_key,
int rangeLimit) {
return KeyRangeArrayFuture(fdb_transaction_get_blob_granule_ranges(tr_,
(const uint8_t*)begin_key.data(),
begin_key.size(),
(const uint8_t*)end_key.data(),
end_key.size(),
rangeLimit));
}
KeyValueArrayResult Transaction::read_blob_granules(std::string_view begin_key,
std::string_view end_key,

View File

@ -348,7 +348,7 @@ public:
// Wrapper around fdb_transaction_add_conflict_range.
fdb_error_t add_conflict_range(std::string_view begin_key, std::string_view end_key, FDBConflictRangeType type);
KeyRangeArrayFuture get_blob_granule_ranges(std::string_view begin_key, std::string_view end_key);
KeyRangeArrayFuture get_blob_granule_ranges(std::string_view begin_key, std::string_view end_key, int rangeLimit);
KeyValueArrayResult read_blob_granules(std::string_view begin_key,
std::string_view end_key,
int64_t beginVersion,

View File

@ -2853,7 +2853,7 @@ TEST_CASE("Blob Granule Functions") {
// test ranges
while (1) {
fdb::KeyRangeArrayFuture f = tr.get_blob_granule_ranges(key("bg"), key("bh"));
fdb::KeyRangeArrayFuture f = tr.get_blob_granule_ranges(key("bg"), key("bh"), 1000);
fdb_error_t err = wait_future(f);
if (err) {
fdb::EmptyFuture f2 = tr.on_error(err);

View File

@ -239,6 +239,13 @@ func (o NetworkOptions) SetClientThreadsPerVersion(param int64) error {
return o.setOpt(65, int64ToBytes(param))
}
// Adds an external client library to be used with a future version protocol. This option can be used for testing purposes only!
//
// Parameter: path to client library
func (o NetworkOptions) SetFutureVersionClientLibrary(param string) error {
return o.setOpt(66, []byte(param))
}
// Disables logging of client statistics, such as sampled transaction activity.
func (o NetworkOptions) SetDisableClientStatisticsLogging() error {
return o.setOpt(70, nil)
@ -615,6 +622,13 @@ func (o TransactionOptions) SetUseGrvCache() error {
return o.setOpt(1101, nil)
}
// Attach given authorization token to the transaction such that subsequent tenant-aware requests are authorized
//
// Parameter: A JSON Web Token authorized to access data belonging to one or more tenants, indicated by 'tenants' claim of the token's payload.
func (o TransactionOptions) SetAuthorizationToken(param string) error {
return o.setOpt(2000, []byte(param))
}
type StreamingMode int
const (

View File

@ -34,9 +34,11 @@ set(JAVA_BINDING_SRCS
src/main/com/apple/foundationdb/FDBDatabase.java
src/main/com/apple/foundationdb/FDBTenant.java
src/main/com/apple/foundationdb/FDBTransaction.java
src/main/com/apple/foundationdb/FutureBool.java
src/main/com/apple/foundationdb/FutureInt64.java
src/main/com/apple/foundationdb/FutureKey.java
src/main/com/apple/foundationdb/FutureKeyArray.java
src/main/com/apple/foundationdb/FutureKeyRangeArray.java
src/main/com/apple/foundationdb/FutureResult.java
src/main/com/apple/foundationdb/FutureResults.java
src/main/com/apple/foundationdb/FutureMappedResults.java
@ -56,6 +58,7 @@ set(JAVA_BINDING_SRCS
src/main/com/apple/foundationdb/RangeQuery.java
src/main/com/apple/foundationdb/MappedRangeQuery.java
src/main/com/apple/foundationdb/KeyArrayResult.java
src/main/com/apple/foundationdb/KeyRangeArrayResult.java
src/main/com/apple/foundationdb/RangeResult.java
src/main/com/apple/foundationdb/MappedRangeResult.java
src/main/com/apple/foundationdb/RangeResultInfo.java

View File

@ -25,9 +25,11 @@
#include "com_apple_foundationdb_FDB.h"
#include "com_apple_foundationdb_FDBDatabase.h"
#include "com_apple_foundationdb_FDBTransaction.h"
#include "com_apple_foundationdb_FutureBool.h"
#include "com_apple_foundationdb_FutureInt64.h"
#include "com_apple_foundationdb_FutureKey.h"
#include "com_apple_foundationdb_FutureKeyArray.h"
#include "com_apple_foundationdb_FutureKeyRangeArray.h"
#include "com_apple_foundationdb_FutureResult.h"
#include "com_apple_foundationdb_FutureResults.h"
#include "com_apple_foundationdb_FutureStrings.h"
@ -55,7 +57,11 @@ static jclass mapped_range_result_class;
static jclass mapped_key_value_class;
static jclass string_class;
static jclass key_array_result_class;
static jclass keyrange_class;
static jclass keyrange_array_result_class;
static jmethodID key_array_result_init;
static jmethodID keyrange_init;
static jmethodID keyrange_array_result_init;
static jmethodID range_result_init;
static jmethodID mapped_range_result_init;
static jmethodID mapped_key_value_from_bytes;
@ -278,6 +284,23 @@ JNIEXPORT void JNICALL Java_com_apple_foundationdb_NativeFuture_Future_1releaseM
fdb_future_release_memory(var);
}
JNIEXPORT jboolean JNICALL Java_com_apple_foundationdb_FutureBool_FutureBool_1get(JNIEnv* jenv, jobject, jlong future) {
if (!future) {
throwParamNotNull(jenv);
return 0;
}
FDBFuture* f = (FDBFuture*)future;
fdb_bool_t value = false;
fdb_error_t err = fdb_future_get_bool(f, &value);
if (err) {
safeThrow(jenv, getThrowable(jenv, err));
return 0;
}
return (jboolean)value;
}
JNIEXPORT jlong JNICALL Java_com_apple_foundationdb_FutureInt64_FutureInt64_1get(JNIEnv* jenv, jobject, jlong future) {
if (!future) {
throwParamNotNull(jenv);
@ -407,6 +430,61 @@ JNIEXPORT jobject JNICALL Java_com_apple_foundationdb_FutureKeyArray_FutureKeyAr
return result;
}
JNIEXPORT jobject JNICALL Java_com_apple_foundationdb_FutureKeyRangeArray_FutureKeyRangeArray_1get(JNIEnv* jenv,
jobject,
jlong future) {
if (!future) {
throwParamNotNull(jenv);
return JNI_NULL;
}
FDBFuture* f = (FDBFuture*)future;
const FDBKeyRange* fdbKr;
int count;
fdb_error_t err = fdb_future_get_keyrange_array(f, &fdbKr, &count);
if (err) {
safeThrow(jenv, getThrowable(jenv, err));
return JNI_NULL;
}
jobjectArray kr_values = jenv->NewObjectArray(count, keyrange_class, NULL);
if (!kr_values) {
if (!jenv->ExceptionOccurred())
throwOutOfMem(jenv);
return JNI_NULL;
}
for (int i = 0; i < count; i++) {
jbyteArray beginArr = jenv->NewByteArray(fdbKr[i].begin_key_length);
if (!beginArr) {
if (!jenv->ExceptionOccurred())
throwOutOfMem(jenv);
return JNI_NULL;
}
jbyteArray endArr = jenv->NewByteArray(fdbKr[i].end_key_length);
if (!endArr) {
if (!jenv->ExceptionOccurred())
throwOutOfMem(jenv);
return JNI_NULL;
}
jenv->SetByteArrayRegion(beginArr, 0, fdbKr[i].begin_key_length, (const jbyte*)fdbKr[i].begin_key);
jenv->SetByteArrayRegion(endArr, 0, fdbKr[i].end_key_length, (const jbyte*)fdbKr[i].end_key);
jobject kr = jenv->NewObject(keyrange_class, keyrange_init, beginArr, endArr);
if (jenv->ExceptionOccurred())
return JNI_NULL;
jenv->SetObjectArrayElement(kr_values, i, kr);
if (jenv->ExceptionOccurred())
return JNI_NULL;
}
jobject krarr = jenv->NewObject(keyrange_array_result_class, keyrange_array_result_init, kr_values);
if (jenv->ExceptionOccurred())
return JNI_NULL;
return krarr;
}
// SOMEDAY: explore doing this more efficiently with Direct ByteBuffers
JNIEXPORT jobject JNICALL Java_com_apple_foundationdb_FutureResults_FutureResults_1get(JNIEnv* jenv,
jobject,
@ -830,6 +908,142 @@ Java_com_apple_foundationdb_FDBDatabase_Database_1waitPurgeGranulesComplete(JNIE
return (jlong)f;
}
JNIEXPORT jlong JNICALL Java_com_apple_foundationdb_FDBDatabase_Database_1blobbifyRange(JNIEnv* jenv,
jobject,
jlong dbPtr,
jbyteArray beginKeyBytes,
jbyteArray endKeyBytes) {
if (!dbPtr || !beginKeyBytes || !endKeyBytes) {
throwParamNotNull(jenv);
return 0;
}
FDBDatabase* database = (FDBDatabase*)dbPtr;
uint8_t* beginKeyArr = (uint8_t*)jenv->GetByteArrayElements(beginKeyBytes, JNI_NULL);
if (!beginKeyArr) {
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
uint8_t* endKeyArr = (uint8_t*)jenv->GetByteArrayElements(endKeyBytes, JNI_NULL);
if (!endKeyArr) {
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)beginKeyArr, JNI_ABORT);
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
FDBFuture* f = fdb_database_blobbify_range(
database, beginKeyArr, jenv->GetArrayLength(beginKeyBytes), endKeyArr, jenv->GetArrayLength(endKeyBytes));
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)beginKeyArr, JNI_ABORT);
jenv->ReleaseByteArrayElements(endKeyBytes, (jbyte*)endKeyArr, JNI_ABORT);
return (jlong)f;
}
JNIEXPORT jlong JNICALL Java_com_apple_foundationdb_FDBDatabase_Database_1unblobbifyRange(JNIEnv* jenv,
jobject,
jlong dbPtr,
jbyteArray beginKeyBytes,
jbyteArray endKeyBytes) {
if (!dbPtr || !beginKeyBytes || !endKeyBytes) {
throwParamNotNull(jenv);
return 0;
}
FDBDatabase* database = (FDBDatabase*)dbPtr;
uint8_t* beginKeyArr = (uint8_t*)jenv->GetByteArrayElements(beginKeyBytes, JNI_NULL);
if (!beginKeyArr) {
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
uint8_t* endKeyArr = (uint8_t*)jenv->GetByteArrayElements(endKeyBytes, JNI_NULL);
if (!endKeyArr) {
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)beginKeyArr, JNI_ABORT);
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
FDBFuture* f = fdb_database_unblobbify_range(
database, beginKeyArr, jenv->GetArrayLength(beginKeyBytes), endKeyArr, jenv->GetArrayLength(endKeyBytes));
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)beginKeyArr, JNI_ABORT);
jenv->ReleaseByteArrayElements(endKeyBytes, (jbyte*)endKeyArr, JNI_ABORT);
return (jlong)f;
}
JNIEXPORT jlong JNICALL Java_com_apple_foundationdb_FDBDatabase_Database_1listBlobbifiedRanges(JNIEnv* jenv,
jobject,
jlong dbPtr,
jbyteArray beginKeyBytes,
jbyteArray endKeyBytes,
jint rangeLimit) {
if (!dbPtr || !beginKeyBytes || !endKeyBytes) {
throwParamNotNull(jenv);
return 0;
}
FDBDatabase* tr = (FDBDatabase*)dbPtr;
uint8_t* startKey = (uint8_t*)jenv->GetByteArrayElements(beginKeyBytes, JNI_NULL);
if (!startKey) {
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
uint8_t* endKey = (uint8_t*)jenv->GetByteArrayElements(endKeyBytes, JNI_NULL);
if (!endKey) {
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
FDBFuture* f = fdb_database_list_blobbified_ranges(
tr, startKey, jenv->GetArrayLength(beginKeyBytes), endKey, jenv->GetArrayLength(endKeyBytes), rangeLimit);
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
jenv->ReleaseByteArrayElements(endKeyBytes, (jbyte*)endKey, JNI_ABORT);
return (jlong)f;
}
JNIEXPORT jlong JNICALL Java_com_apple_foundationdb_FDBDatabase_Database_1verifyBlobRange(JNIEnv* jenv,
jobject,
jlong dbPtr,
jbyteArray beginKeyBytes,
jbyteArray endKeyBytes,
jlong version) {
if (!dbPtr || !beginKeyBytes || !endKeyBytes) {
throwParamNotNull(jenv);
return 0;
}
FDBDatabase* tr = (FDBDatabase*)dbPtr;
uint8_t* startKey = (uint8_t*)jenv->GetByteArrayElements(beginKeyBytes, JNI_NULL);
if (!startKey) {
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
uint8_t* endKey = (uint8_t*)jenv->GetByteArrayElements(endKeyBytes, JNI_NULL);
if (!endKey) {
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
FDBFuture* f = fdb_database_verify_blob_range(
tr, startKey, jenv->GetArrayLength(beginKeyBytes), endKey, jenv->GetArrayLength(endKeyBytes), version);
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
jenv->ReleaseByteArrayElements(endKeyBytes, (jbyte*)endKey, JNI_ABORT);
return (jlong)f;
}
JNIEXPORT jboolean JNICALL Java_com_apple_foundationdb_FDB_Error_1predicate(JNIEnv* jenv,
jobject,
jint predicate,
@ -1307,6 +1521,41 @@ Java_com_apple_foundationdb_FDBTransaction_Transaction_1getRangeSplitPoints(JNIE
return (jlong)f;
}
JNIEXPORT jlong JNICALL
Java_com_apple_foundationdb_FDBTransaction_Transaction_1getBlobGranuleRanges(JNIEnv* jenv,
jobject,
jlong tPtr,
jbyteArray beginKeyBytes,
jbyteArray endKeyBytes,
jint rowLimit) {
if (!tPtr || !beginKeyBytes || !endKeyBytes || !rowLimit) {
throwParamNotNull(jenv);
return 0;
}
FDBTransaction* tr = (FDBTransaction*)tPtr;
uint8_t* startKey = (uint8_t*)jenv->GetByteArrayElements(beginKeyBytes, JNI_NULL);
if (!startKey) {
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
uint8_t* endKey = (uint8_t*)jenv->GetByteArrayElements(endKeyBytes, JNI_NULL);
if (!endKey) {
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
if (!jenv->ExceptionOccurred())
throwRuntimeEx(jenv, "Error getting handle to native resources");
return 0;
}
FDBFuture* f = fdb_transaction_get_blob_granule_ranges(
tr, startKey, jenv->GetArrayLength(beginKeyBytes), endKey, jenv->GetArrayLength(endKeyBytes), rowLimit);
jenv->ReleaseByteArrayElements(beginKeyBytes, (jbyte*)startKey, JNI_ABORT);
jenv->ReleaseByteArrayElements(endKeyBytes, (jbyte*)endKey, JNI_ABORT);
return (jlong)f;
}
JNIEXPORT void JNICALL Java_com_apple_foundationdb_FDBTransaction_Transaction_1set(JNIEnv* jenv,
jobject,
jlong tPtr,
@ -1746,6 +1995,15 @@ jint JNI_OnLoad(JavaVM* vm, void* reserved) {
key_array_result_init = env->GetMethodID(local_key_array_result_class, "<init>", "([B[I)V");
key_array_result_class = (jclass)(env)->NewGlobalRef(local_key_array_result_class);
jclass local_keyrange_class = env->FindClass("com/apple/foundationdb/Range");
keyrange_init = env->GetMethodID(local_keyrange_class, "<init>", "([B[B)V");
keyrange_class = (jclass)(env)->NewGlobalRef(local_keyrange_class);
jclass local_keyrange_array_result_class = env->FindClass("com/apple/foundationdb/KeyRangeArrayResult");
keyrange_array_result_init =
env->GetMethodID(local_keyrange_array_result_class, "<init>", "([Lcom/apple/foundationdb/Range;)V");
keyrange_array_result_class = (jclass)(env)->NewGlobalRef(local_keyrange_array_result_class);
jclass local_range_result_summary_class = env->FindClass("com/apple/foundationdb/RangeResultSummary");
range_result_summary_init = env->GetMethodID(local_range_result_summary_class, "<init>", "([BIZ)V");
range_result_summary_class = (jclass)(env)->NewGlobalRef(local_range_result_summary_class);
@ -1770,6 +2028,12 @@ void JNI_OnUnload(JavaVM* vm, void* reserved) {
if (range_result_class != JNI_NULL) {
env->DeleteGlobalRef(range_result_class);
}
if (keyrange_array_result_class != JNI_NULL) {
env->DeleteGlobalRef(keyrange_array_result_class);
}
if (keyrange_class != JNI_NULL) {
env->DeleteGlobalRef(keyrange_class);
}
if (mapped_range_result_class != JNI_NULL) {
env->DeleteGlobalRef(mapped_range_result_class);
}

View File

@ -161,6 +161,20 @@ public interface Database extends AutoCloseable, TransactionContext {
*/
double getMainThreadBusyness();
/**
* Runs {@link #purgeBlobGranules(Function)} on the default executor.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param purgeVersion version to purge at
* @param force if true, delete all data; otherwise keep data >= purgeVersion
*
* @return the key to watch for purge complete
*/
default CompletableFuture<byte[]> purgeBlobGranules(byte[] beginKey, byte[] endKey, long purgeVersion, boolean force) {
return purgeBlobGranules(beginKey, endKey, purgeVersion, force, getExecutor());
}
/**
* Queues a purge of blob granules for the specified key range, at the specified version.
*
@ -168,17 +182,126 @@ public interface Database extends AutoCloseable, TransactionContext {
* @param endKey end of the key range
* @param purgeVersion version to purge at
* @param force if true, delete all data; otherwise keep data >= purgeVersion
* @param e the {@link Executor} to use for asynchronous callbacks
* @return the key to watch for purge complete
*/
CompletableFuture<byte[]> purgeBlobGranules(byte[] beginKey, byte[] endKey, long purgeVersion, boolean force, Executor e);
/**
* Wait for a previous call to purgeBlobGranules to complete
* Runs {@link #waitPurgeGranulesComplete(Function)} on the default executor.
*
* @param purgeKey key to watch
*/
default CompletableFuture<Void> waitPurgeGranulesComplete(byte[] purgeKey) {
return waitPurgeGranulesComplete(purgeKey, getExecutor());
}
/**
* Wait for a previous call to purgeBlobGranules to complete.
*
* @param purgeKey key to watch
* @param e the {@link Executor} to use for asynchronous callbacks
*/
CompletableFuture<Void> waitPurgeGranulesComplete(byte[] purgeKey, Executor e);
/**
* Runs {@link #blobbifyRange(Function)} on the default executor.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @return if the recording of the range was successful
*/
default CompletableFuture<Boolean> blobbifyRange(byte[] beginKey, byte[] endKey) {
return blobbifyRange(beginKey, endKey, getExecutor());
}
/**
* Sets a range to be blobbified in the database. Must be a completely unblobbified range.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param e the {@link Executor} to use for asynchronous callbacks
* @return if the recording of the range was successful
*/
CompletableFuture<Boolean> blobbifyRange(byte[] beginKey, byte[] endKey, Executor e);
/**
* Runs {@link #unblobbifyRange(Function)} on the default executor.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @return if the recording of the range was successful
*/
default CompletableFuture<Boolean> unblobbifyRange(byte[] beginKey, byte[] endKey) {
return unblobbifyRange(beginKey, endKey, getExecutor());
}
/**
* Unsets a blobbified range in the database. The range must be aligned to known blob ranges.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param e the {@link Executor} to use for asynchronous callbacks
* @return if the recording of the range was successful
*/
CompletableFuture<Boolean> unblobbifyRange(byte[] beginKey, byte[] endKey, Executor e);
/**
* Runs {@link #listBlobbifiedRanges(Function)} on the default executor.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param rangeLimit batch size
* @return a future with the list of blobbified ranges: [lastLessThan(beginKey), firstGreaterThanOrEqual(endKey)]
*/
default CompletableFuture<KeyRangeArrayResult> listBlobbifiedRanges(byte[] beginKey, byte[] endKey, int rangeLimit) {
return listBlobbifiedRanges(beginKey, endKey, rangeLimit, getExecutor());
}
/**
* Lists blobbified ranges in the database. There may be more if result.size() == rangeLimit.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param rangeLimit batch size
* @param e the {@link Executor} to use for asynchronous callbacks
* @return a future with the list of blobbified ranges: [lastLessThan(beginKey), firstGreaterThanOrEqual(endKey)]
*/
CompletableFuture<KeyRangeArrayResult> listBlobbifiedRanges(byte[] beginKey, byte[] endKey, int rangeLimit, Executor e);
/**
* Runs {@link #verifyBlobRange(byte[], byte[], long, Executor)} on the default executor.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param version version to read at
*
* @return a future with the version of the last blob granule.
*/
default CompletableFuture<Long> verifyBlobRange(byte[] beginKey, byte[] endKey, long version) {
return verifyBlobRange(beginKey, endKey, version, getExecutor());
}
/**
* Checks if a blob range is blobbified.
*
* @param beginKey start of the key range
* @param endKey end of the key range
* @param version version to read at
* @param e the {@link Executor} to use for asynchronous callbacks
*
* @return a future with the version of the last blob granule.
*/
CompletableFuture<Long> verifyBlobRange(byte[] beginKey, byte[] endKey, long version, Executor e);
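Taken together, the new range-management methods might be exercised as follows. This is an illustrative sketch only, not part of this diff: the class name, the API version passed to selectAPIVersion, and the key range are assumptions.

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeyRangeArrayResult;
import com.apple.foundationdb.ReadTransaction;
import java.nio.charset.StandardCharsets;

final class BlobRangeSketch {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(720);                        // version number is an assumption
        try (Database db = fdb.open()) {
            byte[] begin = "bg/".getBytes(StandardCharsets.UTF_8);  // illustrative key range
            byte[] end = "bg0".getBytes(StandardCharsets.UTF_8);

            // Register the range for blob granules; the future completes to true on success.
            boolean recorded = db.blobbifyRange(begin, end).join();

            // Check how far the range can be read as blob granules.
            long readVersion = db.readAsync(ReadTransaction::getReadVersion).join();
            long verifiedVersion = db.verifyBlobRange(begin, end, readVersion).join();

            // Enumerate the blobbified ranges overlapping [begin, end), at most 100 per call.
            KeyRangeArrayResult ranges = db.listBlobbifiedRanges(begin, end, 100).join();
            System.out.printf("recorded=%b verified=%d ranges=%d%n",
                    recorded, verifiedVersion, ranges.getKeyRanges().size());
        }
    }
}

The blocking join() calls keep the sketch short; real callers would more likely compose the returned futures asynchronously.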
/**
* Runs a read-only transactional function against this {@code Database} with retry logic.
* {@link Function#apply(Object) apply(ReadTransaction)} will be called on the

View File

@ -201,20 +201,60 @@ class FDBDatabase extends NativeObjectWrapper implements Database, OptionConsume
}
@Override
public CompletableFuture<byte[]> purgeBlobGranules(byte[] beginKey, byte[] endKey, long purgeVersion, boolean force, Executor executor) {
public CompletableFuture<byte[]> purgeBlobGranules(byte[] beginKey, byte[] endKey, long purgeVersion, boolean force, Executor e) {
pointerReadLock.lock();
try {
return new FutureKey(Database_purgeBlobGranules(getPtr(), beginKey, endKey, purgeVersion, force), executor, eventKeeper);
return new FutureKey(Database_purgeBlobGranules(getPtr(), beginKey, endKey, purgeVersion, force), e, eventKeeper);
} finally {
pointerReadLock.unlock();
}
}
@Override
public CompletableFuture<Void> waitPurgeGranulesComplete(byte[] purgeKey, Executor executor) {
public CompletableFuture<Void> waitPurgeGranulesComplete(byte[] purgeKey, Executor e) {
pointerReadLock.lock();
try {
return new FutureVoid(Database_waitPurgeGranulesComplete(getPtr(), purgeKey), executor);
return new FutureVoid(Database_waitPurgeGranulesComplete(getPtr(), purgeKey), e);
} finally {
pointerReadLock.unlock();
}
}
@Override
public CompletableFuture<Boolean> blobbifyRange(byte[] beginKey, byte[] endKey, Executor e) {
pointerReadLock.lock();
try {
return new FutureBool(Database_blobbifyRange(getPtr(), beginKey, endKey), e);
} finally {
pointerReadLock.unlock();
}
}
@Override
public CompletableFuture<Boolean> unblobbifyRange(byte[] beginKey, byte[] endKey, Executor e) {
pointerReadLock.lock();
try {
return new FutureBool(Database_unblobbifyRange(getPtr(), beginKey, endKey), e);
} finally {
pointerReadLock.unlock();
}
}
@Override
public CompletableFuture<KeyRangeArrayResult> listBlobbifiedRanges(byte[] beginKey, byte[] endKey, int rangeLimit, Executor e) {
pointerReadLock.lock();
try {
return new FutureKeyRangeArray(Database_listBlobbifiedRanges(getPtr(), beginKey, endKey, rangeLimit), e);
} finally {
pointerReadLock.unlock();
}
}
@Override
public CompletableFuture<Long> verifyBlobRange(byte[] beginKey, byte[] endKey, long version, Executor e) {
pointerReadLock.lock();
try {
return new FutureInt64(Database_verifyBlobRange(getPtr(), beginKey, endKey, version), e);
} finally {
pointerReadLock.unlock();
}
@ -237,4 +277,8 @@ class FDBDatabase extends NativeObjectWrapper implements Database, OptionConsume
private native double Database_getMainThreadBusyness(long cPtr);
private native long Database_purgeBlobGranules(long cPtr, byte[] beginKey, byte[] endKey, long purgeVersion, boolean force);
private native long Database_waitPurgeGranulesComplete(long cPtr, byte[] purgeKey);
private native long Database_blobbifyRange(long cPtr, byte[] beginKey, byte[] endKey);
private native long Database_unblobbifyRange(long cPtr, byte[] beginKey, byte[] endKey);
private native long Database_listBlobbifiedRanges(long cPtr, byte[] beginKey, byte[] endKey, int rangeLimit);
private native long Database_verifyBlobRange(long cPtr, byte[] beginKey, byte[] endKey, long version);
}

View File

@ -97,6 +97,11 @@ class FDBTransaction extends NativeObjectWrapper implements Transaction, OptionC
return FDBTransaction.this.getRangeSplitPoints(range, chunkSize);
}
@Override
public CompletableFuture<KeyRangeArrayResult> getBlobGranuleRanges(byte[] begin, byte[] end, int rowLimit) {
return FDBTransaction.this.getBlobGranuleRanges(begin, end, rowLimit);
}
@Override
public AsyncIterable<MappedKeyValue> getMappedRange(KeySelector begin, KeySelector end, byte[] mapper,
int limit, int matchIndex, boolean reverse,
@ -352,6 +357,16 @@ class FDBTransaction extends NativeObjectWrapper implements Transaction, OptionC
return this.getRangeSplitPoints(range.begin, range.end, chunkSize);
}
@Override
public CompletableFuture<KeyRangeArrayResult> getBlobGranuleRanges(byte[] begin, byte[] end, int rowLimit) {
pointerReadLock.lock();
try {
return new FutureKeyRangeArray(Transaction_getBlobGranuleRanges(getPtr(), begin, end, rowLimit), executor);
} finally {
pointerReadLock.unlock();
}
}
@Override
public AsyncIterable<MappedKeyValue> getMappedRange(KeySelector begin, KeySelector end, byte[] mapper, int limit,
int matchIndex, boolean reverse, StreamingMode mode) {
@ -842,4 +857,5 @@ class FDBTransaction extends NativeObjectWrapper implements Transaction, OptionC
private native long Transaction_getKeyLocations(long cPtr, byte[] key);
private native long Transaction_getEstimatedRangeSizeBytes(long cPtr, byte[] keyBegin, byte[] keyEnd);
private native long Transaction_getRangeSplitPoints(long cPtr, byte[] keyBegin, byte[] keyEnd, long chunkSize);
private native long Transaction_getBlobGranuleRanges(long cPtr, byte[] keyBegin, byte[] keyEnd, int rowLimit);
}

View File

@ -0,0 +1,37 @@
/*
* FutureBool.java
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2019 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.apple.foundationdb;
import java.util.concurrent.Executor;
class FutureBool extends NativeFuture<Boolean> {
FutureBool(long cPtr, Executor executor) {
super(cPtr);
registerMarshalCallback(executor);
}
@Override
protected Boolean getIfDone_internal(long cPtr) throws FDBException {
return FutureBool_get(cPtr);
}
private native boolean FutureBool_get(long cPtr) throws FDBException;
}

View File

@ -0,0 +1,37 @@
/*
* FutureKeyRangeArray.java
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2019 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.apple.foundationdb;
import java.util.concurrent.Executor;
class FutureKeyRangeArray extends NativeFuture<KeyRangeArrayResult> {
FutureKeyRangeArray(long cPtr, Executor executor) {
super(cPtr);
registerMarshalCallback(executor);
}
@Override
protected KeyRangeArrayResult getIfDone_internal(long cPtr) throws FDBException {
return FutureKeyRangeArray_get(cPtr);
}
private native KeyRangeArrayResult FutureKeyRangeArray_get(long cPtr) throws FDBException;
}

View File

@ -0,0 +1,36 @@
/*
* KeyRangeArrayResult.java
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2020 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.apple.foundationdb;
import java.util.Arrays;
import java.util.List;
public class KeyRangeArrayResult {
final List<Range> keyRanges;
public KeyRangeArrayResult(Range[] keyRangeArr) {
this.keyRanges = Arrays.asList(keyRangeArr);
}
public List<Range> getKeyRanges() {
return keyRanges;
}
}

View File

@ -513,6 +513,17 @@ public interface ReadTransaction extends ReadTransactionContext {
*/
CompletableFuture<KeyArrayResult> getRangeSplitPoints(Range range, long chunkSize);
/**
* Gets the blob granule ranges for a given region.
* Results are returned in batches; to fetch the remainder, call again with begin advanced past the last returned range.
*
* @param begin beginning of the range (inclusive)
* @param end end of the range (exclusive)
* @param rowLimit maximum number of ranges returned in one batch
* @return a future with a batch of blob granule ranges in the given region; more may remain if rowLimit ranges are returned
*/
CompletableFuture<KeyRangeArrayResult> getBlobGranuleRanges(byte[] begin, byte[] end, int rowLimit);
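Because a single call may return only part of the region, callers are expected to page through the results as described above. A minimal paging sketch (editorial illustration, not part of this diff; the helper class name, the Database handle, and the row limit are assumptions):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.Range;
import java.util.ArrayList;
import java.util.List;

final class GranuleRangePager {
    // Collects every blob granule range in [begin, end) by fetching rowLimit-sized batches
    // (rowLimit is assumed to be positive).
    static List<Range> allGranuleRanges(Database db, byte[] begin, byte[] end, int rowLimit) {
        List<Range> all = new ArrayList<>();
        byte[] cursor = begin;
        while (true) {
            final byte[] from = cursor;
            List<Range> batch =
                    db.read(tr -> tr.getBlobGranuleRanges(from, end, rowLimit).join()).getKeyRanges();
            all.addAll(batch);
            if (batch.size() < rowLimit) {
                return all;                            // last (possibly partial) batch
            }
            cursor = batch.get(batch.size() - 1).end;  // resume after the last returned range
        }
    }
}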
/**
* Returns a set of options that can be set on a {@code Transaction}

View File

@ -29,6 +29,7 @@ import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
@ -64,7 +65,7 @@ abstract class Context implements Runnable, AutoCloseable {
private List<Thread> children = new LinkedList<>();
private static Map<String, TransactionState> transactionMap = new HashMap<>();
private static Map<Transaction, AtomicInteger> transactionRefCounts = new HashMap<>();
private static Map<byte[], Tenant> tenantMap = new HashMap<>();
private static Map<byte[], Tenant> tenantMap = new ConcurrentHashMap<>();
Context(Database db, byte[] prefix) {
this.db = db;

View File

@ -66,6 +66,9 @@ def test_size_limit_option(db):
except fdb.FDBError as e:
assert(e.code == 2101) # Transaction exceeds byte limit (2101)
# Reset the size limit for future tests
db.options.set_transaction_size_limit(10000000)
@fdb.transactional
def test_get_approximate_size(tr):
tr[b'key1'] = b'value1'

View File

@ -142,7 +142,7 @@ function(add_fdb_test)
${VALGRIND_OPTION}
${ADD_FDB_TEST_TEST_FILES}
WORKING_DIRECTORY ${PROJECT_BINARY_DIR})
set_tests_properties("${test_name}" PROPERTIES ENVIRONMENT UBSAN_OPTIONS=print_stacktrace=1:halt_on_error=1)
set_tests_properties("${test_name}" PROPERTIES ENVIRONMENT "${SANITIZER_OPTIONS}")
get_filename_component(test_dir_full ${first_file} DIRECTORY)
if(NOT ${test_dir_full} STREQUAL "")
get_filename_component(test_dir ${test_dir_full} NAME)
@ -172,8 +172,7 @@ function(stage_correctness_package)
file(MAKE_DIRECTORY ${STAGE_OUT_DIR}/bin)
string(LENGTH "${CMAKE_SOURCE_DIR}/tests/" base_length)
foreach(test IN LISTS TEST_NAMES)
if(("${TEST_TYPE_${test}}" STREQUAL "simulation") AND
(${test} MATCHES ${TEST_PACKAGE_INCLUDE}) AND
if((${test} MATCHES ${TEST_PACKAGE_INCLUDE}) AND
(NOT ${test} MATCHES ${TEST_PACKAGE_EXCLUDE}))
foreach(file IN LISTS TEST_FILES_${test})
string(SUBSTRING ${file} ${base_length} -1 rel_out_file)
@ -199,16 +198,17 @@ function(stage_correctness_package)
set(src_dir "${src_dir}/")
string(SUBSTRING ${src_dir} ${dir_len} -1 dest_dir)
string(SUBSTRING ${file} ${dir_len} -1 rel_out_file)
set(out_file ${STAGE_OUT_DIR}/${rel_out_file})
set(out_file ${STAGE_OUT_DIR}/${rel_out_file})
list(APPEND external_files ${out_file})
add_custom_command(
add_custom_command(
OUTPUT ${out_file}
DEPENDS ${file}
COMMAND ${CMAKE_COMMAND} -E copy ${file} ${out_file}
COMMENT "Copying ${STAGE_CONTEXT} external file ${file}"
)
DEPENDS ${file}
COMMAND ${CMAKE_COMMAND} -E copy ${file} ${out_file}
COMMENT "Copying ${STAGE_CONTEXT} external file ${file}"
)
endforeach()
endforeach()
list(APPEND package_files ${STAGE_OUT_DIR}/bin/fdbserver
${STAGE_OUT_DIR}/bin/coverage.fdbserver.xml
${STAGE_OUT_DIR}/bin/coverage.fdbclient.xml
@ -218,6 +218,7 @@ function(stage_correctness_package)
${STAGE_OUT_DIR}/bin/TraceLogHelper.dll
${STAGE_OUT_DIR}/CMakeCache.txt
)
add_custom_command(
OUTPUT ${package_files}
DEPENDS ${CMAKE_BINARY_DIR}/CMakeCache.txt
@ -239,6 +240,20 @@ function(stage_correctness_package)
${STAGE_OUT_DIR}/bin
COMMENT "Copying files for ${STAGE_CONTEXT} package"
)
set(test_harness_dir "${CMAKE_SOURCE_DIR}/contrib/TestHarness2")
file(GLOB_RECURSE test_harness2_files RELATIVE "${test_harness_dir}" CONFIGURE_DEPENDS "${test_harness_dir}/*.py")
foreach(file IN LISTS test_harness2_files)
set(src_file "${test_harness_dir}/${file}")
set(out_file "${STAGE_OUT_DIR}/${file}")
get_filename_component(dir "${out_file}" DIRECTORY)
file(MAKE_DIRECTORY "${dir}")
add_custom_command(OUTPUT ${out_file}
COMMAND ${CMAKE_COMMAND} -E copy "${src_file}" "${out_file}"
DEPENDS "${src_file}")
list(APPEND package_files "${out_file}")
endforeach()
list(APPEND package_files ${test_files} ${external_files})
if(STAGE_OUT_FILES)
set(${STAGE_OUT_FILES} ${package_files} PARENT_SCOPE)
@ -404,7 +419,7 @@ endfunction()
# Creates a single cluster before running the specified command (usually a ctest test)
function(add_fdbclient_test)
set(options DISABLED ENABLED DISABLE_LOG_DUMP API_TEST_BLOB_GRANULES_ENABLED TLS_ENABLED)
set(options DISABLED ENABLED DISABLE_TENANTS DISABLE_LOG_DUMP API_TEST_BLOB_GRANULES_ENABLED TLS_ENABLED)
set(oneValueArgs NAME PROCESS_NUMBER TEST_TIMEOUT WORKING_DIRECTORY)
set(multiValueArgs COMMAND)
cmake_parse_arguments(T "${options}" "${oneValueArgs}" "${multiValueArgs}" "${ARGN}")
@ -431,6 +446,9 @@ function(add_fdbclient_test)
if(T_DISABLE_LOG_DUMP)
list(APPEND TMP_CLUSTER_CMD --disable-log-dump)
endif()
if(T_DISABLE_TENANTS)
list(APPEND TMP_CLUSTER_CMD --disable-tenants)
endif()
if(T_API_TEST_BLOB_GRANULES_ENABLED)
list(APPEND TMP_CLUSTER_CMD --blob-granules-enabled)
endif()
@ -447,9 +465,13 @@ function(add_fdbclient_test)
set_tests_properties("${T_NAME}" PROPERTIES TIMEOUT ${T_TEST_TIMEOUT})
else()
# default timeout
set_tests_properties("${T_NAME}" PROPERTIES TIMEOUT 300)
if(USE_SANITIZER)
set_tests_properties("${T_NAME}" PROPERTIES TIMEOUT 1200)
else()
set_tests_properties("${T_NAME}" PROPERTIES TIMEOUT 300)
endif()
endif()
set_tests_properties("${T_NAME}" PROPERTIES ENVIRONMENT UBSAN_OPTIONS=print_stacktrace=1:halt_on_error=1)
set_tests_properties("${T_NAME}" PROPERTIES ENVIRONMENT "${SANITIZER_OPTIONS}")
endfunction()
# Creates a cluster file for a nonexistent cluster before running the specified command
@ -483,7 +505,7 @@ function(add_unavailable_fdbclient_test)
# default timeout
set_tests_properties("${T_NAME}" PROPERTIES TIMEOUT 60)
endif()
set_tests_properties("${T_NAME}" PROPERTIES ENVIRONMENT UBSAN_OPTIONS=print_stacktrace=1:halt_on_error=1)
set_tests_properties("${T_NAME}" PROPERTIES ENVIRONMENT "${SANITIZER_OPTIONS}")
endfunction()
# Creates 3 distinct clusters before running the specified command.

View File

@ -69,6 +69,7 @@ if(WIN32)
add_definitions(-DWIN32_LEAN_AND_MEAN)
add_definitions(-D_ITERATOR_DEBUG_LEVEL=0)
add_definitions(-DNOGDI) # WinGDI.h defines macro ERROR
add_definitions(-D_USE_MATH_DEFINES) # Math constants
endif()
if (USE_CCACHE)
@ -191,6 +192,7 @@ else()
endif()
if(USE_GCOV)
add_compile_options(--coverage)
add_link_options(--coverage)
endif()
@ -199,6 +201,8 @@ else()
-fsanitize=undefined
# TODO(atn34) Re-enable -fsanitize=alignment once https://github.com/apple/foundationdb/issues/1434 is resolved
-fno-sanitize=alignment
# https://github.com/apple/foundationdb/issues/7955
-fno-sanitize=function
-DBOOST_USE_UCONTEXT)
list(APPEND SANITIZER_LINK_OPTIONS -fsanitize=undefined)
endif()

View File

@ -11,7 +11,7 @@ endif()
include(ExternalProject)
ExternalProject_Add(awssdk_project
GIT_REPOSITORY https://github.com/aws/aws-sdk-cpp.git
GIT_TAG 2af3ce543c322cb259471b3b090829464f825972 # v1.9.200
GIT_TAG e4b4b310d8631bc7e9a797b6ac03a73c6f210bf6 # v1.9.331
SOURCE_DIR "${CMAKE_CURRENT_BINARY_DIR}/awssdk-src"
BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/awssdk-build"
GIT_CONFIG advice.detachedHead=false
@ -35,6 +35,7 @@ ExternalProject_Add(awssdk_project
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-event-stream.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-http.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-mqtt.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-sdkutils.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-io.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-checksums.a"
"${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-compression.a"
@ -75,6 +76,10 @@ add_library(awssdk_c_io STATIC IMPORTED)
add_dependencies(awssdk_c_io awssdk_project)
set_target_properties(awssdk_c_io PROPERTIES IMPORTED_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-io.a")
add_library(awssdk_c_sdkutils STATIC IMPORTED)
add_dependencies(awssdk_c_sdkutils awssdk_project)
set_target_properties(awssdk_c_sdkutils PROPERTIES IMPORTED_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-c-sdkutils.a")
add_library(awssdk_checksums STATIC IMPORTED)
add_dependencies(awssdk_checksums awssdk_project)
set_target_properties(awssdk_checksums PROPERTIES IMPORTED_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/lib64/libaws-checksums.a")
@ -94,4 +99,4 @@ set_target_properties(awssdk_c_common PROPERTIES IMPORTED_LOCATION "${CMAKE_CURR
# link them all together in one interface target
add_library(awssdk_target INTERFACE)
target_include_directories(awssdk_target SYSTEM INTERFACE ${CMAKE_CURRENT_BINARY_DIR}/awssdk-build/install/include)
target_link_libraries(awssdk_target INTERFACE awssdk_core awssdk_crt awssdk_c_s3 awssdk_c_auth awssdk_c_eventstream awssdk_c_http awssdk_c_mqtt awssdk_c_io awssdk_checksums awssdk_c_compression awssdk_c_cal awssdk_c_common curl)
target_link_libraries(awssdk_target INTERFACE awssdk_core awssdk_crt awssdk_c_s3 awssdk_c_auth awssdk_c_eventstream awssdk_c_http awssdk_c_mqtt awssdk_c_sdkutils awssdk_c_io awssdk_checksums awssdk_c_compression awssdk_c_cal awssdk_c_common curl)

View File

@ -4,4 +4,6 @@
export ASAN_OPTIONS="detect_leaks=0"
OLDBINDIR="${OLDBINDIR:-/app/deploy/global_data/oldBinaries}"
mono bin/TestHarness.exe joshua-run "${OLDBINDIR}" false
#mono bin/TestHarness.exe joshua-run "${OLDBINDIR}" false
python3 -m test_harness.app -s ${JOSHUA_SEED} --old-binaries-path ${OLDBINDIR}

View File

@ -1,4 +1,4 @@
#!/bin/bash -u
for file in `find . -name 'trace*.xml'` ; do
mono ./bin/TestHarness.exe summarize "${file}" summary.xml "" JoshuaTimeout true
done
python3 -m test_harness.timeout

View File

@ -1,3 +1,3 @@
#!/bin/sh
OLDBINDIR="${OLDBINDIR:-/app/deploy/global_data/oldBinaries}"
mono bin/TestHarness.exe joshua-run "${OLDBINDIR}" true
python3 -m test_harness.app -s ${JOSHUA_SEED} --old-binaries-path ${OLDBINDIR} --use-valgrind

View File

@ -1,6 +1,2 @@
#!/bin/bash -u
for file in `find . -name 'trace*.xml'` ; do
for valgrindFile in `find . -name 'valgrind*.xml'` ; do
mono ./bin/TestHarness.exe summarize "${file}" summary.xml "${valgrindFile}" JoshuaTimeout true
done
done
python3 -m test_harness.timeout --use-valgrind

View File

@ -19,6 +19,7 @@
*/
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
@ -302,6 +303,7 @@ namespace SummarizeTest
uniqueFileSet.Add(file.Substring(0, file.LastIndexOf("-"))); // all restarting tests end with -1.txt or -2.txt
}
uniqueFiles = uniqueFileSet.ToArray();
Array.Sort(uniqueFiles);
testFile = random.Choice(uniqueFiles);
// The on-disk format changed in 4.0.0, and 5.x can't load files from 3.x.
string oldBinaryVersionLowerBound = "4.0.0";
@ -334,8 +336,9 @@ namespace SummarizeTest
// thus, by definition, if "until_" appears, we do not want to run with the current binary version
oldBinaries = oldBinaries.Concat(currentBinary);
}
List<string> oldBinariesList = oldBinaries.ToList<string>();
if (oldBinariesList.Count == 0) {
string[] oldBinariesList = oldBinaries.ToArray<string>();
Array.Sort(oldBinariesList);
if (oldBinariesList.Count() == 0) {
// In theory, restarting tests are named to have at least one old binary version to run
// But if none of the provided old binaries fall in the range, we just skip the test
Console.WriteLine("No available old binary version from {0} to {1}", oldBinaryVersionLowerBound, oldBinaryVersionUpperBound);
@ -347,6 +350,7 @@ namespace SummarizeTest
else
{
uniqueFiles = Directory.GetFiles(testDir);
Array.Sort(uniqueFiles);
testFile = random.Choice(uniqueFiles);
}
}
@ -487,6 +491,16 @@ namespace SummarizeTest
useValgrind ? "on" : "off");
}
IDictionary data = Environment.GetEnvironmentVariables();
foreach (DictionaryEntry i in data)
{
string k = (string)i.Key;
string v = (string)i.Value;
if (k.StartsWith("FDB_KNOB")) {
process.StartInfo.EnvironmentVariables[k] = v;
}
}
process.Start();
// SOMEDAY: Do we want to actually do anything with standard output or error?
@ -718,7 +732,7 @@ namespace SummarizeTest
process.Refresh();
if (process.HasExited)
return;
long mem = process.PrivateMemorySize64;
long mem = process.PagedMemorySize64;
MaxMem = Math.Max(MaxMem, mem);
//Console.WriteLine(string.Format("Process used {0} bytes", MaxMem));
Thread.Sleep(1000);
@ -744,16 +758,28 @@ namespace SummarizeTest
AppendToSummary(summaryFileName, xout);
}
// Parses the valgrind XML file and returns a list of "what" tags for each error.
static string ParseValgrindStack(XElement stackElement) {
string backtrace = "";
foreach (XElement frame in stackElement.Elements()) {
backtrace += " " + frame.Element("ip").Value.ToLower();
}
if (backtrace.Length > 0) {
backtrace = "addr2line -e fdbserver.debug -p -C -f -i" + backtrace;
}
return backtrace;
}
// Parses the valgrind XML file and returns a list of error elements.
// All errors for which the "kind" tag starts with "Leak" are ignored
static string[] ParseValgrindOutput(string valgrindOutputFileName, bool traceToStdout)
static XElement[] ParseValgrindOutput(string valgrindOutputFileName, bool traceToStdout)
{
if (!traceToStdout)
{
Console.WriteLine("Reading vXML file: " + valgrindOutputFileName);
}
ISet<string> whats = new HashSet<string>();
IList<XElement> errors = new List<XElement>();
XElement xdoc = XDocument.Load(valgrindOutputFileName).Element("valgrindoutput");
foreach(var elem in xdoc.Elements()) {
if (elem.Name != "error")
@ -761,9 +787,29 @@ namespace SummarizeTest
string kind = elem.Element("kind").Value;
if(kind.StartsWith("Leak"))
continue;
whats.Add(elem.Element("what").Value);
XElement errorElement = new XElement("ValgrindError",
new XAttribute("Severity", (int)Magnesium.Severity.SevError));
int num = 1;
string suffix = "";
foreach (XElement sub in elem.Elements()) {
if (sub.Name == "what") {
errorElement.SetAttributeValue("What", sub.Value);
} else if (sub.Name == "auxwhat") {
suffix = "Aux" + num++;
errorElement.SetAttributeValue("What" + suffix, sub.Value);
} else if (sub.Name == "stack") {
errorElement.SetAttributeValue("Backtrace" + suffix, ParseValgrindStack(sub));
} else if (sub.Name == "origin") {
errorElement.SetAttributeValue("WhatOrigin", sub.Element("what").Value);
errorElement.SetAttributeValue("BacktraceOrigin", ParseValgrindStack(sub.Element("stack")));
}
}
errors.Add(errorElement);
}
return whats.ToArray();
return errors.ToArray();
}
delegate IEnumerable<Magnesium.Event> parseDelegate(System.IO.Stream stream, string file,
@ -927,6 +973,10 @@ namespace SummarizeTest
{
xout.Add(new XElement(ev.Type, new XAttribute("File", ev.Details.File), new XAttribute("Line", ev.Details.Line)));
}
if (ev.Type == "RunningUnitTest")
{
xout.Add(new XElement(ev.Type, new XAttribute("Name", ev.Details.Name), new XAttribute("File", ev.Details.File), new XAttribute("Line", ev.Details.Line)));
}
if (ev.Type == "TestsExpectedToPass")
testCount = int.Parse(ev.Details.Count);
if (ev.Type == "TestResults" && ev.Details.Passed == "1")
@ -1065,12 +1115,10 @@ namespace SummarizeTest
try
{
// If there are any errors reported "ok" will be set to false
var whats = ParseValgrindOutput(valgrindOutputFileName, traceToStdout);
foreach (var what in whats)
var valgrindErrors = ParseValgrindOutput(valgrindOutputFileName, traceToStdout);
foreach (var vError in valgrindErrors)
{
xout.Add(new XElement("ValgrindError",
new XAttribute("Severity", (int)Magnesium.Severity.SevError),
new XAttribute("What", what)));
xout.Add(vError);
ok = false;
error = true;
}

2
contrib/TestHarness2/.gitignore vendored Normal file
View File

@ -0,0 +1,2 @@
/tmp/
/venv

View File

@ -0,0 +1,2 @@
# Currently this file is left intentionally empty. Its main job for now is to indicate that this directory
# should be used as a module.

View File

@ -0,0 +1,25 @@
import argparse
import sys
import traceback
from test_harness.config import config
from test_harness.run import TestRunner
from test_harness.summarize import SummaryTree
if __name__ == '__main__':
try:
parser = argparse.ArgumentParser('TestHarness', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
config.build_arguments(parser)
args = parser.parse_args()
config.extract_args(args)
test_runner = TestRunner()
if not test_runner.run():
exit(1)
except Exception as e:
_, _, exc_traceback = sys.exc_info()
error = SummaryTree('TestHarnessError')
error.attributes['Severity'] = '40'
error.attributes['ErrorMessage'] = str(e)
error.attributes['Trace'] = repr(traceback.format_tb(exc_traceback))
error.dump(sys.stdout)
exit(1)

View File

@ -0,0 +1,263 @@
from __future__ import annotations
import argparse
import collections
import copy
import os
import random
from enum import Enum
from pathlib import Path
from typing import List, Any, OrderedDict, Dict
class BuggifyOptionValue(Enum):
ON = 1
OFF = 2
RANDOM = 3
class BuggifyOption:
def __init__(self, val: str | None = None):
self.value = BuggifyOptionValue.RANDOM
if val is not None:
v = val.lower()
if v in ['on', '1', 'true']:
self.value = BuggifyOptionValue.ON
elif v in ['off', '0', 'false']:
self.value = BuggifyOptionValue.OFF
elif v in ['random', 'rnd', 'r']:
pass
else:
assert False, 'Invalid value {} -- use true, false, or random'.format(v)
class ConfigValue:
def __init__(self, name: str, **kwargs):
self.name = name
self.value = None
self.kwargs = kwargs
if 'default' in self.kwargs:
self.value = self.kwargs['default']
def get_arg_name(self) -> str:
if 'long_name' in self.kwargs:
return self.kwargs['long_name']
else:
return self.name
def add_to_args(self, parser: argparse.ArgumentParser):
kwargs = copy.copy(self.kwargs)
long_name = self.name
short_name = None
if 'long_name' in kwargs:
long_name = kwargs['long_name']
del kwargs['long_name']
if 'short_name' in kwargs:
short_name = kwargs['short_name']
del kwargs['short_name']
if 'action' in kwargs and kwargs['action'] in ['store_true', 'store_false']:
del kwargs['type']
long_name = long_name.replace('_', '-')
if short_name is None:
# line below is useful for debugging
# print('add_argument(\'--{}\', [{{{}}}])'.format(long_name, ', '.join(['\'{}\': \'{}\''.format(k, v)
# for k, v in kwargs.items()])))
parser.add_argument('--{}'.format(long_name), **kwargs)
else:
# line below is useful for debugging
# print('add_argument(\'-{}\', \'--{}\', [{{{}}}])'.format(short_name, long_name,
# ', '.join(['\'{}\': \'{}\''.format(k, v)
# for k, v in kwargs.items()])))
parser.add_argument('-{}'.format(short_name), '--{}'.format(long_name), **kwargs)
def get_value(self, args: argparse.Namespace) -> tuple[str, Any]:
return self.name, args.__getattribute__(self.get_arg_name())
class Config:
"""
This is the central configuration class for the test harness. The values in this class are exposed globally through
a global variable, test_harness.config.config. This class provides some "magic" to keep the test harness flexible.
Each parameter can further be configured using an `_args` member variable which is expected to be a dictionary.
* The value of any variable can be set through the command line. For a variable named `variable_name` we will
by default create a new command line option `--variable-name` (`_` is automatically changed to `-`). This
default can be changed by setting the `'long_name'` property in the `_arg` dict.
* In addition, the user can optionally set a short name. This can be achieved by setting the `'short_name'`
property in the `_arg` dictionary.
* All additional properties in `_args` are passed to `argparse.add_argument`.
* If the default of a variable is `None` the user should explicitly set the `'type'` property to an appropriate
type.
* In addition to command line flags, all configuration options can also be controlled through environment variables.
By default, `variable-name` can be changed by setting the environment variable `TH_VARIABLE_NAME`. This default
can be changed by setting the `'env_name'` property.
* Test harness comes with multiple executables. Each of these should use the config facility. For this,
`Config.build_arguments` should be called first with the `argparse` parser. Then `Config.extract_args` needs
to be called with the result of `argparse.ArgumentParser.parse_args`. A minimal example could look like this:
```
parser = argparse.ArgumentParser('TestHarness', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
config.build_arguments(parser)
args = parser.parse_args()
config.extract_args(args)
```
* Changing the default value for all executables might not always be desirable. If it should only be changed for
one executable, Config.change_default should be used.
"""
def __init__(self):
self.random = random.Random()
self.cluster_file: str | None = None
self.cluster_file_args = {'short_name': 'C', 'type': str, 'help': 'Path to fdb cluster file', 'required': False,
'env_name': 'JOSHUA_CLUSTER_FILE'}
self.joshua_dir: str | None = None
self.joshua_dir_args = {'type': str, 'help': 'Where to write FDB data to', 'required': False,
'env_name': 'JOSHUA_APP_DIR'}
self.stats: str | None = None
self.stats_args = {'type': str, 'help': 'A base64 encoded list of statistics (used to reproduce runs)',
'required': False}
self.random_seed: int | None = None
self.random_seed_args = {'type': int,
'help': 'Force given seed given to fdbserver -- mostly useful for debugging',
'required': False}
self.kill_seconds: int = 30 * 60
self.kill_seconds_args = {'help': 'Timeout for individual test'}
self.buggify_on_ratio: float = 0.8
self.buggify_on_ratio_args = {'help': 'Probability that buggify is turned on'}
self.write_run_times = False
self.write_run_times_args = {'help': 'Write back probabilities after each test run',
'action': 'store_true'}
self.unseed_check_ratio: float = 0.05
self.unseed_check_ratio_args = {'help': 'Probability for doing determinism check'}
self.test_dirs: List[str] = ['slow', 'fast', 'restarting', 'rare', 'noSim']
self.test_dirs_args: dict = {'nargs': '*', 'help': 'test_directories to look for files in'}
self.trace_format: str = 'json'
self.trace_format_args = {'choices': ['json', 'xml'], 'help': 'What format fdb should produce'}
self.crash_on_error: bool = True
self.crash_on_error_args = {'long_name': 'no_crash', 'action': 'store_false',
'help': 'Don\'t crash on first error'}
self.max_warnings: int = 10
self.max_warnings_args = {'short_name': 'W'}
self.max_errors: int = 10
self.max_errors_args = {'short_name': 'E'}
self.old_binaries_path: Path = Path('/app/deploy/global_data/oldBinaries/')
self.old_binaries_path_args = {'help': 'Path to the directory containing the old fdb binaries'}
self.use_valgrind: bool = False
self.use_valgrind_args = {'action': 'store_true'}
self.buggify = BuggifyOption('random')
self.buggify_args = {'short_name': 'b', 'choices': ['on', 'off', 'random']}
self.pretty_print: bool = False
self.pretty_print_args = {'short_name': 'P', 'action': 'store_true'}
self.clean_up: bool = True
self.clean_up_args = {'long_name': 'no_clean_up', 'action': 'store_false'}
self.run_dir: Path = Path('tmp')
self.joshua_seed: int = random.randint(0, 2 ** 32 - 1)
self.joshua_seed_args = {'short_name': 's', 'help': 'A random seed', 'env_name': 'JOSHUA_SEED'}
self.print_coverage = False
self.print_coverage_args = {'action': 'store_true'}
self.binary = Path('bin') / ('fdbserver.exe' if os.name == 'nt' else 'fdbserver')
self.binary_args = {'help': 'Path to executable'}
self.hit_per_runs_ratio: int = 20000
self.hit_per_runs_ratio_args = {'help': 'Maximum test runs before each code probe hit at least once'}
self.output_format: str = 'xml'
self.output_format_args = {'short_name': 'O', 'choices': ['json', 'xml'],
'help': 'What format TestHarness should produce'}
self.include_test_files: str = r'.*'
self.include_test_files_args = {'help': 'Only consider test files whose path match against the given regex'}
self.exclude_test_files: str = r'.^'
self.exclude_test_files_args = {'help': 'Don\'t consider test files whose path match against the given regex'}
self.include_test_classes: str = r'.*'
self.include_test_classes_args = {'help': 'Only consider tests whose names match against the given regex'}
self.exclude_test_names: str = r'.^'
self.exclude_test_names_args = {'help': 'Don\'t consider tests whose names match against the given regex'}
self.details: bool = False
self.details_args = {'help': 'Print detailed results', 'short_name': 'c', 'action': 'store_true'}
self.success: bool = False
self.success_args = {'help': 'Print successful results', 'action': 'store_true'}
self.cov_include_files: str = r'.*'
self.cov_include_files_args = {'help': 'Only consider coverage traces that originated in files matching regex'}
self.cov_exclude_files: str = r'.^'
self.cov_exclude_files_args = {'help': 'Ignore coverage traces that originated in files matching regex'}
self.max_stderr_bytes: int = 1000
self.write_stats: bool = True
self.read_stats: bool = True
self.reproduce_prefix: str | None = None
self.reproduce_prefix_args = {'type': str, 'required': False,
'help': 'When printing the results, prepend this string to the command'}
self._env_names: Dict[str, str] = {}
self._config_map = self._build_map()
self._read_env()
self.random.seed(self.joshua_seed, version=2)
def change_default(self, attr: str, default_val):
assert attr in self._config_map, 'Unknown config attribute {}'.format(attr)
self.__setattr__(attr, default_val)
self._config_map[attr].kwargs['default'] = default_val
def _get_env_name(self, var_name: str) -> str:
return self._env_names.get(var_name, 'TH_{}'.format(var_name.upper()))
def dump(self):
for attr in dir(self):
obj = getattr(self, attr)
if attr == 'random' or attr.startswith('_') or callable(obj) or attr.endswith('_args'):
continue
print('config.{}: {} = {}'.format(attr, type(obj), obj))
def _build_map(self) -> OrderedDict[str, ConfigValue]:
config_map: OrderedDict[str, ConfigValue] = collections.OrderedDict()
for attr in dir(self):
obj = getattr(self, attr)
if attr == 'random' or attr.startswith('_') or callable(obj):
continue
if attr.endswith('_args'):
name = attr[0:-len('_args')]
assert name in config_map
assert isinstance(obj, dict)
for k, v in obj.items():
if k == 'env_name':
self._env_names[name] = v
else:
config_map[name].kwargs[k] = v
else:
# attribute_args has to be declared after the attribute
assert attr not in config_map
val_type = type(obj)
kwargs = {'type': val_type, 'default': obj}
config_map[attr] = ConfigValue(attr, **kwargs)
return config_map
def _read_env(self):
for attr in dir(self):
obj = getattr(self, attr)
if attr == 'random' or attr.startswith('_') or attr.endswith('_args') or callable(obj):
continue
env_name = self._get_env_name(attr)
attr_type = self._config_map[attr].kwargs['type']
assert type(None) != attr_type
e = os.getenv(env_name)
if e is not None:
# Use the env var to supply the default value, so that if the
# environment variable is set and the corresponding command line
# flag is not, the environment variable has an effect.
self._config_map[attr].kwargs['default'] = attr_type(e)
def build_arguments(self, parser: argparse.ArgumentParser):
for val in self._config_map.values():
val.add_to_args(parser)
def extract_args(self, args: argparse.Namespace):
for val in self._config_map.values():
k, v = val.get_value(args)
if v is not None:
config.__setattr__(k, v)
self.random.seed(self.joshua_seed, version=2)
config = Config()
if __name__ == '__main__':
# test the config setup
parser = argparse.ArgumentParser('TestHarness Config Tester',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
config.build_arguments(parser)
args = parser.parse_args()
config.extract_args(args)
config.dump()

View File

@ -0,0 +1,144 @@
from __future__ import annotations
from typing import OrderedDict, Tuple, List
import collections
import fdb
import fdb.tuple
import struct
from test_harness.run import StatFetcher, TestDescription
from test_harness.config import config
from test_harness.summarize import SummaryTree, Coverage
# Before increasing this, make sure that all Joshua clusters (at Apple and Snowflake) have been upgraded.
# This version needs to be changed if we either need newer features from FDB or the current API version is
# getting retired.
fdb.api_version(630)
def str_to_tuple(s: str | None):
if s is None:
return s
return tuple(s.split(','))
fdb_db = None
def open_db(cluster_file: str | None):
global fdb_db
if fdb_db is None:
fdb_db = fdb.open(cluster_file)
return fdb_db
def chunkify(iterable, sz: int):
res = []
for item in iterable:
res.append(item)
if len(res) >= sz:
yield res
res = []
if len(res) > 0:
yield res
@fdb.transactional
def write_coverage_chunk(tr, path: Tuple[str, ...], metadata: Tuple[str, ...],
coverage: List[Tuple[Coverage, bool]], initialized: bool) -> bool:
cov_dir = fdb.directory.create_or_open(tr, path)
if not initialized:
metadata_dir = fdb.directory.create_or_open(tr, metadata)
v = tr[metadata_dir['initialized']]
initialized = v.present()
for cov, covered in coverage:
if not initialized or covered:
tr.add(cov_dir.pack((cov.file, cov.line, cov.comment)), struct.pack('<I', 1 if covered else 0))
return initialized
@fdb.transactional
def set_initialized(tr, metadata: Tuple[str, ...]):
metadata_dir = fdb.directory.create_or_open(tr, metadata)
tr[metadata_dir['initialized']] = fdb.tuple.pack((True,))
def write_coverage(cluster_file: str | None, cov_path: Tuple[str, ...], metadata: Tuple[str, ...],
coverage: OrderedDict[Coverage, bool]):
db = open_db(cluster_file)
assert config.joshua_dir is not None
initialized: bool = False
for chunk in chunkify(coverage.items(), 100):
initialized = write_coverage_chunk(db, cov_path, metadata, chunk, initialized)
if not initialized:
set_initialized(db, metadata)
@fdb.transactional
def _read_coverage(tr, cov_path: Tuple[str, ...]) -> OrderedDict[Coverage, int]:
res = collections.OrderedDict()
cov_dir = fdb.directory.create_or_open(tr, cov_path)
for k, v in tr[cov_dir.range()]:
file, line, comment = cov_dir.unpack(k)
count = struct.unpack('<I', v)[0]
res[Coverage(file, line, comment)] = count
return res
def read_coverage(cluster_file: str | None, cov_path: Tuple[str, ...]) -> OrderedDict[Coverage, int]:
db = open_db(cluster_file)
return _read_coverage(db, cov_path)
class TestStatistics:
def __init__(self, runtime: int, run_count: int):
self.runtime: int = runtime
self.run_count: int = run_count
class Statistics:
def __init__(self, cluster_file: str | None, joshua_dir: Tuple[str, ...]):
self.db = open_db(cluster_file)
self.stats_dir = self.open_stats_dir(self.db, joshua_dir)
self.stats: OrderedDict[str, TestStatistics] = self.read_stats_from_db(self.db)
@fdb.transactional
def open_stats_dir(self, tr, app_dir: Tuple[str]):
stats_dir = app_dir + ('runtime_stats',)
return fdb.directory.create_or_open(tr, stats_dir)
@fdb.transactional
def read_stats_from_db(self, tr) -> OrderedDict[str, TestStatistics]:
result = collections.OrderedDict()
for k, v in tr[self.stats_dir.range()]:
test_name = self.stats_dir.unpack(k)[0]
runtime, run_count = struct.unpack('<II', v)
result[test_name] = TestStatistics(runtime, run_count)
return result
@fdb.transactional
def _write_runtime(self, tr, test_name: str, time: int) -> None:
key = self.stats_dir.pack((test_name,))
tr.add(key, struct.pack('<II', time, 1))
def write_runtime(self, test_name: str, time: int) -> None:
assert self.db is not None
self._write_runtime(self.db, test_name, time)
class FDBStatFetcher(StatFetcher):
def __init__(self, tests: OrderedDict[str, TestDescription],
joshua_dir: Tuple[str] = str_to_tuple(config.joshua_dir)):
super().__init__(tests)
self.statistics = Statistics(config.cluster_file, joshua_dir)
def read_stats(self):
for k, v in self.statistics.stats.items():
if k in self.tests.keys():
self.tests[k].total_runtime = v.runtime
self.tests[k].num_runs = v.run_count
def add_run_time(self, test_name: str, runtime: int, out: SummaryTree):
self.statistics.write_runtime(test_name, runtime)
super().add_run_time(test_name, runtime, out)

View File

@ -0,0 +1,161 @@
from __future__ import annotations
import collections
import io
import sys
import xml.sax
import xml.sax.handler
from pathlib import Path
from typing import List, OrderedDict, Set
from joshua import joshua_model
import test_harness.run
from test_harness.config import config
from test_harness.summarize import SummaryTree
class ToSummaryTree(xml.sax.handler.ContentHandler):
def __init__(self):
super().__init__()
self.root: SummaryTree | None = None
self.stack: List[SummaryTree] = []
def result(self) -> SummaryTree:
assert len(self.stack) == 0 and self.root is not None, 'Parse Error'
return self.root
def startElement(self, name, attrs):
new_child = SummaryTree(name)
for k, v in attrs.items():
new_child.attributes[k] = v
self.stack.append(new_child)
def endElement(self, name):
closed = self.stack.pop()
assert closed.name == name
if len(self.stack) == 0:
self.root = closed
else:
self.stack[-1].children.append(closed)
def _print_summary(summary: SummaryTree, commands: Set[str]):
cmd = []
if config.reproduce_prefix is not None:
cmd.append(config.reproduce_prefix)
cmd.append('fdbserver')
if 'TestFile' in summary.attributes:
file_name = summary.attributes['TestFile']
role = 'test' if test_harness.run.is_no_sim(Path(file_name)) else 'simulation'
cmd += ['-r', role, '-f', file_name]
else:
cmd += ['-r', 'simulation', '-f', '<ERROR>']
if 'RandomSeed' in summary.attributes:
cmd += ['-s', summary.attributes['RandomSeed']]
else:
cmd += ['-s', '<Error>']
if 'BuggifyEnabled' in summary.attributes:
arg = 'on'
if summary.attributes['BuggifyEnabled'].lower() in ['0', 'off', 'false']:
arg = 'off'
cmd += ['-b', arg]
else:
cmd += ['-b', '<ERROR>']
cmd += ['--crash', '--trace_format', config.trace_format]
key = ' '.join(cmd)
count = 1
while key in commands:
key = '{} # {}'.format(' '.join(cmd), count)
count += 1
# we want the command as the first attribute
attributes = {'Command': ' '.join(cmd)}
for k, v in summary.attributes.items():
if k == 'Errors':
attributes['ErrorCount'] = v
else:
attributes[k] = v
summary.attributes = attributes
if config.details:
key = str(len(commands))
str_io = io.StringIO()
summary.dump(str_io, prefix=(' ' if config.pretty_print else ''))
if config.output_format == 'json':
sys.stdout.write('{}"Test{}": {}'.format(' ' if config.pretty_print else '',
key, str_io.getvalue()))
else:
sys.stdout.write(str_io.getvalue())
if config.pretty_print:
sys.stdout.write('\n' if config.output_format == 'xml' else ',\n')
return key
error_count = 0
warning_count = 0
small_summary = SummaryTree('Test')
small_summary.attributes = attributes
errors = SummaryTree('Errors')
warnings = SummaryTree('Warnings')
buggifies: OrderedDict[str, List[int]] = collections.OrderedDict()
for child in summary.children:
if 'Severity' in child.attributes and child.attributes['Severity'] == '40' and error_count < config.max_errors:
error_count += 1
errors.append(child)
if 'Severity' in child.attributes and child.attributes[
'Severity'] == '30' and warning_count < config.max_warnings:
warning_count += 1
warnings.append(child)
if child.name == 'BuggifySection':
file = child.attributes['File']
line = int(child.attributes['Line'])
buggifies.setdefault(file, []).append(line)
buggifies_elem = SummaryTree('Buggifies')
for file, lines in buggifies.items():
lines.sort()
if config.output_format == 'json':
buggifies_elem.attributes[file] = ' '.join(str(line) for line in lines)
else:
child = SummaryTree('Buggify')
child.attributes['File'] = file
child.attributes['Lines'] = ' '.join(str(line) for line in lines)
small_summary.append(child)
small_summary.children.append(buggifies_elem)
if len(errors.children) > 0:
small_summary.children.append(errors)
if len(warnings.children) > 0:
small_summary.children.append(warnings)
output = io.StringIO()
small_summary.dump(output, prefix=(' ' if config.pretty_print else ''))
if config.output_format == 'json':
sys.stdout.write('{}"{}": {}'.format(' ' if config.pretty_print else '', key, output.getvalue().strip()))
else:
sys.stdout.write('{}{}'.format(' ' if config.pretty_print else '', output.getvalue().strip()))
sys.stdout.write('\n' if config.output_format == 'xml' else ',\n')
def print_errors(ensemble_id: str):
joshua_model.open(config.cluster_file)
properties = joshua_model.get_ensemble_properties(ensemble_id)
compressed = properties["compressed"] if "compressed" in properties else False
for rec in joshua_model.tail_results(ensemble_id, errors_only=(not config.success), compressed=compressed):
if len(rec) == 5:
version_stamp, result_code, host, seed, output = rec
elif len(rec) == 4:
version_stamp, result_code, host, output = rec
seed = None
elif len(rec) == 3:
version_stamp, result_code, output = rec
host = None
seed = None
elif len(rec) == 2:
version_stamp, seed = rec
output = str(joshua_model.fdb.tuple.unpack(seed)[0]) + "\n"
result_code = None
host = None
seed = None
else:
raise Exception("Unknown result format")
lines = output.splitlines()
commands: Set[str] = set()
for line in lines:
summary = ToSummaryTree()
xml.sax.parseString(line, summary)
commands.add(_print_summary(summary.result(), commands))

View File

@ -0,0 +1,144 @@
from __future__ import annotations
import argparse
import io
import json
import re
import sys
import test_harness.fdb
from typing import List, Tuple, OrderedDict
from test_harness.summarize import SummaryTree, Coverage
from test_harness.config import config
from xml.sax.saxutils import quoteattr
class GlobalStatistics:
def __init__(self):
self.total_probes_hit: int = 0
self.total_cpu_time: int = 0
self.total_test_runs: int = 0
self.total_missed_probes: int = 0
class EnsembleResults:
def __init__(self, cluster_file: str | None, ensemble_id: str):
self.global_statistics = GlobalStatistics()
self.fdb_path = ('joshua', 'ensembles', 'results', 'application', ensemble_id)
self.coverage_path = self.fdb_path + ('coverage',)
self.statistics = test_harness.fdb.Statistics(cluster_file, self.fdb_path)
coverage_dict: OrderedDict[Coverage, int] = test_harness.fdb.read_coverage(cluster_file, self.coverage_path)
self.coverage: List[Tuple[Coverage, int]] = []
self.min_coverage_hit: int | None = None
self.ratio = self.global_statistics.total_test_runs / config.hit_per_runs_ratio
for cov, count in coverage_dict.items():
if re.search(config.cov_include_files, cov.file) is None:
continue
if re.search(config.cov_exclude_files, cov.file) is not None:
continue
self.global_statistics.total_probes_hit += count
self.coverage.append((cov, count))
if count <= self.ratio:
self.global_statistics.total_missed_probes += 1
if self.min_coverage_hit is None or self.min_coverage_hit > count:
self.min_coverage_hit = count
self.coverage.sort(key=lambda x: (x[1], x[0].file, x[0].line))
self.stats: List[Tuple[str, int, int]] = []
for k, v in self.statistics.stats.items():
self.global_statistics.total_test_runs += v.run_count
self.global_statistics.total_cpu_time += v.runtime
self.stats.append((k, v.runtime, v.run_count))
self.stats.sort(key=lambda x: x[1], reverse=True)
if self.min_coverage_hit is not None:
self.coverage_ok = self.min_coverage_hit > self.ratio
else:
self.coverage_ok = False
def dump(self, prefix: str):
errors = 0
out = SummaryTree('EnsembleResults')
out.attributes['TotalRuntime'] = str(self.global_statistics.total_cpu_time)
out.attributes['TotalTestRuns'] = str(self.global_statistics.total_test_runs)
out.attributes['TotalProbesHit'] = str(self.global_statistics.total_probes_hit)
out.attributes['MinProbeHit'] = str(self.min_coverage_hit)
out.attributes['TotalProbes'] = str(len(self.coverage))
out.attributes['MissedProbes'] = str(self.global_statistics.total_missed_probes)
for cov, count in self.coverage:
severity = 10 if count > self.ratio else 40
if severity == 40:
errors += 1
if (severity == 40 and errors <= config.max_errors) or config.details:
child = SummaryTree('CodeProbe')
child.attributes['Severity'] = str(severity)
child.attributes['File'] = cov.file
child.attributes['Line'] = str(cov.line)
child.attributes['Comment'] = '' if cov.comment is None else cov.comment
child.attributes['HitCount'] = str(count)
out.append(child)
if config.details:
for k, runtime, run_count in self.stats:
child = SummaryTree('Test')
child.attributes['Name'] = k
child.attributes['Runtime'] = str(runtime)
child.attributes['RunCount'] = str(run_count)
out.append(child)
if errors > 0:
out.attributes['Errors'] = str(errors)
str_io = io.StringIO()
out.dump(str_io, prefix=prefix, new_line=config.pretty_print)
if config.output_format == 'xml':
sys.stdout.write(str_io.getvalue())
else:
sys.stdout.write('{}"EnsembleResults":{}{}'.format(' ' if config.pretty_print else '',
'\n' if config.pretty_print else ' ',
str_io.getvalue()))
def write_header(ensemble_id: str):
if config.output_format == 'json':
if config.pretty_print:
print('{')
print(' "{}": {},\n'.format('ID', json.dumps(ensemble_id.strip())))
else:
sys.stdout.write('{{"{}": {},'.format('ID', json.dumps(ensemble_id.strip())))
elif config.output_format == 'xml':
sys.stdout.write('<Ensemble ID={}>'.format(quoteattr(ensemble_id.strip())))
if config.pretty_print:
sys.stdout.write('\n')
else:
assert False, 'unknown output format {}'.format(config.output_format)
def write_footer():
if config.output_format == 'xml':
sys.stdout.write('</Ensemble>\n')
elif config.output_format == 'json':
sys.stdout.write('}\n')
else:
assert False, 'unknown output format {}'.format(config.output_format)
if __name__ == '__main__':
parser = argparse.ArgumentParser('TestHarness Results', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
config.change_default('pretty_print', True)
config.change_default('max_warnings', 0)
config.build_arguments(parser)
parser.add_argument('ensemble_id', type=str, help='The ensemble to fetch the result for')
args = parser.parse_args()
config.extract_args(args)
config.output_format = args.output_format
write_header(args.ensemble_id)
try:
import test_harness.joshua
test_harness.joshua.print_errors(args.ensemble_id)
except ModuleNotFoundError:
child = SummaryTree('JoshuaNotFound')
child.attributes['Severity'] = '30'
child.attributes['Message'] = 'Could not import Joshua -- set PYTHONPATH to joshua checkout dir'
child.dump(sys.stdout, prefix=(' ' if config.pretty_print else ''), new_line=config.pretty_print)
results = EnsembleResults(config.cluster_file, args.ensemble_id)
results.dump(' ' if config.pretty_print else '')
write_footer()
exit(0 if results.coverage_ok else 1)

View File

@ -0,0 +1,465 @@
from __future__ import annotations
import array
import base64
import collections
import math
import os
import resource
import shutil
import subprocess
import re
import sys
import threading
import time
import uuid
from functools import total_ordering
from pathlib import Path
from test_harness.version import Version
from test_harness.config import config
from typing import List, Pattern, OrderedDict
from test_harness.summarize import Summary, SummaryTree
@total_ordering
class TestDescription:
def __init__(self, path: Path, name: str, priority: float):
self.paths: List[Path] = [path]
self.name = name
self.priority: float = priority
# we only measure in seconds. Otherwise, keeping determinism will be difficult
self.total_runtime: int = 0
self.num_runs: int = 0
def __lt__(self, other):
if isinstance(other, TestDescription):
return self.name < other.name
else:
return self.name < str(other)
def __eq__(self, other):
if isinstance(other, TestDescription):
return self.name == other.name
else:
return self.name == str(other)
class StatFetcher:
def __init__(self, tests: OrderedDict[str, TestDescription]):
self.tests = tests
def read_stats(self):
pass
def add_run_time(self, test_name: str, runtime: int, out: SummaryTree):
self.tests[test_name].total_runtime += runtime
class TestPicker:
def __init__(self, test_dir: Path):
if not test_dir.exists():
raise RuntimeError('{} is neither a directory nor a file'.format(test_dir))
self.include_files_regex = re.compile(config.include_test_files)
self.exclude_files_regex = re.compile(config.exclude_test_files)
self.include_tests_regex = re.compile(config.include_test_classes)
self.exclude_tests_regex = re.compile(config.exclude_test_names)
self.test_dir: Path = test_dir
self.tests: OrderedDict[str, TestDescription] = collections.OrderedDict()
self.restart_test: Pattern = re.compile(r".*-\d+\.(txt|toml)")
self.follow_test: Pattern = re.compile(r".*-[2-9]\d*\.(txt|toml)")
for subdir in self.test_dir.iterdir():
if subdir.is_dir() and subdir.name in config.test_dirs:
self.walk_test_dir(subdir)
self.stat_fetcher: StatFetcher
if config.stats is not None or config.joshua_dir is None:
self.stat_fetcher = StatFetcher(self.tests)
else:
from test_harness.fdb import FDBStatFetcher
self.stat_fetcher = FDBStatFetcher(self.tests)
if config.stats is not None:
self.load_stats(config.stats)
else:
self.fetch_stats()
def add_time(self, test_file: Path, run_time: int, out: SummaryTree) -> None:
# getting the test name is fairly inefficient. But since we only have 100s of tests, I won't bother
test_name: str | None = None
test_desc: TestDescription | None = None
for name, test in self.tests.items():
for p in test.paths:
test_files: List[Path]
if self.restart_test.match(p.name):
test_files = self.list_restart_files(p)
else:
test_files = [p]
for file in test_files:
if file.absolute() == test_file.absolute():
test_name = name
test_desc = test
break
if test_name is not None:
break
if test_name is not None:
break
assert test_name is not None and test_desc is not None
self.stat_fetcher.add_run_time(test_name, run_time, out)
out.attributes['TotalTestTime'] = str(test_desc.total_runtime)
out.attributes['TestRunCount'] = str(test_desc.num_runs)
def dump_stats(self) -> str:
res = array.array('I')
for _, spec in self.tests.items():
res.append(spec.total_runtime)
return base64.standard_b64encode(res.tobytes()).decode('utf-8')
def fetch_stats(self):
self.stat_fetcher.read_stats()
def load_stats(self, serialized: str):
times = array.array('I')
times.frombytes(base64.standard_b64decode(serialized))
assert len(times) == len(self.tests.items())
for idx, (_, spec) in enumerate(self.tests.items()):
spec.total_runtime = times[idx]
def parse_txt(self, path: Path):
if self.include_files_regex.search(str(path)) is None or self.exclude_files_regex.search(str(path)) is not None:
return
with path.open('r') as f:
test_name: str | None = None
test_class: str | None = None
priority: float | None = None
for line in f:
line = line.strip()
kv = line.split('=')
if len(kv) != 2:
continue
kv[0] = kv[0].strip()
kv[1] = kv[1].strip(' \r\n\t\'"')
if kv[0] == 'testTitle' and test_name is None:
test_name = kv[1]
if kv[0] == 'testClass' and test_class is None:
test_class = kv[1]
if kv[0] == 'testPriority' and priority is None:
try:
priority = float(kv[1])
except ValueError:
raise RuntimeError("Can't parse {} -- testPriority in {} should be set to a float".format(kv[1],
path))
if test_name is not None and test_class is not None and priority is not None:
break
if test_name is None:
return
if test_class is None:
test_class = test_name
if priority is None:
priority = 1.0
if self.include_tests_regex.search(test_class) is None \
or self.exclude_tests_regex.search(test_class) is not None:
return
if test_class not in self.tests:
self.tests[test_class] = TestDescription(path, test_class, priority)
else:
self.tests[test_class].paths.append(path)
def walk_test_dir(self, test: Path):
if test.is_dir():
for file in test.iterdir():
self.walk_test_dir(file)
else:
# check whether we're looking at a restart test
if self.follow_test.match(test.name) is not None:
return
if test.suffix == '.txt' or test.suffix == '.toml':
self.parse_txt(test)
@staticmethod
def list_restart_files(start_file: Path) -> List[Path]:
name = re.sub(r'-\d+.(txt|toml)', '', start_file.name)
res: List[Path] = []
for test_file in start_file.parent.iterdir():
if test_file.name.startswith(name):
res.append(test_file)
assert len(res) > 1
res.sort()
return res
def choose_test(self) -> List[Path]:
min_runtime: float | None = None
candidates: List[TestDescription] = []
for _, v in self.tests.items():
this_time = v.total_runtime * v.priority
if min_runtime is None or this_time < min_runtime:
min_runtime = this_time
candidates = [v]
elif this_time == min_runtime:
candidates.append(v)
candidates.sort()
choice = config.random.randint(0, len(candidates) - 1)
test = candidates[choice]
result = test.paths[config.random.randint(0, len(test.paths) - 1)]
if self.restart_test.match(result.name):
return self.list_restart_files(result)
else:
return [result]
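# choose_test favors the test class with the smallest priority-weighted total runtime
# (ties are broken uniformly at random), so over many runs every test class receives
# roughly the same amount of wall-clock time divided by its priority. A minimal usage
# sketch, assuming config has already been initialized elsewhere:
#
#     picker = TestPicker(Path('tests'))
#     files = picker.choose_test()   # one file, or the whole -1/-2/... restart chain
#
# For restart tests the returned list contains every file of the chain, in order.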
class OldBinaries:
def __init__(self):
self.first_file_expr = re.compile(r'.*-1\.(txt|toml)')
self.old_binaries_path: Path = config.old_binaries_path
self.binaries: OrderedDict[Version, Path] = collections.OrderedDict()
if not self.old_binaries_path.exists() or not self.old_binaries_path.is_dir():
return
exec_pattern = re.compile(r'fdbserver-\d+\.\d+\.\d+(\.exe)?')
for file in self.old_binaries_path.iterdir():
if not file.is_file() or not os.access(file, os.X_OK):
continue
if exec_pattern.fullmatch(file.name) is not None:
self._add_file(file)
def _add_file(self, file: Path):
version_str = file.name.split('-')[1]
if version_str.endswith('.exe'):
version_str = version_str[0:-len('.exe')]
ver = Version.parse(version_str)
self.binaries[ver] = file
def choose_binary(self, test_file: Path) -> Path:
if len(self.binaries) == 0:
return config.binary
max_version = Version.max_version()
min_version = Version.parse('5.0.0')
dirs = test_file.parent.parts
if 'restarting' not in dirs:
return config.binary
version_expr = dirs[-1].split('_')
first_file = self.first_file_expr.match(test_file.name) is not None
if first_file and version_expr[0] == 'to':
# downgrade test -- first binary should be current one
return config.binary
if not first_file and version_expr[0] == 'from':
# upgrade test -- we only return an old version for the first test file
return config.binary
if version_expr[0] == 'from' or version_expr[0] == 'to':
min_version = Version.parse(version_expr[1])
if len(version_expr) == 4 and version_expr[2] == 'until':
max_version = Version.parse(version_expr[3])
candidates: List[Path] = []
for ver, binary in self.binaries.items():
if min_version <= ver <= max_version:
candidates.append(binary)
if len(candidates) == 0:
return config.binary
return config.random.choice(candidates)
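# Old binaries are picked based on the directory naming convention of restart tests:
# in tests/restarting/from_X.Y.Z[_until_A.B.C] the first part of the chain runs on an old
# fdbserver between X.Y.Z and A.B.C (or the newest available) and the later parts run on
# the current binary, while to_X.Y.Z directories do the opposite. A hedged example of the
# expected layout (the paths are illustrative, not taken from the source):
#
#     tests/restarting/from_7.1.0/SomeTestRestart-1.toml   -> old binary >= 7.1.0
#     tests/restarting/from_7.1.0/SomeTestRestart-2.toml   -> current binary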
def is_restarting_test(test_file: Path):
for p in test_file.parts:
if p == 'restarting':
return True
return False
def is_no_sim(test_file: Path):
return test_file.parts[-2] == 'noSim'
class ResourceMonitor(threading.Thread):
def __init__(self):
super().__init__()
self.start_time = time.time()
self.end_time: float | None = None
self._stop_monitor = False
self.max_rss = 0
def run(self) -> None:
while not self._stop_monitor:
time.sleep(1)
resources = resource.getrusage(resource.RUSAGE_CHILDREN)
self.max_rss = max(resources.ru_maxrss, self.max_rss)
def stop(self):
self.end_time = time.time()
self._stop_monitor = True
def time(self):
return self.end_time - self.start_time
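# ResourceMonitor samples resource.getrusage(RUSAGE_CHILDREN) once a second on a
# background thread, so the peak RSS of the fdbserver child is known even though the
# process is gone by the time the summary is written (on Linux ru_maxrss is reported in
# kilobytes). A minimal usage sketch:
#
#     monitor = ResourceMonitor()
#     monitor.start()
#     ...          # run the child process to completion
#     monitor.stop()
#     monitor.join()
#     elapsed, peak = monitor.time(), monitor.max_rss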
class TestRun:
def __init__(self, binary: Path, test_file: Path, random_seed: int, uid: uuid.UUID,
restarting: bool = False, test_determinism: bool = False, buggify_enabled: bool = False,
stats: str | None = None, expected_unseed: int | None = None, will_restart: bool = False):
self.binary = binary
self.test_file = test_file
self.random_seed = random_seed
self.uid = uid
self.restarting = restarting
self.test_determinism = test_determinism
self.stats: str | None = stats
self.expected_unseed: int | None = expected_unseed
self.use_valgrind: bool = config.use_valgrind
self.old_binary_path: Path = config.old_binaries_path
self.buggify_enabled: bool = buggify_enabled
self.fault_injection_enabled: bool = True
self.trace_format: str | None = config.trace_format
if Version.of_binary(self.binary) < "6.1.0":
self.trace_format = None
self.temp_path = config.run_dir / str(self.uid)
# state for the run
self.retryable_error: bool = False
self.summary: Summary = Summary(binary, uid=self.uid, stats=self.stats, expected_unseed=self.expected_unseed,
will_restart=will_restart)
self.run_time: int = 0
self.success = self.run()
def log_test_plan(self, out: SummaryTree):
test_plan: SummaryTree = SummaryTree('TestPlan')
test_plan.attributes['TestUID'] = str(self.uid)
test_plan.attributes['RandomSeed'] = str(self.random_seed)
test_plan.attributes['TestFile'] = str(self.test_file)
test_plan.attributes['Buggify'] = '1' if self.buggify_enabled else '0'
test_plan.attributes['FaultInjectionEnabled'] = '1' if self.fault_injection_enabled else '0'
test_plan.attributes['DeterminismCheck'] = '1' if self.test_determinism else '0'
out.append(test_plan)
def delete_simdir(self):
shutil.rmtree(self.temp_path / Path('simfdb'))
def run(self):
command: List[str] = []
valgrind_file: Path | None = None
if self.use_valgrind:
command.append('valgrind')
valgrind_file = self.temp_path / Path('valgrind-{}.xml'.format(self.random_seed))
dbg_path = os.getenv('FDB_VALGRIND_DBGPATH')
if dbg_path is not None:
command.append('--extra-debuginfo-path={}'.format(dbg_path))
command += ['--xml=yes', '--xml-file={}'.format(valgrind_file.absolute()), '-q']
command += [str(self.binary.absolute()),
'-r', 'test' if is_no_sim(self.test_file) else 'simulation',
'-f', str(self.test_file),
'-s', str(self.random_seed)]
if self.trace_format is not None:
command += ['--trace_format', self.trace_format]
if Version.of_binary(self.binary) >= '7.1.0':
command += ['-fi', 'on' if self.fault_injection_enabled else 'off']
if self.restarting:
command.append('--restarting')
if self.buggify_enabled:
command += ['-b', 'on']
if config.crash_on_error:
command.append('--crash')
self.temp_path.mkdir(parents=True, exist_ok=True)
# self.log_test_plan(out)
resources = ResourceMonitor()
resources.start()
process = subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, cwd=self.temp_path,
text=True)
did_kill = False
timeout = 20 * config.kill_seconds if self.use_valgrind else config.kill_seconds
err_out: str
try:
_, err_out = process.communicate(timeout=timeout)
except subprocess.TimeoutExpired:
process.kill()
_, err_out = process.communicate()
did_kill = True
resources.stop()
resources.join()
# we're rounding times up, otherwise we will prefer running very short tests (<1s)
self.run_time = math.ceil(resources.time())
self.summary.runtime = resources.time()
self.summary.max_rss = resources.max_rss
self.summary.was_killed = did_kill
self.summary.valgrind_out_file = valgrind_file
self.summary.error_out = err_out
self.summary.summarize(self.temp_path, ' '.join(command))
return self.summary.ok()
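# The assembled command line is the usual simulation invocation, optionally wrapped in
# valgrind. A hedged sketch of what a single run may look like (the test path, seed and
# enabled options are illustrative, not taken from the source):
#
#     fdbserver -r simulation -f tests/fast/SomeTest.toml -s 123456 \
#         --trace_format json -fi on -b on --crash
#
# noSim tests use '-r test' instead of '-r simulation', and '--restarting' is appended for
# the second and later parts of a restart chain.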
def decorate_summary(out: SummaryTree, test_file: Path, seed: int, buggify: bool):
"""Sometimes a test can crash before ProgramStart is written to the traces. These
tests are then hard to reproduce (they can be reproduced through TestHarness but
require the user to run in the joshua docker container). To account for this we
will write the necessary information into the attributes if it is missing."""
if 'TestFile' not in out.attributes:
out.attributes['TestFile'] = str(test_file)
if 'RandomSeed' not in out.attributes:
out.attributes['RandomSeed'] = str(seed)
if 'BuggifyEnabled' not in out.attributes:
out.attributes['BuggifyEnabled'] = '1' if buggify else '0'
class TestRunner:
def __init__(self):
self.uid = uuid.uuid4()
self.test_path: Path = Path('tests')
self.cluster_file: str | None = None
self.fdb_app_dir: str | None = None
self.binary_chooser = OldBinaries()
self.test_picker = TestPicker(self.test_path)
def backup_sim_dir(self, seed: int):
temp_dir = config.run_dir / str(self.uid)
src_dir = temp_dir / 'simfdb'
assert src_dir.is_dir()
dest_dir = temp_dir / 'simfdb.{}'.format(seed)
assert not dest_dir.exists()
shutil.copytree(src_dir, dest_dir)
def restore_sim_dir(self, seed: int):
temp_dir = config.run_dir / str(self.uid)
src_dir = temp_dir / 'simfdb.{}'.format(seed)
assert src_dir.exists()
dest_dir = temp_dir / 'simfdb'
shutil.rmtree(dest_dir)
shutil.move(src_dir, dest_dir)
def run_tests(self, test_files: List[Path], seed: int, test_picker: TestPicker) -> bool:
result: bool = True
for count, file in enumerate(test_files):
will_restart = count + 1 < len(test_files)
binary = self.binary_chooser.choose_binary(file)
unseed_check = not is_no_sim(file) and config.random.random() < config.unseed_check_ratio
buggify_enabled: bool = config.random.random() < config.buggify_on_ratio
if unseed_check and count != 0:
# for restarting tests we will need to restore the sim2 after the first run
self.backup_sim_dir(seed + count - 1)
run = TestRun(binary, file.absolute(), seed + count, self.uid, restarting=count != 0,
stats=test_picker.dump_stats(), will_restart=will_restart, buggify_enabled=buggify_enabled)
result = result and run.success
test_picker.add_time(test_files[0], run.run_time, run.summary.out)
decorate_summary(run.summary.out, file, seed + count, run.buggify_enabled)
if unseed_check and run.summary.unseed:
run.summary.out.append(run.summary.list_simfdb())
run.summary.out.dump(sys.stdout)
if not result:
return False
if unseed_check and run.summary.unseed is not None:
if count != 0:
self.restore_sim_dir(seed + count - 1)
run2 = TestRun(binary, file.absolute(), seed + count, self.uid, restarting=count != 0,
stats=test_picker.dump_stats(), expected_unseed=run.summary.unseed,
will_restart=will_restart, buggify_enabled=buggify_enabled)
test_picker.add_time(file, run2.run_time, run.summary.out)
decorate_summary(run2.summary.out, file, seed + count, run.buggify_enabled)
run2.summary.out.dump(sys.stdout)
result = result and run2.success
if not result:
return False
return result
def run(self) -> bool:
seed = config.random_seed if config.random_seed is not None else config.random.randint(0, 2 ** 32 - 1)
test_files = self.test_picker.choose_test()
success = self.run_tests(test_files, seed, self.test_picker)
if config.clean_up:
shutil.rmtree(config.run_dir / str(self.uid))
return success
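# Putting the pieces together: TestRunner picks a test (or restart chain) with TestPicker,
# chooses a binary with OldBinaries, runs each part as a TestRun, and re-runs with the same
# seed whenever the unseed check is selected in order to catch non-determinism. A minimal
# driver sketch, assuming config has been populated from the command line elsewhere by the
# harness entry point:
#
#     runner = TestRunner()
#     ok = runner.run()
#     sys.exit(0 if ok else 1)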

View File

@@ -0,0 +1,620 @@
from __future__ import annotations
import collections
import inspect
import json
import os
import re
import sys
import traceback
import uuid
import xml.sax
import xml.sax.handler
import xml.sax.saxutils
from pathlib import Path
from typing import List, Dict, TextIO, Callable, Optional, OrderedDict, Any, Tuple, Iterator, Iterable
from test_harness.config import config
from test_harness.valgrind import parse_valgrind_output
class SummaryTree:
def __init__(self, name: str):
self.name = name
self.children: List[SummaryTree] = []
self.attributes: Dict[str, str] = {}
def append(self, element: SummaryTree):
self.children.append(element)
def to_dict(self, add_name: bool = True) -> Dict[str, Any] | List[Any]:
if len(self.children) > 0 and len(self.attributes) == 0:
children = []
for child in self.children:
children.append(child.to_dict())
if add_name:
return {self.name: children}
else:
return children
res: Dict[str, Any] = {}
if add_name:
res['Type'] = self.name
for k, v in self.attributes.items():
res[k] = v
children = []
child_keys: Dict[str, int] = {}
for child in self.children:
if child.name in child_keys:
child_keys[child.name] += 1
else:
child_keys[child.name] = 1
for child in self.children:
if child_keys[child.name] == 1 and child.name not in self.attributes:
res[child.name] = child.to_dict(add_name=False)
else:
children.append(child.to_dict())
if len(children) > 0:
res['children'] = children
return res
def to_json(self, out: TextIO, prefix: str = ''):
res = json.dumps(self.to_dict(), indent=(' ' if config.pretty_print else None))
for line in res.splitlines(False):
out.write('{}{}\n'.format(prefix, line))
def to_xml(self, out: TextIO, prefix: str = ''):
# minidom doesn't support omitting the xml declaration which is a problem for joshua
# However, our xml is very simple and therefore serializing manually is easy enough
attrs = []
print_width = 120
try:
print_width, _ = os.get_terminal_size()
except OSError:
pass
for k, v in self.attributes.items():
attrs.append('{}={}'.format(k, xml.sax.saxutils.quoteattr(v)))
elem = '{}<{}{}'.format(prefix, self.name, ('' if len(attrs) == 0 else ' '))
out.write(elem)
if config.pretty_print:
curr_line_len = len(elem)
for i in range(len(attrs)):
attr_len = len(attrs[i])
if i == 0 or attr_len + curr_line_len + 1 <= print_width:
if i != 0:
out.write(' ')
out.write(attrs[i])
curr_line_len += attr_len
else:
out.write('\n')
out.write(' ' * len(elem))
out.write(attrs[i])
curr_line_len = len(elem) + attr_len
else:
out.write(' '.join(attrs))
if len(self.children) == 0:
out.write('/>')
else:
out.write('>')
for child in self.children:
if config.pretty_print:
out.write('\n')
child.to_xml(out, prefix=(' {}'.format(prefix) if config.pretty_print else prefix))
if len(self.children) > 0:
out.write('{}{}</{}>'.format(('\n' if config.pretty_print else ''), prefix, self.name))
def dump(self, out: TextIO, prefix: str = '', new_line: bool = True):
if config.output_format == 'json':
self.to_json(out, prefix=prefix)
else:
self.to_xml(out, prefix=prefix)
if new_line:
out.write('\n')
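# SummaryTree is a tiny write-only DOM: attributes become XML attributes (or JSON keys)
# and children become nested elements, and config.output_format decides which form dump()
# emits. A minimal sketch of the XML form:
#
#     t = SummaryTree('Test')
#     t.attributes['Ok'] = '1'
#     t.dump(sys.stdout)        # -> <Test Ok="1"/>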
ParserCallback = Callable[[Dict[str, str]], Optional[str]]
class ParseHandler:
def __init__(self, out: SummaryTree):
self.out = out
self.events: OrderedDict[Optional[Tuple[str, Optional[str]]], List[ParserCallback]] = collections.OrderedDict()
def add_handler(self, attr: Tuple[str, Optional[str]], callback: ParserCallback) -> None:
self.events.setdefault(attr, []).append(callback)
def _call(self, callback: ParserCallback, attrs: Dict[str, str]) -> str | None:
try:
return callback(attrs)
except Exception as e:
_, _, exc_traceback = sys.exc_info()
child = SummaryTree('NonFatalParseError')
child.attributes['Severity'] = '30'
child.attributes['ErrorMessage'] = str(e)
child.attributes['Trace'] = repr(traceback.format_tb(exc_traceback))
self.out.append(child)
return None
def handle(self, attrs: Dict[str, str]):
if None in self.events:
for callback in self.events[None]:
self._call(callback, attrs)
for k, v in attrs.items():
if (k, None) in self.events:
for callback in self.events[(k, None)]:
remap = self._call(callback, attrs)
if remap is not None:
v = remap
attrs[k] = v
if (k, v) in self.events:
for callback in self.events[(k, v)]:
remap = self._call(callback, attrs)
if remap is not None:
v = remap
attrs[k] = v
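# Handlers are keyed by (attribute, value) pairs: the None key fires for every event,
# (k, None) fires for every event carrying attribute k, and (k, v) only when k has exactly
# the value v. A callback may return a string to remap the attribute value before the more
# specific handlers run, which is how event severities get rewritten by the
# RemapEventSeverity machinery in Summary.register_handlers below. A small sketch:
#
#     handler = ParseHandler(out)
#     handler.add_handler(('Type', 'ProgramStart'), lambda attrs: print(attrs['RandomSeed']))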
class Parser:
def parse(self, file: TextIO, handler: ParseHandler) -> None:
pass
class XmlParser(Parser, xml.sax.handler.ContentHandler):
def __init__(self):
super().__init__()
self.handler: ParseHandler | None = None
    def parse(self, file: TextIO, handler: ParseHandler) -> None:
        self.handler = handler
        xml.sax.parse(file, self)
def startElement(self, name, attrs) -> None:
attributes: Dict[str, str] = {}
        for attr_name in attrs.getNames():
            attributes[attr_name] = attrs.getValue(attr_name)
assert self.handler is not None
self.handler.handle(attributes)
class JsonParser(Parser):
def __init__(self):
super().__init__()
def parse(self, file: TextIO, handler: ParseHandler):
for line in file:
obj = json.loads(line)
handler.handle(obj)
class Coverage:
def __init__(self, file: str, line: str | int, comment: str | None = None):
self.file = file
self.line = int(line)
self.comment = comment
def to_tuple(self) -> Tuple[str, int, str | None]:
return self.file, self.line, self.comment
def __eq__(self, other) -> bool:
if isinstance(other, tuple) and len(other) == 3:
return self.to_tuple() == other
elif isinstance(other, Coverage):
return self.to_tuple() == other.to_tuple()
else:
return False
def __lt__(self, other) -> bool:
if isinstance(other, tuple) and len(other) == 3:
return self.to_tuple() < other
elif isinstance(other, Coverage):
return self.to_tuple() < other.to_tuple()
else:
return False
def __le__(self, other) -> bool:
if isinstance(other, tuple) and len(other) == 3:
return self.to_tuple() <= other
elif isinstance(other, Coverage):
return self.to_tuple() <= other.to_tuple()
else:
return False
def __gt__(self, other: Coverage) -> bool:
if isinstance(other, tuple) and len(other) == 3:
return self.to_tuple() > other
elif isinstance(other, Coverage):
return self.to_tuple() > other.to_tuple()
else:
return False
def __ge__(self, other):
if isinstance(other, tuple) and len(other) == 3:
return self.to_tuple() >= other
elif isinstance(other, Coverage):
return self.to_tuple() >= other.to_tuple()
else:
return False
def __hash__(self):
return hash((self.file, self.line, self.comment))
class TraceFiles:
def __init__(self, path: Path):
self.path: Path = path
self.timestamps: List[int] = []
self.runs: OrderedDict[int, List[Path]] = collections.OrderedDict()
trace_expr = re.compile(r'trace.*\.(json|xml)')
for file in self.path.iterdir():
if file.is_file() and trace_expr.match(file.name) is not None:
ts = int(file.name.split('.')[6])
if ts in self.runs:
self.runs[ts].append(file)
else:
self.timestamps.append(ts)
self.runs[ts] = [file]
self.timestamps.sort(reverse=True)
def __getitem__(self, idx: int) -> List[Path]:
res = self.runs[self.timestamps[idx]]
res.sort()
return res
def __len__(self) -> int:
return len(self.runs)
def items(self) -> Iterator[List[Path]]:
class TraceFilesIterator(Iterable[List[Path]]):
def __init__(self, trace_files: TraceFiles):
self.current = 0
self.trace_files: TraceFiles = trace_files
def __iter__(self):
return self
def __next__(self) -> List[Path]:
if len(self.trace_files) <= self.current:
raise StopIteration
self.current += 1
return self.trace_files[self.current - 1]
return TraceFilesIterator(self)
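# TraceFiles groups the trace files found in a run directory by the timestamp embedded in
# their names and orders the groups newest-first, so trace_files[0] is always the most
# recent run. The split('.')[6] above assumes the usual simulation trace naming, in which
# the seventh dot-separated field is the run's timestamp; files that do not match the
# trace.*.json/xml pattern are ignored entirely.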
class Summary:
def __init__(self, binary: Path, runtime: float = 0, max_rss: int | None = None,
was_killed: bool = False, uid: uuid.UUID | None = None, expected_unseed: int | None = None,
exit_code: int = 0, valgrind_out_file: Path | None = None, stats: str | None = None,
                 error_out: str | None = None, will_restart: bool = False):
self.binary = binary
self.runtime: float = runtime
self.max_rss: int | None = max_rss
self.was_killed: bool = was_killed
self.expected_unseed: int | None = expected_unseed
self.exit_code: int = exit_code
self.out: SummaryTree = SummaryTree('Test')
self.test_begin_found: bool = False
self.test_end_found: bool = False
self.unseed: int | None = None
self.valgrind_out_file: Path | None = valgrind_out_file
self.severity_map: OrderedDict[tuple[str, int], int] = collections.OrderedDict()
self.error: bool = False
self.errors: int = 0
self.warnings: int = 0
self.coverage: OrderedDict[Coverage, bool] = collections.OrderedDict()
self.test_count: int = 0
self.tests_passed: int = 0
self.error_out = error_out
self.stderr_severity: str = '40'
self.will_restart: bool = will_restart
self.test_dir: Path | None = None
if uid is not None:
self.out.attributes['TestUID'] = str(uid)
if stats is not None:
self.out.attributes['Statistics'] = stats
self.out.attributes['JoshuaSeed'] = str(config.joshua_seed)
self.out.attributes['WillRestart'] = '1' if self.will_restart else '0'
self.handler = ParseHandler(self.out)
self.register_handlers()
def summarize_files(self, trace_files: List[Path]):
assert len(trace_files) > 0
for f in trace_files:
self.parse_file(f)
self.done()
def summarize(self, trace_dir: Path, command: str):
self.test_dir = trace_dir
trace_files = TraceFiles(trace_dir)
if len(trace_files) == 0:
self.error = True
child = SummaryTree('NoTracesFound')
child.attributes['Severity'] = '40'
child.attributes['Path'] = str(trace_dir.absolute())
child.attributes['Command'] = command
self.out.append(child)
return
self.summarize_files(trace_files[0])
if config.joshua_dir is not None:
import test_harness.fdb
test_harness.fdb.write_coverage(config.cluster_file,
test_harness.fdb.str_to_tuple(config.joshua_dir) + ('coverage',),
test_harness.fdb.str_to_tuple(config.joshua_dir) + ('coverage-metadata',),
self.coverage)
def list_simfdb(self) -> SummaryTree:
res = SummaryTree('SimFDB')
res.attributes['TestDir'] = str(self.test_dir)
if self.test_dir is None:
return res
simfdb = self.test_dir / Path('simfdb')
if not simfdb.exists():
res.attributes['NoSimDir'] = "simfdb doesn't exist"
return res
elif not simfdb.is_dir():
res.attributes['NoSimDir'] = 'simfdb is not a directory'
return res
for file in simfdb.iterdir():
child = SummaryTree('Directory' if file.is_dir() else 'File')
child.attributes['Name'] = file.name
res.append(child)
return res
def ok(self):
return not self.error
def done(self):
if config.print_coverage:
for k, v in self.coverage.items():
child = SummaryTree('CodeCoverage')
child.attributes['File'] = k.file
child.attributes['Line'] = str(k.line)
if not v:
child.attributes['Covered'] = '0'
if k.comment is not None and len(k.comment):
child.attributes['Comment'] = k.comment
self.out.append(child)
if self.warnings > config.max_warnings:
child = SummaryTree('WarningLimitExceeded')
child.attributes['Severity'] = '30'
child.attributes['WarningCount'] = str(self.warnings)
self.out.append(child)
if self.errors > config.max_errors:
child = SummaryTree('ErrorLimitExceeded')
child.attributes['Severity'] = '40'
child.attributes['ErrorCount'] = str(self.errors)
self.out.append(child)
if self.was_killed:
child = SummaryTree('ExternalTimeout')
child.attributes['Severity'] = '40'
self.out.append(child)
self.error = True
if self.max_rss is not None:
self.out.attributes['PeakMemory'] = str(self.max_rss)
if self.valgrind_out_file is not None:
try:
valgrind_errors = parse_valgrind_output(self.valgrind_out_file)
for valgrind_error in valgrind_errors:
if valgrind_error.kind.startswith('Leak'):
continue
self.error = True
child = SummaryTree('ValgrindError')
child.attributes['Severity'] = '40'
child.attributes['What'] = valgrind_error.what.what
child.attributes['Backtrace'] = valgrind_error.what.backtrace
aux_count = 0
for aux in valgrind_error.aux:
child.attributes['WhatAux{}'.format(aux_count)] = aux.what
child.attributes['BacktraceAux{}'.format(aux_count)] = aux.backtrace
aux_count += 1
self.out.append(child)
except Exception as e:
self.error = True
child = SummaryTree('ValgrindParseError')
child.attributes['Severity'] = '40'
child.attributes['ErrorMessage'] = str(e)
_, _, exc_traceback = sys.exc_info()
child.attributes['Trace'] = repr(traceback.format_tb(exc_traceback))
self.out.append(child)
if not self.test_end_found:
child = SummaryTree('TestUnexpectedlyNotFinished')
child.attributes['Severity'] = '40'
self.out.append(child)
if self.error_out is not None and len(self.error_out) > 0:
lines = self.error_out.splitlines()
stderr_bytes = 0
for line in lines:
if line.endswith("WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!"):
# When running ASAN we expect to see this message. Boost coroutine should be using the correct asan annotations so that it shouldn't produce any false positives.
continue
if line.endswith("Warning: unimplemented fcntl command: 1036"):
# Valgrind produces this warning when F_SET_RW_HINT is used
continue
if self.stderr_severity == '40':
self.error = True
remaining_bytes = config.max_stderr_bytes - stderr_bytes
if remaining_bytes > 0:
out_err = line[0:remaining_bytes] + ('...' if len(line) > remaining_bytes else '')
child = SummaryTree('StdErrOutput')
child.attributes['Severity'] = self.stderr_severity
child.attributes['Output'] = out_err
self.out.append(child)
stderr_bytes += len(line)
if stderr_bytes > config.max_stderr_bytes:
child = SummaryTree('StdErrOutputTruncated')
child.attributes['Severity'] = self.stderr_severity
                child.attributes['BytesRemaining'] = str(stderr_bytes - config.max_stderr_bytes)
self.out.append(child)
self.out.attributes['Ok'] = '1' if self.ok() else '0'
if not self.ok():
reason = 'Unknown'
if self.error:
reason = 'ProducedErrors'
elif not self.test_end_found:
reason = 'TestDidNotFinish'
elif self.tests_passed == 0:
reason = 'NoTestsPassed'
elif self.test_count != self.tests_passed:
reason = 'Expected {} tests to pass, but only {} did'.format(self.test_count, self.tests_passed)
self.out.attributes['FailReason'] = reason
def parse_file(self, file: Path):
parser: Parser
if file.suffix == '.json':
parser = JsonParser()
elif file.suffix == '.xml':
parser = XmlParser()
else:
child = SummaryTree('TestHarnessBug')
child.attributes['File'] = __file__
frame = inspect.currentframe()
if frame is not None:
child.attributes['Line'] = str(inspect.getframeinfo(frame).lineno)
child.attributes['Details'] = 'Unexpected suffix {} for file {}'.format(file.suffix, file.name)
self.error = True
self.out.append(child)
return
with file.open('r') as f:
try:
parser.parse(f, self.handler)
except Exception as e:
child = SummaryTree('SummarizationError')
child.attributes['Severity'] = '40'
child.attributes['ErrorMessage'] = str(e)
self.out.append(child)
def register_handlers(self):
def remap_event_severity(attrs):
if 'Type' not in attrs or 'Severity' not in attrs:
return None
k = (attrs['Type'], int(attrs['Severity']))
if k in self.severity_map:
return str(self.severity_map[k])
self.handler.add_handler(('Severity', None), remap_event_severity)
def program_start(attrs: Dict[str, str]):
if self.test_begin_found:
return
self.test_begin_found = True
self.out.attributes['RandomSeed'] = attrs['RandomSeed']
self.out.attributes['SourceVersion'] = attrs['SourceVersion']
self.out.attributes['Time'] = attrs['ActualTime']
self.out.attributes['BuggifyEnabled'] = attrs['BuggifyEnabled']
self.out.attributes['DeterminismCheck'] = '0' if self.expected_unseed is None else '1'
if self.binary.name != 'fdbserver':
self.out.attributes['OldBinary'] = self.binary.name
if 'FaultInjectionEnabled' in attrs:
self.out.attributes['FaultInjectionEnabled'] = attrs['FaultInjectionEnabled']
self.handler.add_handler(('Type', 'ProgramStart'), program_start)
def set_test_file(attrs: Dict[str, str]):
test_file = Path(attrs['TestFile'])
cwd = Path('.').absolute()
try:
test_file = test_file.relative_to(cwd)
except ValueError:
pass
self.out.attributes['TestFile'] = str(test_file)
self.handler.add_handler(('Type', 'Simulation'), set_test_file)
self.handler.add_handler(('Type', 'NonSimulationTest'), set_test_file)
def set_elapsed_time(attrs: Dict[str, str]):
if self.test_end_found:
return
self.test_end_found = True
self.unseed = int(attrs['RandomUnseed'])
if self.expected_unseed is not None and self.unseed != self.expected_unseed:
severity = 40 if ('UnseedMismatch', 40) not in self.severity_map \
else self.severity_map[('UnseedMismatch', 40)]
if severity >= 30:
child = SummaryTree('UnseedMismatch')
child.attributes['Unseed'] = str(self.unseed)
child.attributes['ExpectedUnseed'] = str(self.expected_unseed)
child.attributes['Severity'] = str(severity)
if severity >= 40:
self.error = True
self.out.append(child)
self.out.attributes['SimElapsedTime'] = attrs['SimTime']
self.out.attributes['RealElapsedTime'] = attrs['RealTime']
if self.unseed is not None:
self.out.attributes['RandomUnseed'] = str(self.unseed)
self.handler.add_handler(('Type', 'ElapsedTime'), set_elapsed_time)
def parse_warning(attrs: Dict[str, str]):
self.warnings += 1
if self.warnings > config.max_warnings:
return
child = SummaryTree(attrs['Type'])
for k, v in attrs.items():
if k != 'Type':
child.attributes[k] = v
self.out.append(child)
self.handler.add_handler(('Severity', '30'), parse_warning)
def parse_error(attrs: Dict[str, str]):
self.errors += 1
self.error = True
if self.errors > config.max_errors:
return
child = SummaryTree(attrs['Type'])
for k, v in attrs.items():
child.attributes[k] = v
self.out.append(child)
self.handler.add_handler(('Severity', '40'), parse_error)
def coverage(attrs: Dict[str, str]):
covered = True
if 'Covered' in attrs:
covered = int(attrs['Covered']) != 0
comment = ''
if 'Comment' in attrs:
comment = attrs['Comment']
c = Coverage(attrs['File'], attrs['Line'], comment)
if covered or c not in self.coverage:
self.coverage[c] = covered
self.handler.add_handler(('Type', 'CodeCoverage'), coverage)
def expected_test_pass(attrs: Dict[str, str]):
self.test_count = int(attrs['Count'])
self.handler.add_handler(('Type', 'TestsExpectedToPass'), expected_test_pass)
def test_passed(attrs: Dict[str, str]):
if attrs['Passed'] == '1':
self.tests_passed += 1
self.handler.add_handler(('Type', 'TestResults'), test_passed)
def remap_event_severity(attrs: Dict[str, str]):
self.severity_map[(attrs['TargetEvent'], int(attrs['OriginalSeverity']))] = int(attrs['NewSeverity'])
self.handler.add_handler(('Type', 'RemapEventSeverity'), remap_event_severity)
def buggify_section(attrs: Dict[str, str]):
if attrs['Type'] == 'FaultInjected' or attrs.get('Activated', '0') == '1':
child = SummaryTree(attrs['Type'])
child.attributes['File'] = attrs['File']
child.attributes['Line'] = attrs['Line']
self.out.append(child)
self.handler.add_handler(('Type', 'BuggifySection'), buggify_section)
self.handler.add_handler(('Type', 'FaultInjected'), buggify_section)
def running_unit_test(attrs: Dict[str, str]):
child = SummaryTree('RunningUnitTest')
child.attributes['Name'] = attrs['Name']
child.attributes['File'] = attrs['File']
            child.attributes['Line'] = attrs['Line']
            self.out.append(child)
self.handler.add_handler(('Type', 'RunningUnitTest'), running_unit_test)
def stderr_severity(attrs: Dict[str, str]):
if 'NewSeverity' in attrs:
self.stderr_severity = attrs['NewSeverity']
self.handler.add_handler(('Type', 'StderrSeverity'), stderr_severity)
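# Typical use mirrors TestRun.run() in the runner module: construct a Summary for the
# binary that was executed, let it walk the trace directory, and render the result. A
# minimal sketch, assuming 'temp_path' holds the run directory and 'cmd' the command line:
#
#     summary = Summary(Path('bin/fdbserver'), runtime=12.3, was_killed=False)
#     summary.summarize(temp_path, cmd)
#     summary.out.dump(sys.stdout)
#     success = summary.ok()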

View File

@@ -0,0 +1,16 @@
import sys
from test_harness.valgrind import parse_valgrind_output
from pathlib import Path
if __name__ == '__main__':
errors = parse_valgrind_output(Path(sys.argv[1]))
for valgrind_error in errors:
print('ValgrindError: what={}, kind={}'.format(valgrind_error.what.what, valgrind_error.kind))
print('Backtrace: {}'.format(valgrind_error.what.backtrace))
        counter = 0
        for aux in valgrind_error.aux:
            print('Aux {}:'.format(counter))
            print('  What: {}'.format(aux.what))
            print('  Backtrace: {}'.format(aux.backtrace))
            counter += 1

View File

@@ -0,0 +1,60 @@
import argparse
import re
import sys
from pathlib import Path
from test_harness.config import config
from test_harness.summarize import Summary, TraceFiles
from typing import Pattern, List
def files_matching(path: Path, pattern: Pattern, recurse: bool = True) -> List[Path]:
res: List[Path] = []
for file in path.iterdir():
if file.is_file() and pattern.match(file.name) is not None:
res.append(file)
elif file.is_dir() and recurse:
res += files_matching(file, pattern, recurse)
return res
def dirs_with_files_matching(path: Path, pattern: Pattern, recurse: bool = True) -> List[Path]:
res: List[Path] = []
sub_directories: List[Path] = []
has_file = False
for file in path.iterdir():
if file.is_file() and pattern.match(file.name) is not None:
has_file = True
elif file.is_dir() and recurse:
sub_directories.append(file)
if has_file:
res.append(path)
if recurse:
for file in sub_directories:
res += dirs_with_files_matching(file, pattern, recurse=True)
res.sort()
return res
if __name__ == '__main__':
parser = argparse.ArgumentParser('TestHarness Timeout', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
config.build_arguments(parser)
args = parser.parse_args()
config.extract_args(args)
valgrind_files: List[Path] = []
if config.use_valgrind:
valgrind_files = files_matching(Path.cwd(), re.compile(r'valgrind.*\.xml'))
for directory in dirs_with_files_matching(Path.cwd(), re.compile(r'trace.*\.(json|xml)'), recurse=True):
trace_files = TraceFiles(directory)
for files in trace_files.items():
if config.use_valgrind:
for valgrind_file in valgrind_files:
summary = Summary(Path('bin/fdbserver'), was_killed=True)
summary.valgrind_out_file = valgrind_file
summary.summarize_files(files)
summary.out.dump(sys.stdout)
else:
summary = Summary(Path('bin/fdbserver'), was_killed=True)
summary.summarize_files(files)
summary.out.dump(sys.stdout)

View File

@@ -0,0 +1,141 @@
import enum
import xml
import xml.sax.handler
from pathlib import Path
from typing import List
class ValgrindWhat:
def __init__(self):
self.what: str = ''
self.backtrace: str = ''
class ValgrindError:
def __init__(self):
self.what: ValgrindWhat = ValgrindWhat()
self.kind: str = ''
self.aux: List[ValgrindWhat] = []
# noinspection PyArgumentList
class ValgrindParseState(enum.Enum):
ROOT = enum.auto()
ERROR = enum.auto()
ERROR_AUX = enum.auto()
KIND = enum.auto()
WHAT = enum.auto()
TRACE = enum.auto()
AUX_WHAT = enum.auto()
STACK = enum.auto()
STACK_AUX = enum.auto()
STACK_IP = enum.auto()
STACK_IP_AUX = enum.auto()
class ValgrindHandler(xml.sax.handler.ContentHandler):
def __init__(self):
super().__init__()
self.stack: List[ValgrindError] = []
self.result: List[ValgrindError] = []
self.state_stack: List[ValgrindParseState] = []
def state(self) -> ValgrindParseState:
if len(self.state_stack) == 0:
return ValgrindParseState.ROOT
return self.state_stack[-1]
@staticmethod
def from_content(content):
# pdb.set_trace()
if isinstance(content, bytes):
return content.decode()
assert isinstance(content, str)
return content
def characters(self, content):
# pdb.set_trace()
state = self.state()
if len(self.state_stack) == 0:
return
else:
assert len(self.stack) > 0
if state is ValgrindParseState.KIND:
self.stack[-1].kind += self.from_content(content)
elif state is ValgrindParseState.WHAT:
self.stack[-1].what.what += self.from_content(content)
elif state is ValgrindParseState.AUX_WHAT:
self.stack[-1].aux[-1].what += self.from_content(content)
elif state is ValgrindParseState.STACK_IP:
self.stack[-1].what.backtrace += self.from_content(content)
elif state is ValgrindParseState.STACK_IP_AUX:
self.stack[-1].aux[-1].backtrace += self.from_content(content)
def startElement(self, name, attrs):
# pdb.set_trace()
if name == 'error':
self.stack.append(ValgrindError())
self.state_stack.append(ValgrindParseState.ERROR)
if len(self.stack) == 0:
return
if name == 'kind':
self.state_stack.append(ValgrindParseState.KIND)
elif name == 'what':
self.state_stack.append(ValgrindParseState.WHAT)
elif name == 'auxwhat':
assert self.state() in [ValgrindParseState.ERROR, ValgrindParseState.ERROR_AUX]
self.state_stack.pop()
self.state_stack.append(ValgrindParseState.ERROR_AUX)
self.state_stack.append(ValgrindParseState.AUX_WHAT)
self.stack[-1].aux.append(ValgrindWhat())
elif name == 'stack':
state = self.state()
assert state in [ValgrindParseState.ERROR, ValgrindParseState.ERROR_AUX]
if state == ValgrindParseState.ERROR:
self.state_stack.append(ValgrindParseState.STACK)
else:
self.state_stack.append(ValgrindParseState.STACK_AUX)
elif name == 'ip':
state = self.state()
assert state in [ValgrindParseState.STACK, ValgrindParseState.STACK_AUX]
if state == ValgrindParseState.STACK:
self.state_stack.append(ValgrindParseState.STACK_IP)
if len(self.stack[-1].what.backtrace) == 0:
self.stack[-1].what.backtrace = 'addr2line -e fdbserver.debug -p -C -f -i '
else:
self.stack[-1].what.backtrace += ' '
else:
self.state_stack.append(ValgrindParseState.STACK_IP_AUX)
if len(self.stack[-1].aux[-1].backtrace) == 0:
self.stack[-1].aux[-1].backtrace = 'addr2line -e fdbserver.debug -p -C -f -i '
else:
self.stack[-1].aux[-1].backtrace += ' '
def endElement(self, name):
# pdb.set_trace()
if name == 'error':
self.result.append(self.stack.pop())
self.state_stack.pop()
elif name == 'kind':
assert self.state() == ValgrindParseState.KIND
self.state_stack.pop()
elif name == 'what':
assert self.state() == ValgrindParseState.WHAT
self.state_stack.pop()
elif name == 'auxwhat':
assert self.state() == ValgrindParseState.AUX_WHAT
self.state_stack.pop()
elif name == 'stack':
assert self.state() in [ValgrindParseState.STACK, ValgrindParseState.STACK_AUX]
self.state_stack.pop()
elif name == 'ip':
self.state_stack.pop()
state = self.state()
assert state in [ValgrindParseState.STACK, ValgrindParseState.STACK_AUX]
def parse_valgrind_output(valgrind_out_file: Path) -> List[ValgrindError]:
handler = ValgrindHandler()
with valgrind_out_file.open('r') as f:
xml.sax.parse(f, handler)
return handler.result
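# The parser does not symbolize backtraces itself: it concatenates the instruction
# pointers of each <stack> into a ready-to-run 'addr2line -e fdbserver.debug -p -C -f -i'
# command line, so resolving a ValgrindError is a copy-paste away (assuming a matching
# fdbserver.debug file is available in the working directory).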

View File

@@ -0,0 +1,66 @@
from functools import total_ordering
from pathlib import Path
from typing import Tuple
@total_ordering
class Version:
def __init__(self):
self.major: int = 0
self.minor: int = 0
self.patch: int = 0
def version_tuple(self):
return self.major, self.minor, self.patch
def _compare(self, other) -> int:
lhs: Tuple[int, int, int] = self.version_tuple()
rhs: Tuple[int, int, int]
if isinstance(other, Version):
rhs = other.version_tuple()
else:
rhs = Version.parse(str(other)).version_tuple()
if lhs < rhs:
return -1
elif lhs > rhs:
return 1
else:
return 0
def __eq__(self, other) -> bool:
return self._compare(other) == 0
def __lt__(self, other) -> bool:
return self._compare(other) < 0
def __hash__(self):
return hash(self.version_tuple())
def __str__(self):
        return '{}.{}.{}'.format(self.major, self.minor, self.patch)
@staticmethod
def of_binary(binary: Path):
parts = binary.name.split('-')
if len(parts) != 2:
return Version.max_version()
return Version.parse(parts[1])
@staticmethod
def parse(version: str):
version_tuple = version.split('.')
self = Version()
self.major = int(version_tuple[0])
if len(version_tuple) > 1:
self.minor = int(version_tuple[1])
if len(version_tuple) > 2:
self.patch = int(version_tuple[2])
return self
@staticmethod
def max_version():
self = Version()
self.major = 2**32 - 1
self.minor = 2**32 - 1
self.patch = 2**32 - 1
return self
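# Versions compare by (major, minor, patch), and the right-hand side of a comparison may
# be a plain string, which is what allows checks such as Version.of_binary(binary) < "6.1.0"
# in the runner. A couple of illustrative cases:
#
#     Version.parse('7.1.0') > '6.3.24'               # True
#     Version.of_binary(Path('fdbserver-6.3.24'))     # -> 6.3.24
#     Version.of_binary(Path('fdbserver'))            # -> max_version(), i.e. treated as current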

View File

@@ -0,0 +1,431 @@
<form theme="light">
<label>FoundationDB - Details</label>
<description>Details for FoundationDB Cluster</description>
<fieldset submitButton="false">
<input type="text" token="Index" searchWhenChanged="true">
<label>Index</label>
<default>*</default>
</input>
<input type="text" token="LogGroup" searchWhenChanged="true">
<label>LogGroup</label>
<default>*</default>
</input>
<input type="time" token="TimeRange" searchWhenChanged="true">
<label>Time Range</label>
<default>
<earliest>-60m@m</earliest>
<latest>now</latest>
</default>
</input>
<input type="dropdown" token="Span" searchWhenChanged="true">
<label>Timechart Resolution</label>
<choice value="bins=100">Default</choice>
<choice value="span=5s">5 seconds</choice>
<choice value="span=1m">1 minute</choice>
<choice value="span=10m">10 minutes</choice>
<choice value="span=1h">1 hour</choice>
<choice value="span=1d">1 day</choice>
<default>bins=100</default>
<initialValue>bins=100</initialValue>
</input>
<input type="dropdown" token="Roles" searchWhenChanged="true">
<label>Roles</label>
<choice value="">All</choice>
<choice value="Roles=*SS*">Storage Server</choice>
<choice value="Roles=*TL*">Transaction Log</choice>
<choice value="Roles=*MP*">Proxy</choice>
<choice value="Roles=*RV*">Resolver</choice>
<choice value="Roles=*MS*">Master</choice>
<choice value="Roles=*CC*">Cluster Controller</choice>
<choice value="Roles=*LR*">Log Router</choice>
<choice value="Roles=*DD*">Data Distributor</choice>
<choice value="Roles=*RK*">Ratekeeper</choice>
<choice value="Roles=*TS*">Tester</choice>
<default></default>
</input>
<input type="text" token="Host" searchWhenChanged="true">
<label>Host</label>
<default>*</default>
</input>
<input type="text" token="Machine" searchWhenChanged="true">
<label>Machine</label>
<default>*</default>
</input>
</fieldset>
<row>
<panel>
<chart>
<title>Storage Queue Size</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | rex field=BytesInput "(?&lt;InputRate&gt;.*) (?&lt;InputRoughness&gt;.*) (?&lt;InputCounter&gt;.*)" | rex field=BytesDurable "(?&lt;DurableRate&gt;.*) (?&lt;DurableRoughness&gt;.*) (?&lt;DurableCounter&gt;.*)" | eval QueueSize=InputCounter-DurableCounter | timechart $Span$ avg(QueueSize) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Storage Input Rate</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | rex field=BytesInput "(?&lt;InputRate&gt;.*) (?&lt;InputRoughness&gt;.*) (?&lt;InputCounter&gt;.*)" | timechart $Span$ avg(InputRate) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Storage Bytes Queried</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | rex field=BytesQueried "(?&lt;Rate&gt;.*) (?&lt;Roughness&gt;.*) (?&lt;Counter&gt;.*)" | timechart $Span$ avg(Rate) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<chart>
<title>Average Process CPU by Role (capped at 2; beware kernel bug)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | eval Cpu=CPUSeconds/Elapsed | timechart $Span$ avg(Cpu) by Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.axisY.maximumNumber">2</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Max Process CPU by Role (capped at 2; beware kernel bug)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | eval Cpu=CPUSeconds/Elapsed | timechart $Span$ max(Cpu) by Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.axisY.maximumNumber">2</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Disk Busyness</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ Type=ProcessMetrics TrackLatestType=Original | eval DiskBusyPercentage=(Elapsed-DiskIdleSeconds)/Elapsed | timechart $Span$ avg(DiskBusyPercentage) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<chart>
<title>Max Run Loop Busyness by Role (for &lt;=6.1, S2Pri1)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ Type=NetworkMetrics NOT TrackLatestType=Rolled | eval Busyness=if(isnull(PriorityStarvedBelow1), if(isnull(PriorityBusy1), S2Pri1, PriorityBusy1/Elapsed), PriorityStarvedBelow1/Elapsed) | timechart $Span$ max(Busyness) by Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Max Run Loop Busyness by Priority (6.2+ only)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ Type=NetworkMetrics TrackLatestType=Original | foreach PriorityBusy* [eval Busyness&lt;&lt;MATCHSTR&gt;&gt;=PriorityBusy&lt;&lt;MATCHSTR&gt;&gt;/Elapsed] | timechart $Span$ max(Busyness*)</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>TLog Queue Size</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=TLogMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | eval QueueSize=SharedBytesInput-SharedBytesDurable | timechart $Span$ avg(QueueSize) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<chart>
<title>Connection Timeouts (counted on both sides of connection)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type=ConnectionTimeout OR Type=ConnectionTimedOut) $Roles$ host=$Host$ | eval WithAddr=if(Type=="ConnectionTimedOut", PeerAddr, WithAddr) | rex field=WithAddr "(?&lt;OtherAddr&gt;[^:]*:[^:]*).*" | eval Machine=Machine+","+OtherAddr | makemv delim="," Machine | search Machine=$Machine$ | eval Count=1+SuppressedEventCount | timechart sum(Count) by Machine useother=f</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.chart.nullValueMode">zero</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Pairwise Connection Timeouts Between Datacenters</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type=ConnectionTimeout OR Type=ConnectionTimedOut) host=* Machine=* NOT TrackLatestType=Rolled
| eval WithAddr=if(Type=="ConnectionTimedOut", PeerAddr, WithAddr)
| rex field=host "(?&lt;Datacenter&gt;..).*"
| eval Datacenter=if(isnotnull(pie_work_unit), pie_work_unit, Datacenter)
| rex field=WithAddr "(?&lt;OtherIP&gt;[^:]*):.*"
| join OtherIP
[search index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics NOT TrackLatestType=Rolled
| rex field=Machine "(?&lt;OtherIP&gt;[^:]*):.*"
| rex field=host "(?&lt;OtherDatacenter&gt;..).*"
| eval OtherDatacenter=if(isnotnull(pie_work_unit), pie_work_unit, OtherDatacenter)]
| eval DC1=if(Datacenter&gt;OtherDatacenter, Datacenter, OtherDatacenter), DC2=if(Datacenter&gt;OtherDatacenter, OtherDatacenter, Datacenter)
| eval Connection=DC1+" &lt;-&gt; " + DC2
| eval Count=1+SuppressedEventCount
| timechart count by Connection</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<table>
<title>Pairwise Connection Timeouts Between Known Server Processes (Sorted by Count, descending)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type=ConnectionTimeout OR Type=ConnectionTimedOut OR Type=ProcessMetrics) $Roles$ host=$Host$ Machine=$Machine$ NOT TrackLatestType=Rolled | eval WithAddr=if(Type=="ConnectionTimedOut", PeerAddr, WithAddr), Reason=if(Type=="ConnectionTimedOut", "Timed out trying to connect", "Established connection timed out") | rex field=Machine "(?&lt;IP&gt;[^:]*):.*" | rex field=host "(?&lt;Datacenter&gt;..).*" | rex field=WithAddr "(?&lt;OtherIP&gt;[^:]*):.*" | eventstats values(Roles) as Roles by IP | join OtherIP [search index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics NOT TrackLatestType=Rolled | rex field=Machine "(?&lt;OtherIP&gt;[^:]*):.*" | rex field=host "(?&lt;OtherDatacenter&gt;..).*" | stats values(Roles) as OtherRoles by OtherIP, OtherDatacenter | eval OtherRoles="("+mvjoin(OtherRoles,",")+")"] | eval Roles="("+mvjoin(Roles,",")+")" | eval IP=Datacenter+": "+IP+" "+Roles, OtherIP=OtherDatacenter+": "+OtherIP+" "+OtherRoles | eval Addr1=if(IP&gt;OtherIP, IP, OtherIP), Addr2=if(IP&gt;OtherIP, OtherIP, IP) | eval Connection=Addr1+" &lt;-&gt; " + Addr2 | eval Count=1+SuppressedEventCount | stats sum(Count) as Count, values(Reason) as Reasons by Connection | sort -Count</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<chart>
<title>Lazy Deletion Rate (making space available for reuse)</title>
<search>
          <query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ Type=SpringCleaningMetrics | eval Metric=LazyDeletePages | streamstats current=f global=f window=1 first(Metric) as NextMetric, first(Time) as NextTime by ID | eval Rate=4096*(NextMetric-Metric)/(NextTime-Time) | timechart $Span$ avg(Rate) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Vacuuming Rate (shrinking file)</title>
<search>
          <query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ Type=SpringCleaningMetrics | eval Metric=VacuumedPages | streamstats current=f global=f window=1 first(Metric) as NextMetric, first(Time) as NextTime by ID | eval Rate=4096*(NextMetric-Metric)/(NextTime-Time) | timechart $Span$ avg(Rate) by Machine</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Roles</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ NOT TrackLatestType=Rolled | makemv delim="," Roles | mvexpand Roles | timechart $Span$ distinct_count(Machine) by Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<table>
<title>Slow Tasks (Sorted by Duration, Descending)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=SlowTask $Roles$ host=$Host$ Machine=$Machine$ | sort -Duration | table _time, Duration, Machine, TaskID, Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<table>
<title>Event Counts (Sorted by Severity and Count, Descending)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ $Roles$ host=$Host$ Machine=$Machine$ NOT TrackLatestType=Rolled | stats count as Count by Type, Severity | sort -Severity, -Count</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<table>
<title>Errors</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Severity=40 $Roles$ host=$Host$ Machine=$Machine$ NOT TrackLatestType=Rolled | table _time, Type, Machine, Roles</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<table>
<title>Recoveries (Ignores Filters)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=MasterRecoveryState TrackLatestType=Original (StatusCode=0 OR StatusCode=11) | eval RecoveryResetInterval=10 | sort _time | streamstats earliest(_time) as RecoveryStart, count as EventCount reset_after="(StatusCode=11)" | where StatusCode=11 | eval EventCount=if(EventCount==1, 2, EventCount), RecoveryStart=if(RecoveryStart==_time, _time-RecoveryDuration, RecoveryStart) | sort -_time | streamstats current=f global=f window=1 first(RecoveryStart) as NextRecoveryStart | eval RecoverySpan=NextRecoveryStart-_time, FailedRecoveries=EventCount-2, SuccessfulRecoveries=1 | eval AvailableSeconds=if(RecoverySpan&lt;RecoveryResetInterval, RecoverySpan, 0) | sort _time | streamstats earliest(RecoveryStart) as RecoveryStart, sum(FailedRecoveries) as FailedRecoveryCount, sum(SuccessfulRecoveries) as SuccessfulRecoveryCount, sum(AvailableSeconds) as AvailableSeconds reset_after="(NOT RecoverySpan &lt; RecoveryResetInterval)" | where NOT RecoverySpan &lt; RecoveryResetInterval | eval Duration=_time-RecoveryStart, StartTime=strftime(RecoveryStart, "%F %X.%Q"), ShortLivedRecoveryCount=SuccessfulRecoveryCount-1 | table StartTime, Duration, FailedRecoveryCount, ShortLivedRecoveryCount, AvailableSeconds | sort -StartTime</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<table>
<title>Process (Re)starts</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProgramStart TrackLatestType=Original $Roles$ host=$Host$ Machine=$Machine$ | table _time, Machine | sort -_time</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<chart>
<title>Failure Detection (Machine Filter Only)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=FailureDetectionStatus System=$Machine$ | sort _time | eval Failed=if(Status=="Failed", 1, 0) | streamstats current=t global=f window=2 first(Failed) as PrevFailed by System | where PrevFailed=1 OR Failed=1 | eval Failed=PrevFailed + "," + Failed | makemv delim="," Failed | mvexpand Failed | timechart $Span$ max(Failed) by System</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.axisY.maximumNumber">1</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<table>
<title>Storage Server Space Usage (Sorted by Available Space Percentage, Ascending)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | eval AvailableSpacePercent=KvstoreBytesAvailable/KvstoreBytesTotal, FreeSpacePercent=KvstoreBytesFree/KvstoreBytesTotal, GBUsed=KvstoreBytesUsed/1e9, GBStored=BytesStored/1e9, Overhead=KvstoreBytesUsed/BytesStored, GBTotalSpace=KvstoreBytesTotal/1e9 | stats latest(AvailableSpacePercent) as AvailableSpacePercent, latest(FreeSpacePercent) as FreeSpacePercent, latest(GBStored) as GBStored, latest(GBUsed) as GBUsed, latest(Overhead) as OverheadFactor, latest(GBTotalSpace) as GBTotalSpace by Machine | sort AvailableSpacePercent</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<table>
<title>TLog Server Space Usage (Sorted by Available Space Percentage, Ascending)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=TLogMetrics host=* Machine=* TrackLatestType=Original Roles=TL | eval AvailableSpacePercent=KvstoreBytesAvailable/KvstoreBytesTotal, FreeDiskSpacePercent=KvstoreBytesFree/KvstoreBytesTotal, GBUsed=KvstoreBytesUsed/1e9, GBTotalSpace=KvstoreBytesTotal/1e9 | stats latest(AvailableSpacePercent) as AvailableSpacePercent, latest(FreeDiskSpacePercent) as FreeDiskSpacePercent, latest(GBUsed) as GBUsed, latest(GBTotalSpace) as GBTotalSpace by Machine | sort AvailableSpacePercent</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<chart>
<title>Data Movement by Type (Log Scale, Ignores Filters)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=MovingData TrackLatestType=Original | timechart avg(Priority*) as *</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<chart>
<title>Storage Server Max Bytes Stored by Host</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics $Roles$ host=$Host$ Machine=$Machine$ TrackLatestType=Original | eval GBStored=BytesStored/1e9 | timechart max(GBStored) by host limit=100</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<table>
<title>Master Failed Clients</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=WaitFailureClient
| stats count by FailedEndpoint</query>
<earliest>$TimeRange.earliest$</earliest>
<latest>$TimeRange.latest$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
</form>

View File

@@ -0,0 +1,323 @@
<form theme="dark">
<label>FoundationDB - Performance Overview (Dev WiP)</label>
<fieldset submitButton="false" autoRun="true">
<input type="text" token="Index" searchWhenChanged="true">
<label>Index</label>
<default>*</default>
</input>
<input type="text" token="LogGroup" searchWhenChanged="true">
<label>LogGroup</label>
<default></default>
</input>
<input type="time" token="TimeSpan" searchWhenChanged="true">
<label>TimeSpan</label>
<default>
<earliest>-60m@m</earliest>
<latest>now</latest>
</default>
</input>
<input type="dropdown" token="UpdateRateTypeToken" searchWhenChanged="true">
<label>RK: Normal or Batch Txn</label>
<choice value="">Normal</choice>
<choice value="Batch">Batch</choice>
<default></default>
</input>
<input type="text" token="ChartBinSizeToken" searchWhenChanged="true">
<label>Chart Bin Size</label>
<default>60s</default>
</input>
</fieldset>
<row>
<panel>
<title>Transaction Rate measured on Proxies</title>
<chart>
<title>Sum per $ChartBinSizeToken$ bin</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ host=* Machine=* (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original"
| makemv delim=" " TxnRequestIn | makemv delim=" " TxnRequestOut | makemv delim=" " TxnStartIn | makemv delim=" " TxnStartOut | makemv delim=" " TxnThrottled
| eval TxnRequestInRate=mvindex(TxnRequestIn, 0), TxnRequestOutRate=mvindex(TxnRequestOut, 0), TxnStartInRate=mvindex(TxnStartIn, 0), TxnStartOutRate=mvindex(TxnStartOut, 0), TxnThrottledRate=mvindex(TxnThrottled, 0)
| timechart span=$ChartBinSizeToken$ sum(TxnRequestInRate) as StartedTxnBatchRate, sum(TxnRequestOutRate) as FinishedTxnBatchRate, sum(TxnStartInRate) as StartedTxnRate, sum(TxnStartOutRate) as FinishedTxnRate, sum(TxnThrottledRate) as ThrottledTxnRate</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Read Rate measured on Storage Servers</title>
<chart>
<title>Average per $ChartBinSizeToken$ bin</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics TrackLatestType="Original"
| rex field=BytesQueried "(?&lt;RRate&gt;.*) (?&lt;RRoughness&gt;.*) (?&lt;RCounter&gt;.*)"
| rex field=RowsQueried "(?&lt;KRate&gt;.*) (?&lt;KRoughness&gt;.*) (?&lt;KCounter&gt;.*)"
| rex field=BytesInput "(?&lt;WRate&gt;.*) (?&lt;WRoughness&gt;.*) (?&lt;WCounter&gt;.*)"
| rex field=BytesFetched "(?&lt;FRate&gt;.*) (?&lt;FRoughness&gt;.*) (?&lt;FCounter&gt;.*)"
| timechart span=$ChartBinSizeToken$ avg(RRate) as BytesReadPerSecond, avg(KRate) as RowsReadPerSecond, avg(FRate) as DDReadPerSecond</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
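<!-- Note on the query above: FDB trace-event counters such as BytesQueried are typically emitted as a space-separated "rate roughness total" triple; the rex captures the first element (the per-second rate), so the chart plots rates rather than cumulative totals. -->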
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Write Rate measured on Proxies</title>
<chart>
<title>Average per $ChartBinSizeToken$ bin</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ host=* Machine=* (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original"
| makemv delim=" " MutationBytes
| makemv delim=" " Mutations
| eval MutationBytesRate=mvindex(MutationBytes, 0), MutationsRate=mvindex(Mutations,0)
| bucket span=5s _time
| stats sum(MutationBytesRate) as MutationBytes, sum(MutationsRate) as Mutations by _time
|eval MutationMB=MutationBytes/1024/1024, MutationsK=Mutations/1000
| timechart span=$ChartBinSizeToken$ avg(MutationMB) as MutationMB, avg(MutationsK) as MutationsK</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
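<!-- Note on the query above: MutationBytes and Mutations are counter triples ("rate roughness total"); mvindex(..., 0) selects the per-second rate, which is summed across proxies in 5s buckets and then averaged per chart bin. -->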
<option name="charting.axisY.abbreviation">none</option>
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.layout.splitSeries">0</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Write Rate measured on Storage Servers</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics TrackLatestType="Original"
| rex field=BytesInput "(?&lt;WRate&gt;.*) (?&lt;WRoughness&gt;.*) (?&lt;WCounter&gt;.*)"
| rex field=BytesFetched "(?&lt;FRate&gt;.*) (?&lt;FRoughness&gt;.*) (?&lt;FCounter&gt;.*)"
| timechart span=$ChartBinSizeToken$ avg(WRate) as BytesPerSecond, avg(FRate) as DDBytesWrittenPerSecond</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>GRV Latency measured on all Proxies</title>
<chart>
<title>Seconds</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=GRVLatencyMetrics AND TrackLatestType="Original"
| timechart span=$ChartBinSizeToken$ avg(Max) as maxLatency, avg(Mean) as meanLatency, avg(P99) as P99Latency, avg(P99.9) as P999Latency, avg(P95) as P95Latency</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Commit Latency measured on all Proxies</title>
<chart>
<title>Seconds</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=CommitLatencyMetrics AND TrackLatestType="Original"
| timechart span=$ChartBinSizeToken$ avg(Max) as maxLatency, avg(Mean) as meanLatency, avg(P99) as P99Latency, avg(P99.9) as P999Latency, avg(P95) as P95Latency</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Read Latency measured on all Storage Servers</title>
<chart>
<title>Seconds</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ReadLatencyMetrics AND TrackLatestType="Original"
| timechart span=$ChartBinSizeToken$ avg(Max) as maxLatency, avg(Mean) as meanLatency, avg(P99) as P99Latency, avg(P99.9) as P999Latency, avg(P95) as P95Latency</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>RateKeeper: ReleasedTPS vs TPSLimit</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| eval _time=Time
| table _time ReleasedTPS TPSLimit
| timechart span=$ChartBinSizeToken$ avg(ReleasedTPS) avg(TPSLimit)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
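<!-- Note on the query above: ReleasedTPS/TPSLimit may appear as the string "inf"; replacing it with a large numeric sentinel (1e11) keeps the series plottable on the log-scale Y axis. -->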
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">251</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>RateKeeper: Throttling Reason</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| eval _time=Time
| table _time Reason</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisLabelsY.majorUnit">1</option>
<option name="charting.axisY.abbreviation">none</option>
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">area</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.mode">standard</option>
<option name="height">249</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>RateKeeper: Throttling Server</title>
<table>
<title>Ratekeeper: Limit Reason: ReasonServerID (Most recent 10 records)</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate AND TrackLatestType="Original"
| streamstats count as numOfEvents
| where numOfEvents &lt;= 10
| eval DateTime=strftime(Time, "%Y-%m-%dT%H:%M:%S")
| table DateTime, ReasonServerID</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Disk Overhead = Disk Usage / Logical KV Size</title>
<chart>
<title>Y-axis is capped at 10</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ host=* Machine=* (Type=StorageMetrics OR Type=DDTrackerStats) TrackLatestType=Original
| bucket _time span=5s
| stats sum(KvstoreBytesUsed) as StorageDiskUsedBytes, sum(KvstoreBytesTotal) as StorageDiskTotalBytes, avg(TotalSizeBytes) as LogicalKVBytes by _time
| eval overhead=StorageDiskUsedBytes/LogicalKVBytes
| timechart avg(overhead)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
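<!-- Note on the query above: overhead is the summed storage-engine bytes used (KvstoreBytesUsed across storage servers) divided by the logical key-value size reported by data distribution (TotalSizeBytes); values above 1 reflect replication plus storage-engine overhead (interpretation; only the ratio itself comes from the query). -->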
<option name="charting.axisY.maximumNumber">10</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
</chart>
</panel>
<panel>
<title>KV Data Size</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Roles=*DD* host=* Machine=* Type=DDTrackerStats TrackLatestType=Original
| eval TotalKVGB=TotalSizeBytes/1024/1024/1024, SystemKVGB=SystemSizeBytes/1024/1024/1024
|timechart avg(TotalKVGB), avg(SystemKVGB), avg(Shards)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Disk Usage</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ host=* Machine=* Type=StorageMetrics TrackLatestType=Original
| bucket _time span=5s
| stats sum(KvstoreBytesUsed) as StorageDiskUsedBytes, sum(KvstoreBytesTotal) as StorageDiskTotalBytes by _time
|eval StorageDiskTotalMB = StorageDiskTotalBytes/1024/1024, StorageDiskUsedMB=StorageDiskUsedBytes/1024/1024
| timechart avg(StorageDiskTotalMB) as StorageDiskTotalMB, avg(StorageDiskUsedMB) as StorageDiskUsedMB</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="charting.legend.placement">bottom</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Cluster Roles</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics TrackLatestType="Original"
| rex field=host "(?&lt;HostDC&gt;..).*-..(?&lt;HostConfig&gt;..).*"
| eval HostDC=if(isnotnull(pie_work_unit), pie_work_unit, HostDC)
| makemv delim="," Roles
| stats dc(Machine) as MachineCount by Roles, HostDC
| stats list(HostDC), list(MachineCount) by Roles
| sort Roles</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Storage Engine</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=Role Origination=Recruited As=StorageServer | table StorageEngine, OriginalDateTime, DateTime |head 2</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<title>Cluster Generations</title>
<chart>
<title>Generation increases indicate FDB recoveries</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=TLogMetrics |timechart max(Generation)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
</form>

View File

@@ -0,0 +1,928 @@
<form theme="dark">
<label>FoundationDB - RateKeeper (Dev)</label>
<fieldset submitButton="false">
<input type="text" token="Index" searchWhenChanged="true">
<label>Index</label>
<default>*</default>
</input>
<input type="text" token="LogGroup" searchWhenChanged="true">
<label>LogGroup</label>
<default></default>
</input>
<input type="time" token="TimeSpan" searchWhenChanged="true">
<label>TimeSpan</label>
<default>
<earliest>-60m@m</earliest>
<latest>now</latest>
</default>
</input>
<input type="dropdown" token="UpdateRateTypeToken" searchWhenChanged="true">
<label>RKChart: Normal or Batch</label>
<choice value="">Normal</choice>
<choice value="Batch">Batch</choice>
<default></default>
</input>
<input type="text" token="ChartBinSizeToken" searchWhenChanged="true">
<label>Chart Bin Size</label>
<default>30s</default>
</input>
<input type="dropdown" token="ChartByMachineToken" searchWhenChanged="true">
<label>ClusterStateMetric byMachine</label>
<choice value="by Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<input type="dropdown" token="RolePerformanceChartToken" searchWhenChanged="true">
<label>Role for Proc Perf Charts</label>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="TLog">TLog</choice>
<choice value="Resolver">Resolver</choice>
<choice value="GrvProxyServer">GrvProxyServer</choice>
<choice value="CommitProxyServer">CommitProxyServer</choice>
</input>
<input type="dropdown" token="SourcePerfConnectionToken" searchWhenChanged="true">
<label>Source for Perf Connection</label>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer</choice>
<choice value="Resolver">Resolver</choice>
<choice value="TLog">TLog</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="GrvProxyServer">GrvProxyServer</choice>
<choice value="CommitProxyServer">CommitProxyServer</choice>
</input>
<input type="dropdown" token="DestinationPerfConnectionToken" searchWhenChanged="true">
<label>Dest for Perf Connection</label>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer</choice>
<choice value="Resolver">Resolver</choice>
<choice value="TLog">TLog</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="GrvProxyServer">GrvProxyServer</choice>
<choice value="CommitProxyServer">CommitProxyServer</choice>
</input>
</fieldset>
<row>
<panel>
<title>Aggregated Storage Server Bandwidth</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=StorageMetrics TrackLatestType="Original"
| rex field=BytesQueried "(?&lt;RRate&gt;.*) (?&lt;RRoughness&gt;.*) (?&lt;RCounter&gt;.*)"
| rex field=BytesInput "(?&lt;WRate&gt;.*) (?&lt;WRoughness&gt;.*) (?&lt;WCounter&gt;.*)"
| rex field=BytesFetched "(?&lt;FRate&gt;.*) (?&lt;FRoughness&gt;.*) (?&lt;FCounter&gt;.*)"
| bin span=5s _time
| stats sum(RRate) as ReadSum, sum(WRate) as WriteSum, sum(FRate) as FetchedKeyRate by _time
| eval ReadSpeedMB=ReadSum/1024/1024, WriteSpeedMB=WriteSum/1024/1024, FetchedKeyRateMB=FetchedKeyRate/1024/1024
|timechart avg(ReadSpeedMB), avg(WriteSpeedMB), avg(FetchedKeyRateMB)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
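<!-- Note on the query above: per-storage-server rates are summed in 5s bins to approximate cluster-wide read/write/fetch bandwidth, converted to MB (bytes/1024/1024), and then averaged per chart point. -->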
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Aggregated Proxy Bandwidth</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original"
| makemv delim=" " TxnRequestIn | makemv delim=" " TxnRequestOut | makemv delim=" " TxnStartIn | makemv delim=" " TxnStartOut | makemv delim=" " MutationBytes
| eval TxnRequestInRate=mvindex(TxnRequestIn, 0), TxnRequestOutRate=mvindex(TxnRequestOut, 0), TxnStartInRate=mvindex(TxnStartIn, 0), TxnStartOutRate=mvindex(TxnStartOut, 0), MutationBytesRate=mvindex(MutationBytes, 0)
| bin span=60s _time
| stats avg(TxnRequestInRate) as TxnRequestInRatePerHost, avg(TxnRequestOutRate) as TxnRequestOutRatePerHost, avg(TxnStartInRate) as TxnStartInRatePerHost, avg(TxnStartOutRate) as TxnStartOutRatePerHost, avg(MutationBytesRate) as MutationBytesRatePerHost by Machine,_time
| eval WriteThroughputKB=MutationBytesRatePerHost/1000
| timechart span=1m sum(TxnRequestInRatePerHost), sum(TxnRequestOutRatePerHost), sum(TxnStartInRatePerHost), sum(TxnStartOutRatePerHost), sum(WriteThroughputKB)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 1: Overview - GRV Arrivals and Leaves per Second Seen by Proxies</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original"
| eval TxnRequestIn=mvindex(TxnRequestIn, 0), TxnRequestOut=mvindex(TxnRequestOut, 0), TxnStartIn=mvindex(TxnStartIn, 0), TxnStartOut=mvindex(TxnStartOut, 0)
| timechart span=30s avg(TxnRequestIn) avg(TxnRequestOut) avg(TxnStartIn) avg(TxnStartOut) by Machine</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">249</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 2: RKOverview - Input ReleasedTPS and Output TPSLimit</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| eval _time=Time
| table _time ReleasedTPS TPSLimit
| timechart span=$ChartBinSizeToken$ avg(ReleasedTPS) avg(TPSLimit)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">251</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 3: RKOverview - RKLimitReason</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| eval _time=Time
| table _time Reason</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisLabelsY.majorUnit">1</option>
<option name="charting.axisY.abbreviation">none</option>
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">area</option>
<option name="charting.drilldown">none</option>
<option name="height">249</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 4: Transactions Not Processed - RkSSListFetchTimeout (TpsLimit = 0)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="RkSSListFetchTimeout"
| timechart span=1s count</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 5: Transactions Not Processed - RkTlogMinFreeSpaceZero (TpsLimit = 0)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="RkTlogMinFreeSpaceZero"
| timechart span=1s count</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 6: Transactions Not Processed - ProxyGRVThresholdExceeded</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type="ProxyGRVThresholdExceeded*") AND TrackLatestType="Original"
| timechart span=1s count by Type</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 7: RKLimitReasonCandidate - LimitingStorageServerDurabilityLag (MVCCVersionInMemory)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| timechart span=$ChartBinSizeToken$ avg(LimitingStorageServerDurabilityLag)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 8: RKLimitReasonCandidate - LimitingStorageServerVersionLag (TLogVer-SSVer)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| timechart span=$ChartBinSizeToken$ avg(LimitingStorageServerVersionLag)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 9: RKLimitReasonCandidate - LimitingStorageServerQueue</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=RkUpdate$UpdateRateTypeToken$ AND TrackLatestType="Original"
| replace inf with 100000000000
| timechart span=$ChartBinSizeToken$ avg(LimitingStorageServerQueue)</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 10: Runtime Monitoring - StorageServer MVCCVersionInMemory (storage_server_durability_lag)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics" AND TrackLatestType="Original"
| eval NonDurableVersions=Version-DurableVersion
| timechart span=$ChartBinSizeToken$ limit=0 avg(NonDurableVersions) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
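<!-- Note on the query above: NonDurableVersions = Version - DurableVersion approximates the versions a storage server still holds only in memory; sustained growth corresponds to the storage_server_durability_lag limit reason named in the title. -->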
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">251</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 11: Runtime Monitoring - StorageServer LocalRate (higher MVCCVersionInMemory -&gt; lower LocalRate)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics"
| timechart limit=0 avg(LocalRate) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 12: Runtime Monitoring - StorageServer ReadsRejected (lower LocalRate -&gt; higher probability of rejecting reads)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics"
| timechart limit=0 avg(ReadsRejected) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 13: Runtime Monitoring - Version Lag between StorageServer and Tlog (storage_server_readable_behind)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics" AND TrackLatestType="Original"
| eval SSFallBehindVersions=VersionLag
| timechart span=$ChartBinSizeToken$ limit=0 avg(SSFallBehindVersions) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 14: Runtime Monitoring - StorageServerBytes (storage_server_write_queue_size)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics" AND TrackLatestType="Original"
| makemv delim=" " BytesInput | makemv delim=" " BytesDurable | makemv delim=" " BytesFetched | makemv delim=" " MutationBytes
| eval BytesInput=mvindex(BytesInput, 2), BytesDurable=mvindex(BytesDurable, 2), BytesFetched=mvindex(BytesFetched, 2), MutationBytes=mvindex(MutationBytes, 2), BytesInMemoryQueue=BytesInput-BytesDurable
| timechart span=$ChartBinSizeToken$ limit=0 avg(BytesInMemoryQueue) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
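<!-- Note on the query above: mvindex(..., 2) selects the cumulative total from each "rate roughness total" counter triple, so BytesInMemoryQueue is total bytes received minus total bytes made durable, i.e. the current in-memory write queue. -->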
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 15: Runtime Monitoring - StorageServer KVStore Free Space Ratio (storage_server_min_free_space)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics" AND TrackLatestType="Original"
| eval KvstoreBytesFreeRatio=KvstoreBytesFree/KvstoreBytesTotal
| timechart span=$ChartBinSizeToken$ limit=0 avg(KvstoreBytesFreeRatio) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 16: Runtime Monitoring - TLog Queue Free Space Ratio (log_server_min_free_space)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="TLogMetrics" AND TrackLatestType="Original"
| eval QueueBytesFreeRatio=QueueDiskBytesFree/QueueDiskBytesTotal
| timechart span=$ChartBinSizeToken$ limit=0 avg(QueueBytesFreeRatio) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 17: Runtime Monitoring - TLog KVStore Free Space Ratio (log_server_min_free_space)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="TLogMetrics" AND TrackLatestType="Original"
| eval KvstoreBytesFreeRatio=KvstoreBytesFree/KvstoreBytesTotal
| timechart span=$ChartBinSizeToken$ limit=0 avg(KvstoreBytesFreeRatio) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 18: Runtime Monitoring - TLogBytes (log_server_write_queue)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="TLogMetrics" AND TrackLatestType="Original"
| makemv delim=" " BytesInput
| makemv delim=" " BytesDurable
| eval BytesInput=mvindex(BytesInput, 2), BytesDurable=mvindex(BytesDurable, 2), BytesInMemoryQueue=BytesInput-BytesDurable | timechart span=$ChartBinSizeToken$ limit=0 avg(BytesInMemoryQueue) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 19: Runtime Monitoring - Proxy Throughput</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original"
| timechart span=$ChartBinSizeToken$ limit=0 avg(TxnRequestIn) avg(TxnRequestOut) avg(TxnStartIn) avg(TxnStartOut) avg(TxnStartBatch) avg(TxnStartErrors) avg(TxnCommitIn) avg(TxnCommitVersionAssigned) avg(TxnCommitResolving) avg(TxnCommitResolved) avg(TxnCommitOut) avg(TxnCommitOutSuccess) avg(TxnCommitErrors) avg(TxnThrottled) avg(TxnConflicts) avg(CommitBatchIn) avg(CommitBatchOut) avg(TxnRejectedForQueuedTooLong) avg(Mutations) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 20: Runtime Monitoring - Proxy Queue Length</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ (Type="ProxyMetrics" OR Type="GrvProxyMetrics") AND TrackLatestType="Original" | timechart span=$ChartBinSizeToken$ limit=0 avg(*QueueSize*) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 21: Runtime Monitoring - TLog UnpoppedVersion</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="TLogMetrics" AND TrackLatestType="Original"
| eval UnpoppedVersion=PersistentDataDurableVersion-QueuePoppedVersion
| timechart span=$ChartBinSizeToken$ limit=0 avg(UnpoppedVersion) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 22: Runtime Monitoring - Storage Server Disk (AIODiskStall)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="ProcessMetrics"
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As="StorageServer"
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| timechart span=$ChartBinSizeToken$ limit=0 avg(AIODiskStall) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
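<!-- Note on the query above: the join subsearch keeps only machines that have reported a StorageServer Role event, so AIODiskStall is charted for storage hosts only. -->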
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 23: Runtime Monitoring - StorageServer Query Queue Length</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="StorageMetrics" AND TrackLatestType="Original"
| makemv QueryQueue | eval QueryQueue=mvindex(QueryQueue, 1) | table _time QueryQueue Machine
| timechart span=$ChartBinSizeToken$ limit=0 avg(QueryQueue) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 24: Transaction Trace Stats - GRV Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="GRVByMachineStatsToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<input type="text" token="StatsGRVSpanToken" searchWhenChanged="true">
<label>Span</label>
<default>500ms</default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="TransactionDebug" AND (*ProxyServer.masterProxyServerCore.Broadcast OR *ProxyServer.getLiveCommittedVersion.confirmEpochLive OR *ProxyServer.getLiveCommittedVersion.After)
| table Time Type ID Location Machine Roles
| append
[ search index=$Index$ LogGroup=$LogGroup$ Type="TransactionDebug" AND (*ProxyServer.queueTransactionStartRequests.Before)
| rename ID as ParentID
| table Time Type ParentID Location Machine Roles
| join ParentID
[ search index=$Index$ LogGroup=$LogGroup$ Type="TransactionAttachID"
| rename ID as ParentID
| rename To as ID
| table ParentID ID]
| table Time Type ID Location Machine Roles]
| table Time Type ID Location Machine Roles
| sort 0 Time
| table Machine Location Time Roles Type ID
| stats list(*) by ID
| rename list(*) as *
| eval TBegin=mvindex(Time, 0), TEnd=mvindex(Time, -1), TimeSpan=TEnd-TBegin, _time=TBegin
| bin bins=20 span=$StatsGRVSpanToken$ TimeSpan
| chart limit=0 count by TimeSpan $GRVByMachineStatsToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
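<!-- Note on the query above (reading of the search): GRV debug events are stitched together per request by joining the queueTransactionStartRequests events to their broadcast/confirmEpochLive events via TransactionAttachID (parent ID to child ID); the latency bucketed below is the last event time minus the first event time for each ID. -->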
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">column</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 25: Transaction Trace Stats - GetValue Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="GetValueByMachineStatsToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<input type="text" token="StatsReadSpanToken" searchWhenChanged="true">
<label>Span</label>
<default>500ms</default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(storageServer.received OR getValueQ.DoRead OR getValueQ.AfterVersion OR Reader.Before OR Reader.After OR getValueQ.AfterRead OR NativeAPI.getKeyLocation.Before OR NativeAPI.getKeyLocation.After)
| table Machine Location Time Roles ID Type
| eval Order=case(Location=="NativeAPI.getKeyLocation.Before", 0, Location=="NativeAPI.getKeyLocation.After", 1, Location=="NativeAPI.getValue.Before", 2, Location=="storageServer.received", 3, Location=="getValueQ.DoRead", 4, Location=="getValueQ.AfterVersion", 5, Location=="Reader.Before", 6, Location=="Reader.After", 7, Location=="getValueQ.AfterRead", 8, Location=="NativeAPI.getValue.After", 9, Location=="NativeAPI.getValue.Error", 10)
| sort 0 Time Order
| stats list(*) by ID
| rename list(*) as *
| table Machine Location Time Roles ID Type
| eval count = mvcount(Location)
| search count&gt;2
| eval TEnd=mvindex(Time, -1), TBegin=mvindex(Time, 0), TimeSpan=TEnd-TBegin, _time=TBegin
| table _time ID TimeSpan Machine Location Time
| bin bins=20 span=$StatsReadSpanToken$ TimeSpan
| chart limit=0 count by TimeSpan $GetValueByMachineStatsToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">column</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 26: Transaction Trace Stats - Commit Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="CommitByMachineStatsToken">
<label>By Machine</label>
<choice value="Machine">Yes</choice>
<choice value="">No</choice>
<default>Machine</default>
</input>
<input type="text" token="StatsCommitSpanToken" searchWhenChanged="true">
<label>Span</label>
<default>500ms</default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="CommitDebug" AND (*ProxyServer.commitBatch.Before OR *ProxyServer.commitBatch.GettingCommitVersion OR *ProxyServer.commitBatch.GotCommitVersion OR *ProxyServer.commitBatch.ProcessingMutations OR *ProxyServer.commitBatch.AfterStoreCommits OR *ProxyServer.commitBatch.AfterLogPush OR *ProxyServer.commitBatch.AfterResolution)
| table Time Type ID Location Machine Roles
| sort 0 Time
| table Machine Location Time Roles Type ID
| stats list(*) by ID
| rename list(*) as *
| eval Count=mvcount(Location)
| search Count&gt;=2
| eval TBegin=mvindex(Time, 0), TEnd=mvindex(Time, -1), TimeSpan=TEnd-TBegin, _time=TBegin
| table _time TimeSpan Machine
| bin bins=20 span=$StatsCommitSpanToken$ TimeSpan
| chart limit=0 count by TimeSpan $CommitByMachineStatsToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">column</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 27: Transaction Tracing - GRV Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="GRVLatencyByMachineToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="by Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="TransactionDebug" AND (*ProxyServer.*ProxyServerCore.Broadcast OR *ProxyServer.getLiveCommittedVersion.confirmEpochLive OR *ProxyServer.getLiveCommittedVersion.After)
| table Time Type ID Location Machine Roles
| append
[ search index=$Index$ LogGroup=$LogGroup$ Type="TransactionDebug" AND (*ProxyServer.queueTransactionStartRequests.Before)
| rename ID as ParentID
| table Time Type ParentID Location Machine Roles
| join ParentID
[ search index=$Index$ LogGroup=$LogGroup$ Type="TransactionAttachID"
| rename ID as ParentID
| rename To as ID
| table ParentID ID]
| table Time Type ID Location Machine Roles]
| table Time Type ID Location Machine Roles
| eval Order = case(Location=="NativeAPI.getConsistentReadVersion.Before", 0, Location like "%ProxyServer.queueTransactionStartRequests.Before", 1, Location="MasterProxyServer.masterProxyServerCore.Broadcast", 2, Location like "%ProxyServer.getLiveCommittedVersion.confirmEpochLive", 3, Location like "%ProxyServer.getLiveCommittedVersion.After", 5, Location=="NativeAPI.getConsistentReadVersion.After", 6)
| table Time Order Type ID Location Machine Roles
| sort 0 Order Time
| table Machine Location Time Roles Type ID
| stats list(*) by ID
| rename list(*) as *
| eval T1=mvindex(Time, 0), T2=mvindex(Time, 1), T3=mvindex(Time, 2), T4=mvindex(Time, 3), TimeInQueue = T2-T1, TimeGetVersionFromProxies = if(mvcount(Time)==4, T3-T2, -0.0000001), TimeConfirmLivenessFromTLogs = if(mvcount(Time)==4, T4-T3, T3-T2), TimeSpan=if(mvcount(Time)==4,T4-T1,T3-T1), _time=T1
| table _time TimeSpan TimeInQueue TimeGetVersionFromProxies TimeConfirmLivenessFromTLogs Machine
| timechart span=$ChartBinSizeToken$ limit=0 avg(TimeSpan), avg(TimeInQueue), avg(TimeGetVersionFromProxies), avg(TimeConfirmLivenessFromTLogs) $GRVLatencyByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
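<!-- Note on the query above (reading of the Order eval): with all four events present, T1..T4 correspond to queueTransactionStartRequests.Before, Broadcast, confirmEpochLive, and getLiveCommittedVersion.After, so TimeInQueue = T2-T1, TimeGetVersionFromProxies = T3-T2, and TimeConfirmLivenessFromTLogs = T4-T3. -->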
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 28: Transaction Tracing - GetValue Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="GetValueLatencyByMachineToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="by Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(storageServer.received OR getValueQ.DoRead OR getValueQ.AfterVersion OR Reader.Before OR Reader.After OR getValueQ.AfterRead OR NativeAPI.getKeyLocation.Before OR NativeAPI.getKeyLocation.After)
| table Machine Location Time Roles ID Type
| eval Order=case(Location=="NativeAPI.getKeyLocation.Before", 0, Location=="NativeAPI.getKeyLocation.After", 1, Location=="NativeAPI.getValue.Before", 2, Location=="storageServer.received", 3, Location=="getValueQ.DoRead", 4, Location=="getValueQ.AfterVersion", 5, Location=="Reader.Before", 6, Location=="Reader.After", 7, Location=="getValueQ.AfterRead", 8, Location=="NativeAPI.getValue.After", 9, Location=="NativeAPI.getValue.Error", 10)
| sort 0 Time Order
| stats list(*) by ID
| rename list(*) as *
| table Machine Location Time Roles ID Type
| eval count = mvcount(Location)
| search count&gt;2
| eval TEnd=mvindex(Time, -1), TBegin=mvindex(Time, 0), TimeSpan=TEnd-TBegin, _time=TBegin
| table _time TimeSpan
| timechart span=30s limit=0 avg(TimeSpan) $GetValueLatencyByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 29: Transaction Tracing - Commit Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="CommitByMachineToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="By Machine">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="CommitDebug" AND (*ProxyServer.commitBatch.Before OR *ProxyServer.commitBatch.GettingCommitVersion OR *ProxyServer.commitBatch.GotCommitVersion OR *ProxyServer.commitBatch.ProcessingMutations OR *ProxyServer.commitBatch.AfterStoreCommits OR *ProxyServer.commitBatch.AfterLogPush OR *ProxyServer.commitBatch.AfterResolution)
| table Time Type ID Location Machine Roles
| eval Order=case(Location=="NativeAPI.commit.Before", 0, Location like "%ProxyServer.batcher", 1, Location like "%ProxyServer.commitBatch.Before", 2, Location like "%ProxyServer.commitBatch.GettingCommitVersion", 3, Location like "%ProxyServer.commitBatch.GotCommitVersion", 4, Location=="Resolver.resolveBatch.Before", 5, Location=="Resolver.resolveBatch.AfterQueueSizeCheck", 6, Location=="Resolver.resolveBatch.AfterOrderer", 7, Location=="Resolver.resolveBatch.After", 8, Location like "%ProxyServer.commitBatch.AfterResolution", 8.5, Location like "%ProxyServer.commitBatch.ProcessingMutations", 9, Location like "%ProxyServer.commitBatch.AfterStoreCommits", 10, Location=="TLog.tLogCommit.BeforeWaitForVersion", 11, Location=="TLog.tLogCommit.Before", 12, Location=="TLog.tLogCommit.AfterTLogCommit", 13, Location=="TLog.tLogCommit.After", 14, Location like "%ProxyServer.commitBatch.AfterLogPush", 15, Location=="NativeAPI.commit.After", 16)
| table Time Order Type ID Location Machine Roles
| sort 0 Time Order
| table Machine Location Time Roles Type ID
| stats list(*) by ID
| rename list(*) as *
| eval Count=mvcount(Location)
| search Count=7
| eval T1=mvindex(Time, 0), T2=mvindex(Time, 1), T3=mvindex(Time, 2), T4=mvindex(Time, 3), T5=mvindex(Time, 4), T6=mvindex(Time, 5), T7=mvindex(Time, 6), TimeSpan=T7-T1, TimeResolution=T4-T3, TimePostResolution=T5-T4, TimeProcessingMutation=T6-T5, TimeTLogPush=T7-T6, _time=T1
| table _time TimeSpan TimeResolution TimePostResolution TimeProcessingMutation TimeTLogPush Machine
| timechart span=$ChartBinSizeToken$ limit=0 avg(TimeSpan), avg(TimeResolution), avg(TimePostResolution), avg(TimeProcessingMutation), avg(TimeTLogPush) $CommitByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
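<!-- Note on the query above (reading of the Order eval): for commits with all seven commitBatch events, stage durations come from consecutive timestamps: TimeResolution = GotCommitVersion to AfterResolution, TimePostResolution = AfterResolution to ProcessingMutations, TimeProcessingMutation = ProcessingMutations to AfterStoreCommits, TimeTLogPush = AfterStoreCommits to AfterLogPush, and TimeSpan = Before to AfterLogPush. -->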
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 30: Transaction Tracing - Commit - TLogPush and Resolver Latency (shows CC transactions by default; client transactions appear only when client transaction tracing is enabled manually)</title>
<input type="dropdown" token="TLogResolverByMachineToken" searchWhenChanged="true">
<label>By Machine</label>
<choice value="MachineStep">Yes</choice>
<choice value="Step">No</choice>
<default>Step</default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="CommitDebug" AND (Resolver.resolveBatch.Before OR Resolver.resolveBatch.AfterQueueSizeCheck OR Resolver.resolveBatch.AfterOrderer OR Resolver.resolveBatch.After OR TLog.tLogCommit.BeforeWaitForVersion OR TLog.tLogCommit.Before OR TLog.tLogCommit.AfterTLogCommit OR TLog.tLogCommit.After)
| table Time Type ID Location Machine Roles
| eval Order=case(Location=="NativeAPI.commit.Before", 0, Location=="MasterProxyServer.batcher", 1, Location=="MasterProxyServer.commitBatch.Before", 2, Location=="MasterProxyServer.commitBatch.GettingCommitVersion", 3, Location=="MasterProxyServer.commitBatch.GotCommitVersion", 4, Location=="Resolver.resolveBatch.Before", 5, Location=="Resolver.resolveBatch.AfterQueueSizeCheck", 6, Location=="Resolver.resolveBatch.AfterOrderer", 7, Location=="Resolver.resolveBatch.After", 8, Location=="MasterProxyServer.commitBatch.AfterResolution", 8.5, Location=="MasterProxyServer.commitBatch.ProcessingMutations", 9, Location=="MasterProxyServer.commitBatch.AfterStoreCommits", 10, Location=="TLog.tLogCommit.BeforeWaitForVersion", 11, Location=="TLog.tLogCommit.Before", 12, Location=="TLog.tLogCommit.AfterTLogCommit", 13, Location=="TLog.tLogCommit.After", 14, Location=="MasterProxyServer.commitBatch.AfterLogPush", 15, Location=="NativeAPI.commit.After", 16)
| table Time Order Type ID Location Machine Roles
| sort 0 Time Order
| table Machine Location Time Roles Type ID
| stats list(*) by ID
| rename list(*) as *
| eval Count=mvcount(Location), Step=case(Count=4 and (mvindex(Location, 0) like "TLog%"), "TimeTLogCommit", Count=4 and (mvindex(Location, 0) like "Resolver%"), "TimeResolver", Count=10, "TimeSpan"), BeginTime=mvindex(Time, 0), EndTime=mvindex(Time, -1), Duration=EndTime-BeginTime, _time=BeginTime
| search Count=4
| eval Machinei=mvindex(Machine, 0), MachineStep = Step."-".Machinei
| table _time Step Duration Machinei Location Machine MachineStep
| timechart span=$ChartBinSizeToken$ limit=0 avg(Duration) by $TLogResolverByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 31: Machine Performance - CPU Utilization (CPU Time divided by Elapsed)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics AND TrackLatestType="Original"
| table _time Machine CPUSeconds DiskFreeBytes DiskIdleSeconds DiskQueueDepth DiskReadsCount DiskWriteSectors DiskTotalBytes DiskWritesCount FileReads MbpsReceived MbpsSent Memory ResidentMemory UnusedAllocatedMemory Elapsed
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As=$RolePerformanceChartToken$
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| eval Utilization=CPUSeconds/Elapsed
| timechart span=$ChartBinSizeToken$ avg(Utilization) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
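<!-- Note on the query above: Utilization = CPUSeconds / Elapsed, i.e. the fraction of wall-clock time the process spent on CPU during each ProcessMetrics interval; the join restricts the chart to machines running the role selected in $RolePerformanceChartToken$. -->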
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 32: Machine Performance - Memory Utilization (ResidentMemory divided by Memory)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics AND TrackLatestType="Original"
| table _time Machine CPUSeconds DiskFreeBytes DiskIdleSeconds DiskQueueDepth DiskReadsCount DiskWriteSectors DiskTotalBytes DiskWritesCount FileReads MbpsReceived MbpsSent Memory ResidentMemory UnusedAllocatedMemory
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As=$RolePerformanceChartToken$
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| eval Utilization = ResidentMemory/Memory
| timechart span=$ChartBinSizeToken$ avg(Utilization) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">linear</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 33: Machine Performance - Disk Utilization ((DiskTotalBytes-DiskFreeBytes)/DiskTotalBytes)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics AND TrackLatestType="Original"
| table _time Machine CPUSeconds DiskFreeBytes DiskIdleSeconds DiskQueueDepth DiskReadsCount DiskWriteSectors DiskTotalBytes DiskWritesCount FileReads MbpsReceived MbpsSent Memory ResidentMemory UnusedAllocatedMemory
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As=$RolePerformanceChartToken$
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| eval Utilization = (DiskTotalBytes-DiskFreeBytes)/DiskTotalBytes
| timechart span=$ChartBinSizeToken$ avg(Utilization) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 34: Machine Performance - Network (Mbps Received and Mbps Sent)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics AND TrackLatestType="Original"
| table _time Machine CPUSeconds DiskFreeBytes DiskIdleSeconds DiskQueueDepth DiskReadsCount DiskWriteSectors DiskTotalBytes DiskWritesCount FileReads MbpsReceived MbpsSent Memory ResidentMemory UnusedAllocatedMemory
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As=$RolePerformanceChartToken$
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| timechart span=$ChartBinSizeToken$ avg(MbpsReceived) avg(MbpsSent) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.axisY.scale">log</option>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 35: Machine Performance - Disk (Reads Count and Writes Count)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics AND TrackLatestType="Original"
| table _time Machine CPUSeconds DiskFreeBytes DiskIdleSeconds DiskQueueDepth DiskReadsCount DiskWriteSectors DiskTotalBytes DiskWritesCount FileReads MbpsReceived MbpsSent Memory ResidentMemory UnusedAllocatedMemory
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND As=$RolePerformanceChartToken$
| stats first(Machine) by Machine
| rename first(Machine) as Machine
| table Machine]
| timechart span=$ChartBinSizeToken$ avg(DiskReadsCount) avg(DiskWritesCount) $ChartByMachineToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 36: Network Performance - Timeout</title>
<input type="dropdown" token="TimeoutByConnectionToken" searchWhenChanged="true">
<label>By Connection</label>
<choice value="By Connection">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type=ConnectionTimedOut OR Type=ConnectionTimeout)
| replace *:tls with * in PeerAddr
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($SourcePerfConnectionToken$))
| dedup ID]
| join PeerAddr
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($DestinationPerfConnectionToken$))
| dedup ID
| rename Machine as PeerAddr]
| eval Connection=Machine."-".PeerAddr
| timechart useother=0 span=$ChartBinSizeToken$ count $TimeoutByConnectionToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
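<!-- Note on the query above: the ":tls" suffix is stripped from PeerAddr so it matches the Machine value of the destination role; the two joins keep only timeouts whose source machine runs the selected source role and whose peer runs the selected destination role. -->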
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>Chart 37: Network Performance - PingLatency</title>
<input type="dropdown" token="PingLatencyByConnectionToken" searchWhenChanged="true">
<label>By Connection</label>
<choice value="By Connection">Yes</choice>
<choice value="">No</choice>
<default></default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type=PingLatency)
| replace *:tls with * in PeerAddr
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($SourcePerfConnectionToken$))
| dedup ID]
| join PeerAddr
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($DestinationPerfConnectionToken$))
| dedup ID
| rename Machine as PeerAddr]
| eval Connection=Machine."-".PeerAddr
| timechart useother=0 span=$ChartBinSizeToken$ avg(MeanLatency) avg(MaxLatency) $PingLatencyByConnectionToken$</query>
<earliest>$TimeSpan.earliest$</earliest>
<latest>$TimeSpan.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
</form>

View File

@@ -0,0 +1,873 @@
<form theme="dark">
<label>FoundationDB - Long Recovery (Dev)</label>
<fieldset submitButton="false" autoRun="false"></fieldset>
<row>
<panel>
<title>Table 1: Find long recoveries (enter Index and LogGroup, and select a time span).</title>
<input type="text" token="IndexForOverview" searchWhenChanged="true">
<label>Index</label>
<default>*</default>
</input>
<input type="text" token="LogGroupForOverview" searchWhenChanged="true">
<label>LogGroup</label>
<default></default>
</input>
<input type="time" token="time_token_for_recoveryhistorytable" searchWhenChanged="true">
<label>Select a time span</label>
<default>
<earliest>-0s</earliest>
<latest>now</latest>
</default>
</input>
<table>
<search>
<query>index=$IndexForOverview$ LogGroup=$LogGroupForOverview$
((Type="MasterRecoveryState" AND (Status="reading_coordinated_state" OR Status="fully_recovered" OR Status="accepting_commits")) OR (Type="Role" AND As="MasterServer" AND ("Transition"="Begin" OR "Transition"="End")) OR Type="MasterTerminated") AND (NOT TrackLatestType="Rolled") | eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table ID Machine Type Transition As Status DateTime Time ErrorDescription LogGroup
| search NOT ErrorDescription="Success"
| eval EventType=case(Transition="Begin" AND As="MasterServer" AND Type="Role", "MasterStart", Type="MasterRecoveryState" AND Status="fully_recovered", "FullRecovery", Type="MasterRecoveryState" AND Status="reading_coordinated_state", "StartRecoveryAttempt", Transition="End" AND As="MasterServer" AND Type="Role", "MasterTerminated", Type="MasterTerminated", "MasterTerminated", Type="MasterRecoveryState" AND Status="accepting_commits", "AcceptingCommits")
| table ID Machine EventType DateTime Time ErrorDescription LogGroup
| fillnull value="-"
| sort -Time
| eval ifMasterTerminatedEvent=if(EventType="MasterTerminated", 1, 0)
| stats list(*) by ID Machine ifMasterTerminatedEvent
| rename list(*) as *
| table ID Machine EventType DateTime Time ErrorDescription LogGroup
| sort -Time
| eval LastTime=mvindex(Time, 0), FirstTime=mvindex(Time, -1), Duration=LastTime-FirstTime
| table ID Machine Duration EventType DateTime Time ErrorDescription LogGroup</query>
<earliest>$time_token_for_recoveryhistorytable.earliest$</earliest>
<latest>$time_token_for_recoveryhistorytable.latest$</latest>
</search>
<option name="count">15</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 2: Select a time span containing the long recovery to see all recovery attempts in that span (the input Index, LogGroup, and time span apply to all following tables and charts)</title>
<input type="text" token="Index" searchWhenChanged="true">
<label>Index</label>
<default>*</default>
</input>
<input type="text" searchWhenChanged="true" token="LogGroup">
<label>LogGroup</label>
</input>
<input type="time" token="ReoveryTime" searchWhenChanged="true">
<label>RecoveryTimeSpan</label>
<default>
<earliest>-0s@s</earliest>
<latest>now</latest>
</default>
</input>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="MasterRecoveryState" OR (Type="MasterTerminated") OR (Type="Role" AND As="MasterServer" AND "Transition"="End") OR Type="RecoveryInternal" OR Type="ProxyReplies" OR Type="CommitProxyReplies" OR Type="ResolverReplies" OR Type="MasterRecruitedInitialStorageServers") AND (NOT TrackLatestType="Rolled")
| rename ID as MasterID
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table MasterID Machine Status Step Type DateTime Time StatusCode MyRecoveryCount ErrorDescription Reason ErrorCode
| fillnull value="-" ErrorDescription Reason ErrorCode
| eval Status=case(Type=="MasterRecoveryState", Status, Type=="Role", "RoleEnd", Type=="MasterTerminated", "MasterTerminated", Type=="RecoveryInternal", Status."/".Step, Type=="ProxyReplies" OR Type=="CommitProxyReplies", "initializing_transaction_servers/ProxyReplies", Type="ResolverReplies", "initializing_transaction_servers/ResolverReplies", Type=="MasterRecruitedInitialStorageServers", "initializing_transaction_servers/MasterRecruitedInitialStorageServers"), StatusCode=case(Type=="ProxyReplies" OR Type=="CommitProxyReplies" OR Type=="ResolverReplies" OR Type=="MasterRecruitedInitialStorageServers", "8", Type!="ProxyReplies" AND Type!="CommitProxyReplies" AND Type!="ResolverReplies" AND Type!="MasterRecruitedInitialStorageServers", StatusCode)
| fillnull value="-" StatusCode
| sort 0 -Time -StatusCode
| stats list(*) by MasterID Machine
| rename list(*) as *
| eval FirstTime=mvindex(Time, -1), LastTime=mvindex(Time, 0), Duration=LastTime-FirstTime
| table MasterID Machine MyRecoveryCount Duration ErrorDescription Reason ErrorCode StatusCode Status DateTime Time
| sort -MyRecoveryCount
| fillnull value="-" MyRecoveryCount</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">3</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
<option name="wrap">false</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 3: Why was recovery triggered? Uses the WaitFailureClient event, in which Machine A detects Machine B's failure. The first column is the time when WaitFailureClient happens; columns 2-5 describe A, and columns 6-7 describe B.</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="WaitFailureClient"
| table Type Time Machine FailedEndpoint
| replace *:tls with * in FailedEndpoint
| join Machine type=left
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role" AND Transition="End"
| eval EndTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| rename As as Role
| table ID EndTime Machine Role]
| join FailedEndpoint type=left
[ search index=$Index$ LogGroup=$LogGroup$ Type="Role"
| stats latest(*) by ID | rename latest(*) as *
| rename Machine as FailedEndpoint
| eval FailedEndpointLatestRoleEventInfo=As."/".ID."/".Type.Transition."/".strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| stats list(*) by FailedEndpoint
| rename list(*) as *
| table FailedEndpoint FailedEndpointLatestRoleEventInfo]
| eval FailureDetectedTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| makemv delim=" " FailedEndpointLatestRoleEventInfo
| table FailureDetectedTime Machine ID Role EndTime FailedEndpoint FailedEndpointLatestRoleEventInfo</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="wrap">false</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 4: New Recruitment Configuration (using MasterRecoveredConfig event)</title>
<event>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="MasterRecoveredConfig" AND TrackLatestType="Original"
| eval Configuration=replace(Conf, "&amp;quot;", "\"")
| rename Configuration as _raw</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="list.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</event>
</panel>
</row>
<row>
<panel>
<title>Table 5: Data Centers (using ProcessMetrics event)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup DCID
| rename DCID as DataCenterID
| table DataCenterID pie_work_unit
| fillnull value="-"</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<title>Table 6: New Role (using Role event joined by ProcessMetrics event)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ((As="ClusterController") OR (As="MasterServer") OR (As="TLog") OR (As="Resolver") OR (As="MasterProxyServer") OR (As="CommitProxyServer") OR (As="GrvProxyServer") OR (As="LogRouter")) AND (NOT TrackLatestType="Rolled") AND (NOT Transition="Refresh"))
| eventstats count by ID
| rename As as Role
| search count=1 AND Transition="Begin"
| table ID Role Machine
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| table ID Role Machine DataCenter
| fillnull value="null" DataCenter
| stats count by Role DataCenter</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 7: Role Details</title>
<input type="multiselect" token="RolesToken" searchWhenChanged="true">
<label>Roles</label>
<choice value="MasterServer">MasterServer</choice>
<choice value="TLog">TLog</choice>
<choice value="Resolver">Resolver</choice>
<choice value="MasterProxyServer">MasterProxyServer (for &lt;7.0)</choice>
<choice value="LogRouter">LogRouter</choice>
<choice value="CommitProxyServer">CommitProxyServer (for 7.0+)</choice>
<choice value="GrvProxyServer">GrvProxyServer (for 7.0+)</choice>
<valuePrefix>As="</valuePrefix>
<valueSuffix>"</valueSuffix>
<delimiter> OR </delimiter>
</input>
<input type="dropdown" token="RoleDetailTableWhichRoleToken" searchWhenChanged="true">
<label>Begin/End</label>
<choice value="count=1 AND Transition=&quot;Begin&quot;">Begin</choice>
<choice value="count=1 AND Transition=&quot;End&quot;">End</choice>
<choice value="count=2">Begin-&gt;End</choice>
<default>count=1 AND Transition="Begin"</default>
</input>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($RolesToken$) AND (NOT TrackLatestType="Rolled") AND (NOT Transition="Refresh"))
| eventstats count by ID
| rename As as Role
| search $RoleDetailTableWhichRoleToken$
| table ID Role Machine Time
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| table ID Role Machine DataCenter Time
| fillnull value="null" DataCenter
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table ID Role Machine DataCenter DateTime
| sort 0 -DateTime</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 8: CC Recruitment SevWarn OR SevError (uses events from clusterRecruitFromConfiguration and clusterRecruitRemoteFromConfiguration)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="RecruitFromConfigurationNotAvailable" OR Type="RecruitFromConfigurationRetry" OR Type="RecruitFromConfigurationError" OR Type="RecruitRemoteFromConfigurationNotAvailable" OR Type="RecruitRemoteFromConfigurationRetry" OR Type="RecruitRemoteFromConfigurationError"
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)"), GoodRecruitmentTimeReady=case(Type=="RecruitFromConfigurationNotAvailable" OR Type=="RecruitRemoteFromConfigurationNotAvailable", "True", Type=="RecruitFromConfigurationRetry" OR Type=="RecruitRemoteFromConfigurationRetry", GoodRecruitmentTimeReady, Type=="RecruitFromConfigurationError" OR Type=="RecruitRemoteFromConfigurationError", "-")
| table Type GoodRecruitmentTimeReady Time DateTime</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 9: RecoveryCount of the selected TLog (in Table 11)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(ID=$row.TLogID$ AND Type="TLogStart") OR (LogId=$row.TLogID$ AND Type="TLogPersistentStateRestore")
| eval ID=if(Type="TLogStart", ID, LogId), DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table ID RecoveryCount Type DateTime | fillnull value="Not found. The fdb version is somewhat old."</query>
<earliest>-7d@h</earliest>
<latest>now</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<title>Table 10: Which roles the selected TLog (in Table 11) talks to</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
((Type="TLogRejoining" AND ID=$row.TLogID$) OR ((Type="TLogJoinedMe" OR Type="TLogJoinedMeUnknown" OR Type="TLogRejoinSlow") AND TLog=$row.TLogID$) OR ((Type="TLogLockStarted" OR Type="TLogLocked") AND TLog=$row.TLogID$) OR (Type="TLogStop" AND ID=$row.TLogID$) OR (Type="TLogStop2" AND LogId=$row.TLogID$) OR (Type="Role" AND As="TLog" AND NOT Transition="Refresh" AND ID=$row.TLogID$)) AND (NOT TrackLatestType="Rolled")
| sort -Time
| eval TLogID=case((Type="TLogRejoining"), ID, (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow"), TLog, (Type="TLogLockStarted") OR (Type="TLogLocked"), TLog, (Type="TLogStop"), ID, (Type="TLogStop2"), LogId, Type="Role", ID), TLogEvents=case((Type="TLogRejoining"), Time." ".Type." ".Master, (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow") OR (Type="TLogLockStarted") OR (Type="TLogLocked"), Time." ".Type." ".ID." "."Null", (Type="TLogStop") OR (Type="TLogStop2"), Time." ".Type." "."Null", (Type="Role" AND As="TLog" AND NOT Transition="Refresh" AND NOT TrackLatestType="Rolled"), Time." "."Role".Transition." "."Null")
| stats list(*) by TLogID
| rename list(*) As *
| table TLogID TLogEvents
| eval ignore = if(mvcount(TLogEvents)==1 AND like(mvindex(TLogEvents, 0), "% RoleEnd"), 1, 0)
| search ignore=0
| sort TLogID
| table TLogID TLogEvents
| mvexpand TLogEvents
| eval temp=split(TLogEvents," "), Time=mvindex(temp,0), Event=mvindex(temp,1), MasterID=mvindex(temp,2)
| fields - temp - TLogEvents
| sort 0 -Time
| search NOT MasterID="NULL"
| dedup MasterID
| rename MasterID as ID
| join type=left ID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role")
| sort 0 -Time
| dedup ID
| table ID Machine As]
| table ID Machine As | fillnull value="null" Machine As</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 11: TLog Events (Collecting all TLogs that produce interesting events during the time span)</title>
<input type="text" token="SeeLogEventDetailTableToken" searchWhenChanged="true">
<label>Input * to search</label>
</input>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="TLogRecover") OR (Type="TLogReady") OR (Type="TLogStart") OR
((Type="TLogLockStarted") OR (Type="TLogLocked") OR (Type="TLogStop") OR (Type="TLogStop2")) OR (Type="Role" AND As="TLog" AND NOT Transition="Refresh") AND (NOT TrackLatestType="Rolled") AND $SeeLogEventDetailTableToken$
| sort -Time
| eval TLogID=case((Type="TLogRecover"), LogId, (Type="TLogReady"), ID, (Type="TLogStart"), ID, (Type="TLogLockStarted") OR (Type="TLogLocked"), TLog, (Type="TLogStop"), ID, (Type="TLogStop2"), LogId, Type="Role", ID), TLogEvents=case((Type="TLogRecover"), Time." ".Type." "."null", (Type="TLogReady"), Time." ".Type." "."null", (Type="TLogStart"), Time." ".Type." "."null", (Type="TLogLockStarted") OR (Type="TLogLocked"), Time." ".Type." ".ID." "."null", (Type="TLogStop") OR (Type="TLogStop2"), Time." ".Type." "."null", (Type="Role" AND As="TLog" AND NOT Transition="Refresh" AND NOT TrackLatestType="Rolled"), Time." "."Role".Transition." "."null")
| stats list(TLogEvents) by TLogID
| rename list(TLogEvents) As TLogEvents
| eval EarliestEvent=mvindex(TLogEvents, -1) , LatestEvent=mvindex(TLogEvents, 0)
| table TLogID TLogEvents EarliestEvent LatestEvent
| eval ignore = if(mvcount(TLogEvents)==1 AND like(mvindex(TLogEvents, 0), "% RoleEnd"), 1, 0)
| search ignore=0
| sort TLogID
| join type=left TLogID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND As="TLog")
| sort 0 -Time
| dedup ID
| rename ID as TLogID
| table TLogID host LogGroup Machine]
| table TLogID Machine LogGroup host EarliestEvent LatestEvent
| fillnull value="null" Machine host LogGroup
| eval temp=split(LatestEvent," "), LatestTime=mvindex(temp,0), LatestEvent=mvindex(temp,1), temp2=split(EarliestEvent," "), EarliestTime=mvindex(temp2,0), EarliestEvent=mvindex(temp2,1), Duration=LatestTime-EarliestTime
| table TLogID Machine EarliestTime Duration LogGroup host
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| fillnull value="null" DataCenter
| table TLogID Machine DataCenter EarliestTime Duration host LogGroup
| join type=left TLogID
[ search index=$Index$ LogGroup=$LogGroup$
((Type="TLogRejoining") OR ((Type="TLogJoinedMe" OR Type="TLogJoinedMeUnknown" OR Type="TLogRejoinSlow")) OR ((Type="TLogLockStarted" OR Type="TLogLocked")) OR (Type="TLogStop") OR (Type="TLogStop2") OR (Type="Role" AND As="TLog" AND NOT Transition="Refresh")) AND (NOT TrackLatestType="Rolled")
| sort -Time
| eval TLogID=case((Type="TLogRejoining"), ID, (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow"), TLog, (Type="TLogLockStarted") OR (Type="TLogLocked"), TLog, (Type="TLogStop"), ID, (Type="TLogStop2"), LogId, Type="Role", ID), TLogEvents=case((Type="TLogRejoining"), Time." ".Type." ".Master, (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow") OR (Type="TLogLockStarted") OR (Type="TLogLocked"), Time." ".Type." ".ID." "."Null", (Type="TLogStop") OR (Type="TLogStop2"), Time." ".Type." "."Null", (Type="Role" AND As="TLog" AND NOT Transition="Refresh" AND NOT TrackLatestType="Rolled"), Time." "."Role".Transition." "."Null")
| stats list(*) by TLogID
| rename list(*) As *
| table TLogID TLogEvents
| eval ignore = if(mvcount(TLogEvents)==1 AND like(mvindex(TLogEvents, 0), "% RoleEnd"), 1, 0)
| search ignore=0
| sort TLogID
| table TLogID TLogEvents
| mvexpand TLogEvents
| eval temp=split(TLogEvents," "), Time=mvindex(temp,0), Event=mvindex(temp,1), RoleID=mvindex(temp,2)
| fields - temp - TLogEvents
| sort 0 -Time
| search NOT RoleID="NULL"
| table TLogID RoleID MasterMachine
| stats list(*) by TLogID
| rename list(*) as *
| streamstats count
| mvexpand RoleID
| dedup count RoleID
| fields - count
| stats count by TLogID
| rename count as Roles
| table TLogID Roles]
| table TLogID Machine DataCenter Roles EarliestTime Duration host LogGroup
| join type=left TLogID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="TLogRecover") OR (Type="TLogReady") OR (Type="TLogStart") OR
((Type="TLogRejoinSlow") OR (Type="TLogLockStarted") OR (Type="TLogLocked") OR (Type="TLogStop") OR (Type="TLogStop2") OR (Type="Role" AND As="TLog" AND NOT Transition="Refresh") AND (NOT TrackLatestType="Rolled"))
| sort -Time
| eval TLogID=case((Type="TLogRecover"), LogId, (Type="TLogReady"), ID, (Type="TLogStart"), ID, (Type="TLogRejoinSlow"), TLog, (Type="TLogLockStarted") OR (Type="TLogLocked"), TLog, (Type="TLogStop"), ID, (Type="TLogStop2"), LogId, Type="Role", ID), TLogEvents=if(Type="Role", Type.Transition, Type)
| sort 0 TLogEvents
| stats list(TLogEvents) by TLogID
| rename list(TLogEvents) As TLogEvents
| table TLogID TLogEvents
| eval ignore = if(mvcount(TLogEvents)==1 AND like(mvindex(TLogEvents, 0), "% RoleEnd"), 1, 0)
| search ignore=0
| mvcombine delim=" " TLogEvents
| table TLogID TLogEvents]
| table TLogID Machine DataCenter Roles Duration TLogEvents EarliestTime host LogGroup
| eval EarliestDateTime=strftime(EarliestTime, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table TLogID Machine DataCenter Roles Duration TLogEvents EarliestDateTime host LogGroup
| join type=left TLogID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="TLogStart") OR (Type="TLogPersistentStateRestore")
| eval TLogID=if(Type="TLogStart", ID, LogId)
| table TLogID RecoveryCount]
| table TLogID RecoveryCount Machine DataCenter Roles Duration TLogEvents EarliestDateTime host LogGroup
| fillnull value="TLog too old, click and see details" RecoveryCount</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">cell</option>
<option name="wrap">false</option>
<drilldown>
<set token="row.TLogID">$click.value$</set>
</drilldown>
</table>
</panel>
<panel>
<title>Table 12: Event Details (Including rejoining events) of the selected TLog (in Table 11)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="TLogRecover" AND LogId=$row.TLogID$) OR (Type="TLogReady" AND ID=$row.TLogID$) OR (Type="TLogStart" AND ID=$row.TLogID$) OR
((Type="TLogRejoining" AND ID=$row.TLogID$) OR ((Type="TLogJoinedMe" OR Type="TLogJoinedMeUnknown" OR Type="TLogRejoinSlow") AND TLog=$row.TLogID$) OR ((Type="TLogLockStarted" OR Type="TLogLocked") AND TLog=$row.TLogID$) OR (Type="TLogStop" AND ID=$row.TLogID$) OR (Type="TLogStop2" AND LogId=$row.TLogID$) OR (Type="Role" AND As="TLog" AND NOT Transition="Refresh" AND ID=$row.TLogID$)) AND (NOT TrackLatestType="Rolled")
| sort -Time
| eval TLogID=case((Type="TLogRecover"), LogId, (Type="TLogReady"), ID, (Type="TLogStart"), ID, (Type="TLogRejoining"), ID, (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow"), TLog, (Type="TLogLockStarted") OR (Type="TLogLocked"), TLog, (Type="TLogStop"), ID, (Type="TLogStop2"), LogId, Type="Role", ID), TLogEvents=case((Type="TLogRecover"), Time." ".Type." "."-"." "."-", (Type="TLogReady"), Time." ".Type." "."-"." "."-", (Type="TLogStart"), Time." ".Type." "."-"." "."-", (Type="TLogRejoining"), Time." ".Type." ".Master." "."-", (Type="TLogJoinedMe") OR (Type="TLogJoinedMeUnknown") OR (Type="TLogRejoinSlow") OR (Type="TLogLockStarted") OR (Type="TLogLocked"), Time." ".Type." ".ID." "."-", (Type="TLogStop") OR (Type="TLogStop2"), Time." ".Type." "."-"." "."-", (Type="Role" AND As="TLog" AND Transition="Begin" AND NOT TrackLatestType="Rolled"), Time." "."Role".Transition." "."-"." ".Origination, (Type="Role" AND As="TLog" AND Transition="End" AND NOT TrackLatestType="Rolled"), Time." "."Role".Transition." "."-"." "."-")
| stats list(*) by TLogID
| rename list(*) As *
| table TLogID TLogEvents
| eval ignore = if(mvcount(TLogEvents)==1 AND like(mvindex(TLogEvents, 0), "% RoleEnd"), 1, 0)
| search ignore=0
| sort TLogID
| join type=left TLogID
[ search index=$Index$ LogGroup=$LogGroup$ (Type="Role" AND As="TLog" AND ID=$row.TLogID$)
| dedup ID
| rename ID as TLogID
| table TLogID Machine]
| table TLogID Machine TLogEvents
| fillnull value="-" Machine
| mvexpand TLogEvents
| eval temp=split(TLogEvents," "), Time=mvindex(temp,0), Event=mvindex(temp,1), ToID=mvindex(temp,2), Origination= mvindex(temp,3)
| fields - temp - TLogEvents
| join type=left
[ search index=$Index$ LogGroup=$LogGroup$ (Type="Role")
| dedup ID
| rename ID as ToID
| rename As as ToRole
| rename Machine as ToMachine
| table ToID ToRole ToMachine]
| sort 0 -Time
| fillnull value="-" ToRole ToMachine
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table TLogID Machine Event DateTime ToID ToRole ToMachine Time DateTime</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">14</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
<option name="wrap">false</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 13: All Tags of the selected TLog (in Table 11) that have been popped by SSes (using TLogPoppedTag event)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(ID=$row.TLogID$ AND Type="TLogPoppedTag")
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| rename ID as TLogID
| rename Tags as UnpoppedRecoveredTagCount
| rename Tag as TagPopped
| rename DurableKCVer as DurableKnownCommittedVersion
| search TagPopped!="-1:2"
| table TLogID DateTime UnpoppedRecoveredTagCount TagPopped DurableKnownCommittedVersion RecoveredAt
| sort 0 -UnpoppedRecoveredTagCount
| join TagPopped type=left
[ search index=$Index$ LogGroup=$LogGroup$
(Type="StorageMetrics")
| stats latest(*) by Machine
| rename latest(*) as *
| rename Tag as TagPopped
| table TagPopped ID Machine]
| table TLogID DateTime UnpoppedRecoveredTagCount TagPopped DurableKnownCommittedVersion RecoveredAt ID Machine
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| rename ID as SSID
| rename Machine as SSMachine
| rename DataCenter as SSDataCenter
| table TLogID DateTime UnpoppedRecoveredTagCount TagPopped SSID SSMachine SSDataCenter DurableKnownCommittedVersion RecoveredAt
| fillnull value="-"</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
<option name="wrap">false</option>
</table>
</panel>
<panel>
<title>Table 14: All Tags of the selected TLog (in Table 11) to be popped by SSes (using TLogReady event)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(ID=$row.TLogID$ AND Type="TLogReady")
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| rename ID as TLogID
| table TLogID Type AllTags Locality
| makemv delim="," AllTags
| mvexpand AllTags
| rename AllTags as Tag | sort 0 Tag
| join Tag type=left
[ search index=$Index$ LogGroup=$LogGroup$
(Type="StorageMetrics")
| stats latest(*) by Machine
| rename latest(*) as *
| table Tag ID Machine]
| table TLogID Tag ID Machine
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| fillnull value="-"
| table TLogID Tag ID Machine DataCenter
| rename ID as SSID | rename Machine as SSMachine | rename DataCenter as SSDataCenter
| search Tag!="-1:2"</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 15: The Tags of the selected TLog (in Table 11) that are not popped by SSes (using the set difference of the tags in Table 13 and Table 14) (if the result contains "...", the result of Table 15 is wrong)</title>
<table>
<search>
<query>| set diff
[ search index=$Index$ LogGroup=$LogGroup$
(ID=$row.TLogID$ AND Type="TLogReady")
| table AllTags
| makemv delim="," AllTags
| mvexpand AllTags
| rename AllTags as Tag
| table Tag]
[ search index=$Index$ LogGroup=$LogGroup$
(ID=$row.TLogID$ AND Type="TLogPoppedTag")
| table Tag]</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<title>Table 16: All Current Storage Servers (assume each machine has at most one SS)</title>
<input type="text" token="TriggerSSTableToken" searchWhenChanged="true">
<label>Input * to search</label>
</input>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="StorageMetrics") AND $TriggerSSTableToken$
| stats latest(*) by Machine
| rename latest(*) as *
| table Tag ID Machine
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| table ID Machine DataCenter Tag
| join ID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ((As="StorageServer")) AND (NOT TrackLatestType="Rolled"))
| stats latest(*) by Machine
| rename latest(*) as *
| rename As as Role
| table ID Role Machine
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| table ID Role Machine DataCenter
| fillnull value="null" DataCenter]
| sort 0 DataCenter
| table Tag ID Machine DataCenter | sort 0 Tag</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Chart 1: Timeout/TimedOut event distribution grouped by source (Machine)</title>
<input type="text" token="TimeoutEventByMachineTableTimeSpanToken" searchWhenChanged="true">
<label>TimeSpan</label>
<default>5s</default>
</input>
<input type="multiselect" token="TimeoutbyMachineTableSourceRoleToken" searchWhenChanged="true">
<label>Select Source Roles</label>
<choice value="TLog">TLog</choice>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer (for version &lt; 7)</choice>
<choice value="Resolver">Resolver</choice>
<choice value="ClusterController">ClusterController</choice>
<choice value="SharedTLog">SharedTLog</choice>
<choice value="LogRouter">LogRouter</choice>
<choice value="Coordinator">Coordinator</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="CommitProxyServer">CommitProxyServer (for version 7+)</choice>
<choice value="GrvProxyServer">GrvProxyServer (for ver 7+)</choice>
<valuePrefix>As="</valuePrefix>
<valueSuffix>"</valueSuffix>
<delimiter> OR </delimiter>
</input>
<input type="multiselect" token="TimeoutbyMachineTableDestinationRoleToken" searchWhenChanged="true">
<label>Select Destination Roles</label>
<choice value="TLog">TLog</choice>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer (for version &lt;7)</choice>
<choice value="Resolver">Resolver</choice>
<choice value="ClusterController">ClusterController</choice>
<choice value="SharedTLog">SharedTLog</choice>
<choice value="LogRouter">LogRouter</choice>
<choice value="Coordinator">Coordinator</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="CommitProxyServer">CommitProxyServer (for version 7+)</choice>
<choice value="GrvProxyServer">GrvProxyServer (for version 7+)</choice>
<valuePrefix>As="</valuePrefix>
<valueSuffix>"</valueSuffix>
<delimiter> OR </delimiter>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type=ConnectionTimedOut OR Type=ConnectionTimeout)
| replace *:tls with * in PeerAddr
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableSourceRoleToken$))
| dedup ID]
| join PeerAddr
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableDestinationRoleToken$))
| dedup ID
| rename Machine as PeerAddr]
| timechart useother=0 span=$TimeoutEventByMachineTableTimeSpanToken$ count by Machine</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">233</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Chart 2: Timeout/TimedOut event distribution grouped by destination (PeerAddr)</title>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type=ConnectionTimedOut OR Type=ConnectionTimeout)
| replace *:tls with * in PeerAddr
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableSourceRoleToken$))
| dedup ID]
| join PeerAddr
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableDestinationRoleToken$))
| dedup ID
| rename Machine as PeerAddr]
| timechart useother=0 span=$TimeoutEventByMachineTableTimeSpanToken$ count by PeerAddr</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">219</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Table 17: Check Type=ConnectionTimedOut OR Type=ConnectionTimeout events between transaction roles in the recovery (including roles that refresh/begin/end in the timespan)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type=ConnectionTimedOut OR Type=ConnectionTimeout)
| replace *:tls with * in PeerAddr
| stats count as TotalTimeouts by Machine PeerAddr
| table Machine PeerAddr TotalTimeouts
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableSourceRoleToken$))
| stats latest(*) by ID
| rename latest(*) as *
| eval Role = As."/".ID."/".Type.Transition."/".strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| stats list(Role) AS MachineRoleLatestEvent BY Machine
]
| join PeerAddr
[ search index=$Index$ LogGroup=$LogGroup$
(Type="Role" AND ($TimeoutbyMachineTableDestinationRoleToken$))
| stats latest(*) by ID
| rename latest(*) as *
| eval Role = As."/".ID."/".Type.Transition."/".strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| stats list(Role) AS PeerRoleLatestEvent BY Machine
| rename Machine AS PeerAddr
]
| table Machine PeerAddr TotalTimeouts MachineRoleLatestEvent PeerRoleLatestEvent</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 18: Proxy 0</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Type="ProxyReplies" OR Type="CommitProxyReplies") AND FirstProxy="True"
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table WorkerID LogGroup FirstProxy Time DateTime
| sort 0 -Time
| join type=left WorkerID
[ search index=$Index$ LogGroup=$LogGroup$
Type="Role" AND As="Worker" AND Transition="Refresh"
| dedup ID
| rename ID as WorkerID
| stats list(*) by WorkerID
| rename list(*) as *
| table WorkerID Machine Roles]
| table WorkerID Machine Roles LogGroup FirstProxy Time DateTime
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type="Role" AND (As="MasterProxyServer" OR As="CommitProxyServer") AND Transition="Refresh"
| dedup ID
| rename ID as ProxyID
| table Machine ProxyID]
| table ProxyID Machine LogGroup FirstProxy</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Table 19: Latest Role Events on the input Machine (Input Machine, like 172.27.113.121:4500)</title>
<input type="text" token="SearchMachineToken" searchWhenChanged="true">
<label>Machine (IP:PORT)</label>
</input>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="Role" AND Machine=$SearchMachineToken$
| stats latest(*) by ID Transition
| rename latest(*) as *
| eval DateTime=strftime(Time, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table DateTime Machine ID Transition As Roles LogGroup Error ErrorDescription Reason
| sort 0 -DateTime
| fillnull value="-"</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Chart 3: severity&gt;=20 event distribution (including roles that refresh/begin/end in the timespan)</title>
<input type="text" token="BadEvents" searchWhenChanged="true">
<label>Events</label>
<default>*</default>
</input>
<input type="multiselect" token="BadEventRoleToken" searchWhenChanged="true">
<label>Roles</label>
<choice value="TLog">TLog</choice>
<choice value="MasterServer">MasterServer</choice>
<choice value="MasterProxyServer">MasterProxyServer (for version &lt;7)</choice>
<choice value="Resolver">Resolver</choice>
<choice value="ClusterController">ClusterController</choice>
<choice value="SharedTLog">SharedTLog</choice>
<choice value="LogRouter">LogRouter</choice>
<choice value="Coordinator">Coordinator</choice>
<choice value="StorageServer">StorageServer</choice>
<choice value="CommitProxyServer">CommitProxyServer (for version 7+)</choice>
<choice value="GrvProxyServer">GrvProxyServer (for version 7+)</choice>
<valuePrefix>As="</valuePrefix>
<valueSuffix>"</valueSuffix>
<delimiter> OR </delimiter>
</input>
<input type="dropdown" token="BadEventChartBy" searchWhenChanged="true">
<label>By</label>
<choice value="Type">EventType</choice>
<choice value="Machine">Machine</choice>
<choice value="Severity">Severity</choice>
<default>Type</default>
</input>
<input type="text" token="BadEventChartTimeSpanToken" searchWhenChanged="true">
<label>TimeSpan</label>
<default>5s</default>
</input>
<chart>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Severity&gt;10 AND $BadEvents$
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type="Role" AND ($BadEventRoleToken$)
| dedup ID | table Machine]
| table Machine Type Severity _time
| timechart useother=0 span=$BadEventChartTimeSpanToken$ count by $BadEventChartBy$</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">305</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>Table 20: Check severity&gt;=20 events of roles in the recovery (including roles that refresh/begin/end in the timespan)</title>
<table>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Severity&gt;10
| stats count by Machine Type
| rename count as Count
| join Machine
[ search index=$Index$ LogGroup=$LogGroup$
Type="Role" AND ($BadEventRoleToken$)
| dedup ID
| eval Role=As."-".ID
| stats list(Role) by Machine
| rename list(Role) as Roles
| table Machine Roles]
| table Type Count Roles Machine
| sort -Count</query>
<earliest>$ReoveryTime.earliest$</earliest>
<latest>$ReoveryTime.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
<option name="wrap">false</option>
</table>
</panel>
</row>
</form>

View File

@ -0,0 +1,247 @@
<form theme="dark">
<label>FoundationDB - Tracing GRV and Commit Long Latency of CC Transactions (6.3 and 7.0+) (DEV)</label>
<description>Designed for ClusterController-issued transactions.</description>
<fieldset submitButton="false" autoRun="true">
<input type="text" token="Index" searchWhenChanged="true">
<label>Index</label>
<default></default>
</input>
<input type="text" token="LogGroup" searchWhenChanged="true">
<label>LogGroup</label>
<default>*</default>
</input>
<input type="text" token="transactionID">
<label>Hex Transaction ID (optional)</label>
<default>*</default>
</input>
<input type="time" token="time_token" searchWhenChanged="true">
<label>Time span</label>
<default>
<earliest>@d</earliest>
<latest>now</latest>
</default>
</input>
</fieldset>
<row>
<panel>
<title>All Transactions (currently, this table also does not cover getrange operations or operations that do not commit).</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ ID=$transactionID$
(Type="TransactionAttachID" OR Type="GetValueAttachID" OR Type="CommitAttachID")
| eval To=case(Type=="TransactionAttachID", "0"."-".To, Type="GetValueAttachID", "1"."-".To, Type=="CommitAttachID", "2"."-".To)
| stats list(To) by ID
| rename list(To) as ToList
| table ID ToList
| eval Count = mvcount(ToList)
| search Count=3
| eval To0=mvindex(ToList,0), To1=mvindex(ToList,1), To2=mvindex(ToList,2), To0=split(To0,"-"), To1=split(To1,"-"), To2=split(To2,"-"), GrvID=case(mvindex(To0, 0)=="0", mvindex(To0, 1), mvindex(To1, 0)=="0", mvindex(To1, 1), mvindex(To2, 0)=="0", mvindex(To2, 1)), ReadID=case(mvindex(To0, 0)=="1", mvindex(To0, 1), mvindex(To1, 0)=="1", mvindex(To1, 1), mvindex(To2, 0)=="1", mvindex(To2, 1)), CommitID=case(mvindex(To0, 0)=="2", mvindex(To0, 1), mvindex(To1, 0)=="2", mvindex(To1, 1), mvindex(To2, 0)=="2", mvindex(To2, 1))
| table ID GrvID ReadID CommitID
| join GrvID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="TransactionDebug" AND Location="NativeAPI.getConsistentReadVersion.Before")
| rename ID as GrvID
| rename Time as BeginTime
| table GrvID BeginTime
]
| join GrvID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="TransactionDebug" AND Location="NativeAPI.getConsistentReadVersion.After")
| rename ID as GrvID
| rename Time as GRVDoneTime
| table GrvID GRVDoneTime
]
| join ReadID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="GetValueDebug" AND Location="NativeAPI.getValue.After")
| rename ID as ReadID
| rename Time as ReadDoneTime
| table ReadID ReadDoneTime
]
| join CommitID
[ search index=$Index$ LogGroup=$LogGroup$
(Type="CommitDebug" AND Location="NativeAPI.commit.After")
| rename ID as CommitID
| rename Time as CommitDoneTime
| table CommitID CommitDoneTime
]
| rename ID as TransactionID
| eval BeginToGRVDone = GRVDoneTime-BeginTime, GRVDoneToReadDone = ReadDoneTime-GRVDoneTime, ReadDoneToCommitDone = CommitDoneTime-ReadDoneTime, Duration=CommitDoneTime-BeginTime, BeginTimeScope=BeginTime-1, EndTimeScope=CommitDoneTime+1, BeginDateTime=strftime(BeginTime, "%Y-%m-%d %H:%M:%S.%Q (%Z)")
| table TransactionID Duration BeginDateTime BeginToGRVDone GRVDoneToReadDone ReadDoneToCommitDone Duration GrvID ReadID CommitID BeginTimeScope EndTimeScope | sort -Duration</query>
<earliest>$time_token.earliest$</earliest>
<latest>$time_token.latest$</latest>
</search>
<option name="drilldown">cell</option>
<drilldown>
<set token="BeginTime">$row.BeginTimeScope$</set>
<set token="EndTime">$row.EndTimeScope$</set>
<set token="ReadID">$row.ReadID$</set>
<set token="GrvID">$row.GrvID$</set>
<set token="CommitID">$row.CommitID$</set>
</drilldown>
</table>
</panel>
</row>
<row>
<panel>
<title>Step1: GRV</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="TransactionDebug" AND (NOT MasterProxyServer.masterProxyServerCore.GetRawCommittedVersion)
AND (ID=$GrvID$ OR ID=
[ search index=$Index$ LogGroup=$LogGroup$
Type="TransactionAttachID" AND ID=$GrvID$
| return $To])
| table Time Type ID Location Machine Roles
| eventstats min(Time) as MinTime
| eval Delta = Time - MinTime, Order = case(Location=="NativeAPI.getConsistentReadVersion.Before", 0, Location like "%ProxyServer.queueTransactionStartRequests.Before", 1, Location=="MasterProxyServer.masterProxyServerCore.Broadcast", 2, Location=="GrvProxyServer.transactionStarter.AskLiveCommittedVersionFromMaster", 2.1, Location like "%ProxyServer.getLiveCommittedVersion.confirmEpochLive", 3, Location=="MasterServer.serveLiveCommittedVersion.GetRawCommittedVersion", 4, Location like "%ProxyServer.getLiveCommittedVersion.After", 5, Location=="NativeAPI.getConsistentReadVersion.After", 6)
| table Time Delta Order Type ID Location Machine Roles
| sort 0 Order
| table Machine Location Delta Time Roles ID Type</query>
<earliest>$BeginTime$</earliest>
<latest>$EndTime$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
<panel>
<title>Step1: (Only for FDB v6.3): GRV --- Get Committed Version (MasterProxyServer.masterProxyServerCore.GetRawCommittedVersion Events)</title>
<table>
<title>only for FDB 6.3</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="TransactionDebug" AND Location="MasterProxyServer.masterProxyServerCore.GetRawCommittedVersion"
AND ID=
[ search index=$Index$ LogGroup=$LogGroup$
Type="TransactionAttachID" AND ID=$GrvID$
| return $To]
| table Time Type ID Location Machine Roles
| eventstats min(Time) as MinTime
| eval Delta = Time - MinTime
| sort 0 -Time
| table Machine Delta Time Roles ID Type</query>
<earliest>$BeginTime$</earliest>
<latest>$EndTime$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Step2: GetValue</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$ Type="GetValueDebug" AND ID=$ReadID$
| eventstats min(Time) as MinTime
| eval Delta = Time-MinTime
| table Machine Location Delta Time Roles ID Type
| eval Order=case(Location=="NativeAPI.getKeyLocation.Before", 0, Location=="NativeAPI.getKeyLocation.After", 1, Location=="NativeAPI.getValue.Before", 2, Location=="storageServer.received", 3, Location=="getValueQ.DoRead", 4, Location=="getValueQ.AfterVersion", 5, Location=="Reader.Before", 6, Location=="Reader.After", 7, Location=="getValueQ.AfterRead", 8, Location=="NativeAPI.getValue.After", 9, Location=="NativeAPI.getValue.Error", 10)
| sort 0 Order
| table Machine Location Delta Time Roles ID Type</query>
<earliest>$time_token.earliest$</earliest>
<latest>$time_token.latest$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Step3: Commit</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
Type="CommitDebug" AND (ID=$CommitID$ OR ID=
[ search index=$Index$ LogGroup=$LogGroup$
Type="CommitAttachID" AND ID=$CommitID$
| return $To])
| table Time Type ID Location Machine Roles
| eventstats min(Time) as MinTime
| eval Delta = Time-MinTime
| table Machine Location Delta Time Roles ID Type
| eval Order=case(Location=="NativeAPI.commit.Before", 0, Location like "%ProxyServer.batcher", 1, Location like "%ProxyServer.commitBatch.Before", 2, Location like "%ProxyServer.commitBatch.GettingCommitVersion", 3, Location like "%ProxyServer.commitBatch.GotCommitVersion", 4, Location=="Resolver.resolveBatch.Before", 5, Location=="Resolver.resolveBatch.AfterQueueSizeCheck", 6, Location=="Resolver.resolveBatch.AfterOrderer", 7, Location=="Resolver.resolveBatch.After", 8, Location like "%ProxyServer.commitBatch.AfterResolution", 8.5, Location like "%ProxyServer.commitBatch.ProcessingMutations", 9, Location like "%ProxyServer.commitBatch.AfterStoreCommits", 10, Location=="TLogServer.tLogCommit.BeforeWaitForVersion", 11, Location=="TLogServer.tLogCommit.Before", 12, Location=="TLogServer.tLogCommit.AfterTLogCommit", 13, Location=="TLogServer.tLogCommit.After", 14, Location like "%ProxyServer.commitBatch.AfterLogPush", 15, Location=="NativeAPI.commit.After", 16)
| sort 0 Order
| table Machine Location Delta Time Roles ID Type</query>
<earliest>$BeginTime$</earliest>
<latest>$EndTime$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Step3: Commit --- Resolver</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Location="Resolver*")
| join ID
[ search index=$Index$ LogGroup=$LogGroup$
Type="CommitAttachID" AND ID=
[ search index=$Index$ LogGroup=$LogGroup$
Type="CommitAttachID" AND ID=$CommitID$
| return $To]
| rename To as ID
| table ID]
| eventstats min(Time) as MinTime
| eval Delta = Time-MinTime
| eval Order=case(Location=="Resolver.resolveBatch.Before", 5, Location=="Resolver.resolveBatch.AfterQueueSizeCheck", 6, Location=="Resolver.resolveBatch.AfterOrderer", 7, Location=="Resolver.resolveBatch.After", 8)
| sort 0 Time Order
| stats list(*) by Type ID Machine Roles
| rename list(*) as *
| eval T1=mvindex(Time, 0), T2=mvindex(Time, 3), Duration=T2-T1 | sort -Duration
| table Machine Roles Duration Location Delta Time
| join type=left Machine
[ search index=$Index$ LogGroup=$LogGroup$ Type=ProcessMetrics
| dedup Machine, DCID
| rename DCID as DataCenter
| table Machine DataCenter]
| table Machine DataCenter Roles Duration Location Delta Time</query>
<earliest>$time_token.earliest$</earliest>
<latest>$time_token.latest$</latest>
</search>
<option name="drilldown">none</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Step3: Commit --- Commit to TLogs (CommitDebug Events), grouped by Machine and sorted by Duration</title>
<table>
<title>for FDB 6.3 and 7.0+</title>
<search>
<query>index=$Index$ LogGroup=$LogGroup$
(Location="TLog*")
| join ID
[ search index=$Index$ LogGroup=$LogGroup$
Type="CommitAttachID" AND ID=
[ search index=$Index$ LogGroup=$LogGroup$
Type="CommitAttachID" AND ID=$CommitID$
| return $To]
| rename To as ID
| table ID]
| eventstats min(Time) as MinTime
| eval Delta = Time-MinTime
| sort 0 Time
| stats list(*) by Type ID Machine Roles
| rename list(*) as *
| eval T1=mvindex(Time, 0), T2=mvindex(Time, 3), Duration=T2-T1 | sort -Duration
| table Machine Roles Duration Location Delta Time</query>
<earliest>$BeginTime$</earliest>
<latest>$EndTime$</latest>
</search>
<option name="count">10</option>
<option name="drilldown">none</option>
</table>
</panel>
</row>
</form>

View File

@ -165,7 +165,6 @@ def centos_image_with_fdb_helper(versioned: bool) -> Iterator[Optional[Image]]:
container = Container("centos:7", initd=True)
for rpm in rpms:
container.copy_to(rpm, "/opt")
container.run(["bash", "-c", "yum update -y"])
container.run(
["bash", "-c", "yum install -y prelink"]
) # this is for testing libfdb_c execstack permissions
@ -327,7 +326,7 @@ def test_execstack_permissions_libfdb_c(linux_container: Container, snapshot):
[
"bash",
"-c",
"execstack -q $(ldconfig -p | grep libfdb_c | awk '{print $(NF)}')",
"execstack -q $(ldconfig -p | grep libfdb_c.so | awk '{print $(NF)}')",
]
)

View File

@ -284,6 +284,12 @@ class ErrorCommitInfo(BaseInfo):
        if protocol_version >= PROTOCOL_VERSION_6_3:
            self.report_conflicting_keys = bb.get_bool()
        if protocol_version >= PROTOCOL_VERSION_7_1:
            lock_aware = bb.get_bool()
            if bb.get_bool():
                spanId = bb.get_bytes(16)
class UnsupportedProtocolVersionError(Exception):
    def __init__(self, protocol_version):
        super().__init__("Unsupported protocol version 0x%0.2X" % protocol_version)

View File

@ -0,0 +1,5 @@
# ThreadSanitizer suppressions file for FDB
# https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions
# FDB signal handler is not async-signal safe
signal:crashHandler

View File

@ -20,7 +20,7 @@ Data distribution manages the lifetime of storage servers, decides which storage
**RelocateShard (`struct RelocateShard`)**: A `RelocateShard` records the key range that needs to be moved among servers and the data movement's priority. DD always moves shards with higher priorities first.
**Data distribution queue (`struct DDQueueData`)**: It receives shards to be relocated (i.e., RelocateShards), decides which shard should be moved to which server team, prioritizes the data movements based on the relocate shards' priorities, and controls the progress of data movement based on the servers' workload.
**Data distribution queue (`struct DDQueue`)**: It receives shards to be relocated (i.e., RelocateShards), decides which shard should be moved to which server team, prioritizes the data movements based on the relocate shards' priorities, and controls the progress of data movement based on the servers' workload.
**Special keys in the system keyspace**: DD saves its state in the system keyspace to recover from failure and to ensure every process (e.g., commit proxies, tLogs and storage servers) has a consistent view of which storage server is responsible for which key range.
@ -153,3 +153,25 @@ CPU utilization. This metric is in a positive relationship with “FinishedQueri
* The typical movement size under a read-skew scenario is 100M ~ 600M under the default knob values `READ_REBALANCE_MAX_SHARD_FRAC=0.2, READ_REBALANCE_SRC_PARALLELISM = 20`. Increasing those knobs may accelerate the convergence speed, at the risk of data movement churn that overwhelms the destination and over-cools the source.
* The upper bound of `READ_REBALANCE_MAX_SHARD_FRAC` is 0.5. Any value larger than 0.5 can result in hot server switching.
* When a deeper diagnosis of read-aware DD is needed, the `BgDDMountainChopper_New` and `BgDDValleyFiller_New` trace events are where to go.
## Data Distribution Diagnosis Q&A
* Why hasn't Read-aware DD been triggered when there's a read imbalance?
* Check the `SkipReason` field of the `BgDDMountainChopper_New` and `BgDDValleyFiller_New` events.
* Read-aware DD was triggered and some data movement happened, but it doesn't help the read balance. Why?
* You need to figure out which servers were selected as the source and destination. The information is in the `DestTeam` and `SourceTeam` fields of `BgDDMountainChopper*` and `BgDDValleyFiller*`.
* Also, the `DDQueueServerCounter` event tells how many times a server has been a source or destination (defined in
```c++
enum CountType : uint8_t { ProposedSource = 0, QueuedSource, LaunchedSource, LaunchedDest };
```
) for different relocation reasons (`Other`, `RebalanceDisk`, and so on) in different phases within `DD_QUEUE_COUNTER_REFRESH_INTERVAL` (default 60) seconds. For example,
```xml
<Event Severity="10" Time="1659974950.984176" DateTime="2022-08-08T16:09:10Z" Type="DDQueueServerCounter" ID="0000000000000000" ServerId="0000000000000004" OtherPQSD="0 1 3 2" RebalanceDiskPQSD="0 0 1 4" RebalanceReadPQSD="2 0 0 5" MergeShardPQSD="0 0 1 0" SizeSplitPQSD="0 0 5 0" WriteSplitPQSD="1 0 0 0" ThreadID="9733255463206053180" Machine="0.0.0.0:0" LogGroup="default" Roles="TS" />
```
`RebalanceReadPQSD="2 0 0 5"` means server `0000000000000004` has been proposed as a source for read balancing twice, but those relocations have not been queued or executed yet. This server has also been a destination for read balancing 5 times in the past minute. Note that the field is skipped if all 4 numbers are 0. To avoid spammy traces, if the knob `DD_QUEUE_COUNTER_SUMMARIZE = true` is enabled, the `DDQueueServerCounterTooMany` event summarizes the unreported servers involved in launched relocations (i.e., servers whose `LaunchedSource` or `LaunchedDest` counts are non-zero):
```xml
<Event Severity="10" Time="1660095057.995837" DateTime="2022-08-10T01:30:57Z" Type="DDQueueServerCounterTooMany" ID="0000000000000000" RemainedLaunchedSources="000000000000007f,00000000000000d9,00000000000000e8,000000000000014c,0000000000000028,00000000000000d6,0000000000000067,000000000000003e,000000000000007d,000000000000000a,00000000000000cb,0000000000000106,00000000000000c1,000000000000003c,000000000000016e,00000000000000e4,000000000000013c,0000000000000016,0000000000000179,0000000000000061,00000000000000c2,000000000000005a,0000000000000001,00000000000000c9,000000000000012a,00000000000000fb,0000000000000146," RemainedLaunchedDestinations="0000000000000079,0000000000000115,000000000000018e,0000000000000167,0000000000000135,0000000000000139,0000000000000077,0000000000000118,00000000000000bb,0000000000000177,00000000000000c0,000000000000014d,000000000000017f,00000000000000c3,000000000000015c,00000000000000fb,0000000000000186,0000000000000157,00000000000000b6,0000000000000072,0000000000000144," ThreadID="1322639651557440362" Machine="0.0.0.0:0" LogGroup="default" Roles="TS" />
```
* How to track the lifecycle of a relocation attempt for balancing?
* First find the `TraceId` fields in `BgDDMountainChopper*` and `BgDDValleyFiller*`, which indicate that a relocation has been triggered.
* (Only when enabled) Find the `QueuedRelocation` event with the same `BeginPair` and `EndPair` as the original `TraceId`. This means the relocation request has been queued.
* Find the `RelocateShard` event whose `BeginPair` and `EndPair` fields are the same as the `TraceId`. This event means the relocation is ongoing. A minimal trace-scanning sketch is given after this list.
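For a quick manual check, a line-oriented scan of the trace files is often enough. The snippet below is an illustrative sketch only, not part of the code base: it assumes XML trace logs with one event per line under a hypothetical `trace_dir`, and a `TraceId` value copied from one of the `BgDD*` events above.
```python
# Sketch only: scan FDB XML trace files for the events belonging to one
# relocation attempt. Assumes one trace event per line; the directory and
# the TraceId value used below are hypothetical placeholders.
import glob

RELOCATION_EVENTS = ("BgDDMountainChopper", "BgDDValleyFiller",
                     "QueuedRelocation", "RelocateShard")

def relocation_lifecycle(trace_dir, trace_id):
    for path in sorted(glob.glob(f"{trace_dir}/trace.*.xml")):
        with open(path) as trace_file:
            for line in trace_file:
                # QueuedRelocation / RelocateShard reference the id through
                # their BeginPair / EndPair fields, so a plain substring
                # match is enough for a first pass.
                if trace_id in line and any(e in line for e in RELOCATION_EVENTS):
                    yield line.strip()

for event_line in relocation_lifecycle("/var/log/foundationdb", "0x1a2b3c4d"):
    print(event_line)
```
The same filtering can of course be done in whatever log aggregation system already ingests the trace events.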

420
design/dynamic-knobs.md Normal file
View File

@ -0,0 +1,420 @@
# Dynamic Knobs
This document is largely adapted from original design documents by Markus
Pilman and Trevor Clinkenbeard.
## Background
FoundationDB parameters control the behavior of the database, including whether
certain features are available and the value of internal constants. Parameters
will be referred to as knobs for the remainder of this document. Currently,
these knobs are configured through arguments passed to `fdbserver` processes,
often controlled by `fdbmonitor`. This has a number of problems:
1. Updating knobs involves updating `foundationdb.conf` files on each host in a
cluster. This has a lot of overhead and typically requires external tooling
for large scale changes.
2. All knob changes require a process restart.
3. We can't easily track the history of knob changes.
## Overview
The dynamic knobs project creates a strictly serializable quorum-based
configuration database stored on the coordinators. Each `fdbserver` process
specifies a configuration path and applies knob overrides from the
configuration database for its specified classes.
### Caveats
The configuration database explicitly does not support the following:
1. A high load. The update rate, while not specified, should be relatively low.
2. A large amount of data. The database is meant to be relatively small (under
one megabyte). Data is not sharded and every coordinator stores a complete
copy.
3. Concurrent writes. At most one write can succeed at a time, and clients must
retry their failed writes.
## Design
### Configuration Path
Each `fdbserver` process can now include a `--config_path` argument specifying
its configuration path. A configuration path is a hierarchical list of
configuration classes specifying which knob overrides the `fdbserver` process
should apply from the configuration database. For example:
```bash
$ fdbserver --config_path classA/classB/classC ...
```
Knob overrides follow descending priority:
1. Manually specified command line knobs.
2. Individual configuration class overrides.
* Subdirectories override parent directories. For example, if the
configuration path is `az-1/storage/gp3`, the `gp3` configuration takes
priority over the `storage` configuration, which takes priority over the
`az-1` configuration.
3. Global configuration knobs.
4. Default knob values.
#### Example
For example, imagine an `fdbserver` process run as follows:
```bash
$ fdbserver --datadir /mnt/fdb/storage/4500 --logdir /var/log/foundationdb --public_address auto:4500 --config_path az-1/storage/gp3 --knob_disable_asserts false
```
And the configuration database contains:
| ConfigClass | KnobName | KnobValue |
|-------------|---------------------|-----------|
| az-2 | page_cache_4k | 8e9 |
| storage | min_trace_severity | 20 |
| az-1 | compaction_interval | 280 |
| storage | compaction_interval | 350 |
| az-1 | disable_asserts | true |
| \<global\> | max_metric_size | 5000 |
| gp3 | max_metric_size | 1000 |
The final configuration for the process will be:
| KnobName | KnobValue | Explanation |
|---------------------|-------------|-------------|
| page_cache_4k | \<default\> | The configuration database knob override for `az-2` is ignored, so the compiled default is used |
| min_trace_severity | 20 | Because the `storage` configuration class is part of the process's configuration path, the corresponding knob override is applied from the configuration database |
| compaction_interval | 350 | The `storage` knob override takes precedence over the `az-1` knob override |
| disable_asserts | false | This knob is manually overridden, so all other overrides are ignored |
| max_metric_size | 1000 | Knob overrides for specific configuration classes take precedence over global knob overrides, so the global override is ignored |
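For illustration only, the resolution order can be sketched as a small function; this is not the `fdbserver` implementation, and all names below are made up. Running it with the tables above reproduces the final configuration (with `<default>` standing in for the compiled default).
```python
# Toy sketch of knob override resolution: defaults < global overrides <
# configuration classes in path order (later classes win) < manual knobs.
def resolve_knobs(defaults, global_overrides, class_overrides, config_path, manual_knobs):
    knobs = dict(defaults)
    knobs.update(global_overrides)
    for config_class in config_path.split("/"):   # e.g. az-1, then storage, then gp3
        knobs.update(class_overrides.get(config_class, {}))
    knobs.update(manual_knobs)                    # command line knobs always win
    return knobs

print(resolve_knobs(
    defaults={"page_cache_4k": "<default>"},
    global_overrides={"max_metric_size": "5000"},
    class_overrides={
        "az-2": {"page_cache_4k": "8e9"},
        "storage": {"min_trace_severity": "20", "compaction_interval": "350"},
        "az-1": {"compaction_interval": "280", "disable_asserts": "true"},
        "gp3": {"max_metric_size": "1000"},
    },
    config_path="az-1/storage/gp3",
    manual_knobs={"disable_asserts": "false"},
))
```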
### Clients
Clients can write to the configuration database using transactions.
Configuration database transactions are differentiated from regular
transactions through specification of the `USE_CONFIG_DATABASE` database
option.
In configuration transactions, the client uses the tuple layer to interact with
the configuration database. Keys are tuples of size two, where the first item
is the configuration class being written, and the second item is the knob name.
The value should be specified as a string. It will be converted to the
appropriate type based on the declared type of the knob being set.
Below is a sample Python script to write to the configuration database.
```python
import fdb

fdb.api_version(720)

@fdb.transactional
def set_knob(tr, knob_name, knob_value, config_class, description):
    tr['\xff\xff/description'] = description
    tr[fdb.tuple.pack((config_class, knob_name,))] = knob_value

# This function performs two knob changes transactionally.
@fdb.transactional
def set_multiple_knobs(tr):
    tr['\xff\xff/description'] = 'description'
    tr[fdb.tuple.pack((None, 'min_trace_severity',))] = '10'
    tr[fdb.tuple.pack(('az-1', 'min_trace_severity',))] = '20'

db = fdb.open()
db.options.set_use_config_database()

set_knob(db, 'min_trace_severity', '10', None, 'description')
set_knob(db, 'min_trace_severity', '20', 'az-1', 'description')
```
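Reads presumably use the same tuple-layer keys; the helper below is a hypothetical sketch (not part of the FoundationDB bindings) that reuses the `fdb` module and `db` handle from the sample above.
```python
# Hypothetical read helper, assuming the configuration database serves reads
# through the same (config_class, knob_name) tuple keys used for writes.
@fdb.transactional
def get_knob(tr, knob_name, config_class):
    value = tr[fdb.tuple.pack((config_class, knob_name,))]
    return value if value.present() else None

print(get_knob(db, 'min_trace_severity', 'az-1'))
```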
### Disable the Configuration Database
The configuration database includes both client and server changes and is
enabled by default. Thus, to disable the configuration database, changes must
be made to both.
#### Server
The configuration database can be disabled by specifying the ``fdbserver``
command line option ``--no-config-db``. Note that this option must be specified
for *every* ``fdbserver`` process.
#### Client
The only client-side change introduced by the configuration database is to the change
coordinators command, which is not considered
successful until the configuration database is readable on the new
coordinators. This will cause the change coordinators command to hang if run
against a database with dynamic knobs disabled. To disable the client side
configuration database liveness check, specify the ``--no-config-db`` flag when
changing coordinators. For example:
```
fdbcli> coordinators auto --no-config-db
```
## Status
The current state of the configuration database is output as part of `status
json`. The configuration path for each process can be determined from the
``command_line`` key associated with each process.
Sample from ``status json``:
```
"configuration_database" : {
"commits" : [
{
"description" : "set some knobs",
"timestamp" : 1659570000,
"version" : 1
},
{
"description" : "make some other changes",
"timestamp" : 1659570000,
"version" : 2
}
],
"last_compacted_version" : 0,
"most_recent_version" : 2,
"mutations" : [
{
"config_class" : "<global>",
"knob_name" : "min_trace_severity",
"knob_value" : "int:5",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"knob_value" : "double:30.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "az-1",
"knob_name" : "compaction_interval",
"knob_value" : "double:60.000000",
"type" : "set",
"version" : 1
},
{
"config_class" : "<global>",
"knob_name" : "compaction_interval",
"type" : "clear",
"version" : 2
},
{
"config_class" : "<global>",
"knob_name" : "update_node_timeout",
"knob_value" : "double:4.000000",
"type" : "set",
"version" : 2
}
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
```
After compaction, ``status json`` would show:
```
"configuration_database" : {
"commits" : [
],
"last_compacted_version" : 2,
"most_recent_version" : 2,
"mutations" : [
],
"snapshot" : {
"<global>" : {
"min_trace_severity" : "int:5",
"update_node_timeout" : "double:4.000000"
},
"az-1" : {
"compaction_interval" : "double:60.000000"
}
}
}
```
## Detailed Implementation
The configuration database is implemented as a replicated state machine living
on the coordinators. This allows configuration database transactions to
continue to function in the event of a catastrophic loss of the transaction
subsystem.
To commit a transaction, clients run the two-phase Paxos protocol. First, the
client asks for a live version from a quorum of coordinators. When a
coordinator receives a request for its live version, it increments its local
live version by one and returns it to the client. Then, the client submits its
writes at the live version it received in the previous step. A coordinator will
accept the commit if it is still on the same live version. If a majority of
coordinators accept the commit, it is considered committed.
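As a mental model only, that flow can be sketched with a toy in-memory stand-in for the coordinators. The real system exchanges ``ConfigTransactionInterface`` requests, and taking the largest live-version reply is an assumption made here for the sketch.
```python
# Toy model of the client-driven two-phase commit described above; not the
# actual ConfigNode implementation.
class ToyCoordinator:
    def __init__(self):
        self.live_version = 0
        self.log = []

    def get_live_version(self):
        # Phase 1: bump the local live version and hand it to the client.
        self.live_version += 1
        return self.live_version

    def try_commit(self, version, mutations):
        # Phase 2: accept only if still on the same live version.
        if version != self.live_version:
            return False
        self.log.append((version, mutations))
        return True

def commit(coordinators, mutations):
    quorum = len(coordinators) // 2 + 1
    # Ask a quorum for live versions; using the largest reply is an assumption.
    version = max(c.get_live_version() for c in coordinators[:quorum])
    accepts = sum(c.try_commit(version, mutations) for c in coordinators)
    return accepts >= quorum  # committed only with majority acceptance

nodes = [ToyCoordinator() for _ in range(3)]
print(commit(nodes, [("az-1", "min_trace_severity", "20")]))  # True
```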
### Coordinator
Each coordinator runs a ``ConfigNode`` which serves as a replica storing one
full copy of the configuration database. Coordinators never communicate with
other coordinators while processing configuration database transactions.
Instead, the client runs the transaction and determines when it has quorum
agreement.
Coordinators serve the following ``ConfigTransactionInterface`` to allow
clients to read from and write to the configuration database.
#### ``ConfigTransactionInterface``
| Request | Request fields | Reply fields | Explanation |
|------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| GetGeneration | (coordinatorsHash) | (generation) or (coordinators_changed error) | Get a new read version. This read version is used for all future requests in the transaction |
| Get | (configuration class, knob name, coordinatorsHash, generation) | (knob value or empty) or (coordinators_changed error) or (transaction_too_old error) | Returns the current value stored at the specified configuration class and knob name, or empty if no value exists |
| GetConfigClasses | (coordinatorsHash, generation) | (configuration classes) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all configuration classes stored in the configuration database |
| GetKnobs | (configuration class, coordinatorsHash, generation) | (knob names) or (coordinators_changed error) or (transaction_too_old error) | Returns a list of all knob names stored for the provided configuration class |
| Commit | (mutation list, coordinatorsHash, generation) | ack or (coordinators_changed error) or (commit_unknown_result error) or (not_committed error) | Commit mutations set by the transaction |
Coordinators also serve the following ``ConfigFollowerInterface`` to provide
access to (and modification of) their current state. Most interaction through
this interface is done by the cluster controller through its
``IConfigConsumer`` implementation living on the ``ConfigBroadcaster``.
#### ``ConfigFollowerInterface``
| Request | Request fields | Reply fields | Explanation |
|-----------------------|----------------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| GetChanges | (lastSeenVersion, mostRecentVersion) | (mutation list, version) or (version_already_compacted error) or (process_behind error) | Request changes since the last seen version, receive a new most recent version, as well as recent mutations |
| GetSnapshotAndChanges | (mostRecentVersion) | (snapshot, snapshotVersion, changes) | Request the full configuration database, in the form of a base snapshot and changes to apply on top of the snapshot |
| Compact | (version) | ack | Compact mutations up to the provided version |
| Rollforward | (rollbackTo, lastKnownCommitted, target, changes, specialZeroQuorum) | ack or (version_already_compacted error) or (transaction_too_old error) | Rollback/rollforward mutations on a node to catch it up with the majority |
| GetCommittedVersion | () | (registered, lastCompacted, lastLive, lastCommitted) | Request version information from a ``ConfigNode`` |
| Lock | (coordinatorsHash) | ack | Lock a ``ConfigNode`` to prevent it from serving requests during a coordinator change |
### Cluster Controller
The cluster controller runs a singleton ``ConfigBroadcaster`` which is
responsible for periodically polling the ``ConfigNode``s for updates, then
broadcasting these updates to workers through the ``ConfigBroadcastInterface``.
When workers join the cluster, they register themselves and their
``ConfigBroadcastInterface`` with the broadcaster. The broadcaster then pushes
new updates to registered workers.
The ``ConfigBroadcastInterface`` is also used by ``ConfigNode``s to register
with the ``ConfigBroadcaster``. ``ConfigNode``s need to register with the
broadcaster because the broadcaster decides when the ``ConfigNode`` may begin
serving requests, based on global information about the status of other
``ConfigNode``s. For example, if a system with three ``ConfigNode``s suffers a
fault where one ``ConfigNode`` loses data, the faulty ``ConfigNode`` should
not be allowed to begin serving requests again until it has been rolled forward
and is up to date with the latest state of the configuration database.
#### ``ConfigBroadcastInterface``
| Request | Request fields | Reply fields | Explanation |
|------------|------------------------------------------------------------|-------------------------------|---------------------------------------------------------------------------------------------|
| Snapshot | (snapshot, version, restartDelay) | ack | A snapshot of the configuration database sent by the broadcaster to workers |
| Changes | (changes, mostRecentVersion, restartDelay) | ack | A list of changes up to and including mostRecentVersion, sent by the broadcaster to workers |
| Registered | () | (registered, lastSeenVersion) | Sent by the broadcaster to new ``ConfigNode``s to determine their registration status |
| Ready | (snapshot, snapshotVersion, liveVersion, coordinatorsHash) | ack | Sent by the broadcaster to new ``ConfigNode``s to allow them to start serving requests |
### Worker
Each worker runs a ``LocalConfiguration`` instance which receives and applies
knob updates from the ``ConfigBroadcaster``. The local configuration maintains
a durable ``KeyValueStoreMemory`` containing the following:
* The latest known configuration version
* The most recently used configuration path
* All knob overrides corresponding to the configuration path at the latest known version
Once a worker starts, it will (see the sketch after this list):
* Apply manually set knobs
* Read its local configuration file
* If the stored configuration path does not match the configuration path
specified on the command line, delete the local configuration file
* Otherwise, apply knob updates from the local configuration file. Manually
specified knobs will not be overridden
* Register with the broadcaster to receive new updates for its configuration
classes
* Persist these updates when received and restart if necessary
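A rough sketch of that startup decision follows; it is not ``fdbserver`` code, and the data shapes are hypothetical.
```python
# Hypothetical sketch of the worker startup flow above. `stored` stands for the
# contents of the durable KeyValueStoreMemory, or None if there is no local file.
def startup_overrides(manual_knobs, config_path, stored):
    overrides = {}
    if stored is not None and stored["config_path"] == config_path:
        # Same configuration path: reuse the locally persisted knob overrides.
        overrides.update(stored["overrides"])
    # A mismatched path means the local file is discarded and the worker waits
    # for a fresh snapshot from the ConfigBroadcaster after registering.
    overrides.update(manual_knobs)  # manually specified knobs are never overridden
    return overrides

print(startup_overrides({"disable_asserts": "false"}, "az-1/storage/gp3",
                        {"config_path": "az-1/storage/gp3",
                         "overrides": {"min_trace_severity": "20"}}))
```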
### Knob Atomicity
All knobs are classified as either atomic or non-atomic. Atomic knobs require a
process restart when changed, while non-atomic knobs do not.
### Compaction
``ConfigNode``s store individual mutations in order to be able to update other,
out of date ``ConfigNode``s without needing to send a full snapshot. Each
configuration database commit also contains additional metadata such as a
timestamp and a text description of the changes being made. To keep the size of
the configuration database manageable, a compaction process runs periodically
(defaulting to every five minutes) which compacts individual mutations into a
simplified snapshot of key-value pairs. Compaction is controlled by the
``ConfigBroadcaster``, using information it periodically requests from
``ConfigNode``s. Compaction will only compact up to the minimum known version
across *all* ``ConfigNode``s. This means that if one ``ConfigNode`` is
permanently partitioned from the ``ConfigBroadcaster`` or from clients, no
compaction will ever take place.
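As a toy illustration of that rule (not the ``ConfigBroadcaster`` implementation), compaction can be modeled as folding every mutation at or below the minimum version known across all ``ConfigNode``s into the snapshot, leaving only newer mutations behind.
```python
# Toy model of compaction. Mutations use the same shape as the "mutations"
# array shown in the status json sample above.
def compact(snapshot, mutations, node_versions):
    target = min(node_versions)  # a single lagging ConfigNode holds everything back
    for m in (m for m in mutations if m["version"] <= target):
        knobs = snapshot.setdefault(m["config_class"], {})
        if m["type"] == "set":
            knobs[m["knob_name"]] = m["knob_value"]
        else:  # "clear"
            knobs.pop(m["knob_name"], None)
    remaining = [m for m in mutations if m["version"] > target]
    return snapshot, remaining, target

# With node_versions = [2, 2, 2], every mutation folds into the snapshot and the
# compacted version becomes 2, matching the post-compaction status output.
```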
### Rollback / Rollforward
It is necessary to be able to roll ``ConfigNode``s backward and forward with
respect to their committed versions due to the nature of quorum logic and
unreliable networks.
Consider a case where a client commit gets persisted durably on one out of
three ``ConfigNode``s (assume commit messages to the other two nodes are lost).
Since the value is not committed on a majority of ``ConfigNode``s, it cannot be
considered committed. But it is also incorrect to have the value persist on one
out of three nodes as future commits are made. In this case, the most common
result is that the ``ConfigNode`` will be rolled back when the next commit from
a different client is made, and then rolled forward to contain the data from
the commit. ``PaxosConfigConsumer`` contains logic to recognize ``ConfigNode``
minorities and update them to match the quorum.
### Changing Coordinators
Since the configuration database lives on the coordinators and the
[coordinators can be
changed](https://apple.github.io/foundationdb/configuration.html#configuration-changing-coordination-servers),
it is necessary to copy the configuration database from the old to the new
coordinators during such an event. A coordinator change performs the following
steps with regard to the configuration database:
1. Write ``\xff/coordinatorsKey`` with the new coordinators string. The key
``\xff/previousCoordinators`` contains the current (old) set of
coordinators.
2. Lock the old ``ConfigNode``s so they can no longer serve client requests.
3. Start a recovery, causing a new cluster controller (and therefore
``ConfigBroadcaster``) to be selected.
4. Read ``\xff/previousCoordinators`` on the ``ConfigBroadcaster`` and, if
present, read an up-to-date snapshot of the configuration database on the
old coordinators.
5. Determine if each registering ``ConfigNode`` needs an up-to-date snapshot of
the configuration database sent to it, based on its reported version and the
snapshot version of the database received from the old coordinators.
* Some new coordinators which were also coordinators in the previous
configuration may not need a snapshot.
6. Send ready requests to new ``ConfigNode``s, including an up-to-date snapshot
if necessary. This allows the new coordinators to begin serving
configuration database requests from clients.
## Testing
The ``ConfigDatabaseUnitTests`` class unit tests a number of different
configuration database dimensions.
The ``ConfigIncrement`` workload tests contention between clients attempting to
write to the configuration database, paired with machine failure and
coordinator changes.

View File

@ -125,6 +125,3 @@ In each test, the `GlobalTagThrottlerTesting::monitor` function is used to perio
On the ratekeeper, every `SERVER_KNOBS->TAG_THROTTLE_PUSH_INTERVAL` seconds, `GlobalTagThrottler::getClientRates` is called. At the end of the rate calculation for each tag, a trace event of type `GlobalTagThrottler_GotClientRate` is produced. This trace event reports the relevant inputs that went into the rate calculation and can be used for debugging.
On storage servers, every `SERVER_KNOBS->TAG_MEASUREMENT_INTERVAL` seconds, there are `BusyReadTag` events for every tag that has sufficient read cost to be reported to the ratekeeper. Both cost and fractional busyness are reported.
### Status
For each storage server, the busiest read tag is reported in the full status output, along with its cost and fractional busyness.

View File

@ -14,8 +14,12 @@ Detailed FoundationDB Architecture
The FoundationDB architecture chooses a decoupled design, where
processes are assigned different heterogeneous roles (e.g.,
Coordinators, Storage Servers, Master). Scaling the database is achieved
by horizontally expanding the number of processes for separate roles:
Coordinators, Storage Servers, Master). The cluster attempts to recruit
different roles as separate processes; however, it is possible that
multiple stateless roles get colocated (recruited) on a single
process to meet the cluster recruitment goals. Scaling the database
is achieved by horizontally expanding the number of processes for
separate roles:
Coordinators
~~~~~~~~~~~~

View File

@ -373,3 +373,302 @@ with the ``multitest`` role:
fdbserver -r multitest -f testfile.txt
This command will block until all tests are completed.
##########
API Tester
##########
Introduction
============
The API tester is a framework for implementing end-to-end tests of the FDB C API, i.e. testing the API on a real
FDB cluster through all layers of the FDB client. Its executable is ``fdb_c_api_tester``, and the source
code is located in ``bindings/c/test/apitester``. The structure of API Tests is similar to that of the
Simulation Tests. The tests are implemented as workloads using FDB API, which are all built into the
``fdb_c_api_tester``. A concrete test configuration is defined as a TOML file, which specifies the
combination of workloads to be executed by the test together with their parameters. The test can then be
executed by passing the TOML file as a parameter to ``fdb_c_api_tester``.
Since simulation tests rely on the actor model to execute the tests deterministically in single-threaded
mode, they are not suitable for testing various multi-threaded aspects of the FDB client. End-to-end API
tests complement the simulation tests by testing the FDB Client layers above the single-threaded Native
Client.
The specific testing goals of the end-to-end tests are:
- Check functional correctness of the Multi-Version Client (MVC) and Thread-Safe Client
- Detecting race conditions. They can be caused by accessing the state of the Native Client from wrong
threads or introducing other shared state without proper synchronization
- Detecting memory management errors. Thread-safe reference counting must be used where necessary. MVC
works with multiple client libraries. Memory allocated by one client library must also be deallocated
by the same library.
- Maintaining interoperability with other client versions. The client functionality is made available
depending on the selected API version. The API changes are correctly adapted.
- Client API behaves correctly in case of cluster upgrades. Database and transaction state is correctly
migrated to the upgraded connections. Pending operations are canceled and successfully retried on the
upgraded connections.
Implementing a Workload
=======================
Each workload is declared as a direct or indirect subclass of ``WorkloadBase`` implementing a constructor
with ``WorkloadConfig`` as a parameter and the method ``start()``, which defines the entry point of the
workload.
``WorkloadBase`` provides a set of methods that serve as building blocks for implementation of a workload:
.. function:: execTransaction(start, cont, failOnError = true)
creates and executes an FDB transaction. Here ``start`` is a function that takes a transaction context
as parameter and implements the starting point of the transaction, and ``cont`` is a function implementing
a continuation to be executed after finishing the transaction execution. Transactions are automatically
retried on retryable errors. Transactions are retried by calling the ``start`` function again. In case
of a fatal error, the entire workload is considered failed unless ``failOnError`` is set to ``false``.
.. function:: schedule(task)
schedules a task for asynchronous execution. It is usually used in the continuations to schedule
the next step of the workload.
.. function:: info(msg)
error(msg)
are used for logging a message with a tag identifying the workload. Issuing an error message marks
the workload as failed.
The transaction context provides methods for implementing the transaction logic:
.. function:: tx()
the reference to the FDB transaction object
.. function:: continueAfter(future, cont, retryOnError = true)
set a continuation to be executed when the future is ready. The ``retryOnError`` flag controls whether
the transaction should be automatically retried in case the future results in a retriable error.
.. function:: continueAfterAll(futures, cont)
takes a vector of futures and sets a continuation to be executed when all of the futures get ready.
The transaction is retried if at least one of the futures results in an error. This method is useful
for handling multiple concurrent reads.
.. function:: commit()
commit and finish the transaction. If the commit is successful, the execution proceeds to the
continuation of ``execTransaction()``. In case of a retriable error the transaction is
automatically retried. A fatal error results in a failure of the workload.
.. function:: done()
finish the transaction without committing. This method should be used to finish read transactions.
The transaction gets destroyed and execution proceeds to the continuation of ``execTransaction()``.
Each transaction must be finished either by ``commit()`` or ``done()``, because otherwise
the framework considers that the transaction is still being executed, so it won't destroy it and
won't call the continuation.
.. function:: onError(err)
Handle an error: restart the transaction in case of a retriable error, otherwise fail the workload.
This method is typically used in the continuation of ``continueAfter`` called with
``retryOnError=false`` as a fallback to the default error handling.
A workload execution ends automatically when it is marked as failed or its last continuation does not
schedule any new task or transaction.
The workload class should be defined in the namespace FdbApiTester. The file name convention is
``Tester{Name}Workload.cpp`` so that we distinguish them from the source files of simulation workloads.
Basic Workload Example
======================
The code below implements a workload that consists of only two transactions. The first one sets a
randomly generated key to a randomly generated value, and the second one reads the key and checks if
the returned value matches the written one.
.. literalinclude:: ../../../bindings/c/test/apitester/TesterExampleWorkload.cpp
:language: C++
:lines: 21-
The workload is implemented in the method ``setAndGet``. It generates a random key and a random value
and executes a transaction that writes that key-value pair and commits. In the continuation of the
first ``execTransaction`` call, we execute the second transaction that reads the same key. The read
operation returns a future. So we call ``continueAfter`` to set a continuation for that future. In the
continuation we check if the returned value matches the written one and finish the transaction by
calling ``ctx->done()``. After completing the second transaction we execute the continuation passed
as a parameter to the ``setAndGet`` method by the ``start`` method. In this case it is ``NO_OP_TASK``, which
does nothing and so finishes the workload.
Finally, we declare an instance of ``WorkloadFactory`` to register this workload with the name ``SetAndGet``.
Note that we use ``workloadId`` as a key prefix. This is necessary for isolating the key space of this
workload, because the framework may be instructed to create multiple instances of the ``SetAndGet``
workload. If we do not isolate the key space, another workload can write a different value for the
same key and so break the assumption of the test.
The workload is implemented using the internal C++ API, implemented in ``fdb_api.hpp``. It introduces
a set of classes representing the FDB objects (transactions, futures, etc.). These classes provide C++-style
methods wrapping FDB C API calls and automate memory management by means of reference counting.
Implementing Control Structures
===============================
Our basic workload executes just 2 transactions, but in practice we want to have workloads that generate
multiple transactions. The following code demonstrates how we can modify our basic workload to generate
multiple transactions in a loop.
.. code-block:: C++
    class SetAndGetWorkload : public WorkloadBase {
    public:
        ...
        int numIterations;
        int iterationsLeft;

        SetAndGetWorkload(const WorkloadConfig& config) : WorkloadBase(config) {
            keyPrefix = fdb::toBytesRef(fmt::format("{}/", workloadId));
            numIterations = config.getIntOption("numIterations", 1000);
        }

        void start() override {
            iterationsLeft = numIterations;
            setAndGetLoop();
        }

        void setAndGetLoop() {
            if (iterationsLeft == 0) {
                return;
            }
            iterationsLeft--;
            setAndGet([this]() { setAndGetLoop(); });
        }
        ...
    };
We introduce a workload parameter ``numIterations`` to specify the number of iterations. If not specified
in the test configuration it defaults to 1000.
The method ``setAndGetLoop`` implements the loop that decrements the ``iterationsLeft`` counter until it reaches 0;
each iteration calls ``setAndGet`` with a continuation that returns execution to the loop. As you
can see, we don't need any change in ``setAndGet``; we just call it with another continuation.
The pattern of passing a continuation as a parameter can also be used to decompose the workload into a
sequence of steps. For example, we can introduce ``setup`` and ``cleanup`` steps to our workload and modify the
``setAndGetLoop`` to make it composable with an arbitrary continuation:
.. code-block:: C++
    void start() override {
        setup([this]() {
            iterationsLeft = numIterations;
            setAndGetLoop([this]() {
                cleanup(NO_OP_TASK);
            });
        });
    }

    void setAndGetLoop(TTaskFct cont) {
        if (iterationsLeft == 0) {
            schedule(cont);
            return;
        }
        iterationsLeft--;
        setAndGet([this, cont]() { setAndGetLoop(cont); });
    }

    void setup(TTaskFct cont) { ... }

    void cleanup(TTaskFct cont) { ... }
Note that we call ``schedule(cont)`` in ``setAndGetLoop`` instead of calling the continuation directly.
In this way we avoid keeping ``setAndGetLoop`` in the call stack, when executing the next step.
Subclassing ApiWorkload
=======================
``ApiWorkload`` is an abstract subclass of ``WorkloadBase`` that provides a framework for a typical
implementation of API test workloads. It implements a workflow consisting of cleaning up the key space
of the workload, populating it with newly generated data and then running a loop consisting of random
database operations. The concrete subclasses of ``ApiWorkload`` are expected to override the method
``randomOperation`` with an implementation of concrete random operations.
The ``ApiWorkload`` maintains a local key-value store that mirrors the part of the database state
relevant to the workload. A successful database write operation should be followed by a continuation
that performs equivalent changes in the local store, and the results of a database read operation should
be validated against the values from the local store.
Test Configuration
==================
A concrete test configuration is specified by a TOML file. The file must contain one ``[[test]]`` section
specifying the general settings for test execution followed by one or more ``[[test.workload]]``
configuration sections, specifying the workloads to be executed and their parameters. The specified
workloads are started all at once and executed concurrently.
The ``[[test]]`` section can contain the following options:
- ``title``: descriptive title of the test
- ``multiThreaded``: enable multi-threading (default: false)
- ``minFdbThreads`` and ``maxFdbThreads``: the number of FDB (network) threads to be randomly selected
from the given range (default: 1-1). Used only if ``multiThreaded=true``. It is also important to use
multiple database instances to make use of the multithreading.
- ``minDatabases`` and ``maxDatabases``: the number of database instances to be randomly selected from
the given range (default 1-1). The transactions of all workloads are randomly load-balanced over the
pool of database instances.
- ``minClients`` and ``maxClients``: the number of clients, i.e. instances of each workload, to be
randomly selected from the given range (default 1-8).
- ``minClientThreads`` and ``maxClientThreads``: the number of client threads, i.e. the threads used
for execution of the workload, to be randomly selected from the given range (default 1-1).
- ``blockOnFutures``: use blocking waits on futures instead of scheduling future callbacks asynchronously
(default: false)
- ``buggify``: Enable client-side failure injection (default: false)
- ``databasePerTransaction``: Create a separate database instance for each transaction (default: false).
It is a special mode useful for testing bugs related to creation and destruction of database instances.
- ``fdbCallbacksOnExternalThreads``: Enables the option ``FDB_NET_OPTION_CALLBACKS_ON_EXTERNAL_THREADS``
causing the callbacks of futures to be executed directly on the threads of the external FDB clients
rather than on the thread of the local FDB client.
The workload section ``[[test.workload]]`` must contain the attribute name matching the registered name
of the workload to be executed. Other options are workload-specific.
The subclasses of the ``ApiWorkload`` inherit the following configuration options:
- ``minKeyLength`` and ``maxKeyLength``: the size range of randomly generated keys (default: 1-64)
- ``minValueLength`` and ``maxValueLength``: the size range of randomly generated values
(default: 1-1000)
- ``maxKeysPerTransaction``: the maximum number of keys per transaction (default: 50)
- ``initialSize``: the number of key-value pairs in the initially populated database (default: 1000)
- ``readExistingKeysRatio``: the probability of choosing an existing key for read operations
(default: 0.9)
- ``numRandomOperations``: the number of random operations to be executed per workload (default: 1000)
- ``runUntilStop``: run the workload indefinitely until the stop command is received (default: false).
This execution mode is used in upgrade tests and other scripted tests, where the workload needs to
be generated continuously until completion of the scripted test.
- ``numOperationsForProgressCheck``: the number of operations to be performed to confirm a progress
check (default: 10). This option is used in combination with ``runUntilStop``. Progress checks are
initiated by a test script to check if the client workload is successfully progressing after a
cluster change.
Executing the Tests
===================
The ``fdb_c_api_tester`` executable takes a single TOML file as a parameter and executes the test
according to its specification. Before that, we must create an FDB cluster and pass its cluster file as
a parameter to ``fdb_c_api_tester``. Note that multithreaded tests also need to be provided with an
external client library.
For example, we can create a temporary cluster and use it for execution of one of the existing API tests:
.. code-block:: bash
${srcDir}/tests/TestRunner/tmp_cluster.py --build-dir ${buildDir} -- \
${buildDir}/bin/fdb_c_api_tester \
--cluster-file @CLUSTER_FILE@ \
--external-client-library=${buildDir}/bindings/c/libfdb_c_external.so \
--test-file ${srcDir}/bindings/c/test/apitester/tests/CApiCorrectnessMultiThr.toml
The test specifications added to the ``bindings/c/test/apitester/tests/`` directory are executed as a part
of the regression test suite. They can be executed using the ``ctest`` target ``fdb_c_api_tests``:
.. code-block:: bash
ctest -R fdb_c_api_tests -VV

View File

@ -416,6 +416,9 @@ FoundationDB will never use processes on the same machine for the replication of
``three_data_hall`` mode
FoundationDB stores data in triplicate, with one copy on a storage server in each of three data halls. The transaction logs are replicated four times, with two data halls containing two replicas apiece. Four available machines (two in each of two data halls) are therefore required to make progress. This configuration enables the cluster to remain available after losing a single data hall and one machine in another data hall.
``three_data_hall_fallback`` mode
FoundationDB stores data in duplicate, with one copy each on a storage server in two of three data halls. The transaction logs are replicated four times, with two data halls containing two replicas apiece. Four available machines (two in each of two data halls) are therefore required to make progress. This configuration is similar to ``three_data_hall``, differing only in that data is stored on two instead of three replicas. This configuration is useful to unblock data distribution when a data hall becomes temporarily unavailable. Because ``three_data_hall_fallback`` reduces the redundancy level to two, it should only be used as a temporary measure to restore cluster health during a datacenter outage.
Datacenter-aware mode
---------------------

View File

@ -379,7 +379,9 @@
"log_server_min_free_space",
"log_server_min_free_space_ratio",
"storage_server_durability_lag",
"storage_server_list_fetch_failed"
"storage_server_list_fetch_failed",
"blob_worker_lag",
"blob_worker_missing"
]
},
"description":"The database is not being saturated by the workload."
@ -400,7 +402,9 @@
"log_server_min_free_space",
"log_server_min_free_space_ratio",
"storage_server_durability_lag",
"storage_server_list_fetch_failed"
"storage_server_list_fetch_failed",
"blob_worker_lag",
"blob_worker_missing"
]
},
"description":"The database is not being saturated by the workload."
@ -599,7 +603,7 @@
"counter":0,
"roughness":0.0
},
"memory_errors":{ // measures number of proxy_memory_limit_exceeded errors
"memory_errors":{ // measures number of (commit/grv)_proxy_memory_limit_exceeded errors
"hz":0.0,
"counter":0,
"roughness":0.0

View File

@ -131,6 +131,9 @@ min_free_space_ratio Running out of space (approaching 5% limit).
log_server_min_free_space Log server running out of space (approaching 100MB limit).
log_server_min_free_space_ratio Log server running out of space (approaching 5% limit).
storage_server_durability_lag Storage server durable version falling behind.
storage_server_list_fetch_failed Unable to fetch storage server list.
blob_worker_lag Blob worker granule version falling behind.
blob_worker_missing No blob workers are reporting metrics.
=================================== ====================================================
The JSON path ``cluster.qos.throttled_tags``, when it exists, is an Object containing ``"auto"`` , ``"manual"`` and ``"recommended"``. The possible fields for those object are in the following table:

View File

@ -2,6 +2,30 @@
Release Notes
#############
7.1.21
======
* Same as 7.1.20 release with AVX enabled.
7.1.20
======
* Released with AVX disabled.
* Fixed missing localities for fdbserver that can cause cross DC calls among storage servers. `(PR #7995) <https://github.com/apple/foundationdb/pull/7995>`_
* Removed extremely spammy trace event in FetchKeys and fixed transaction_profiling_analyzer.py. `(PR #7934) <https://github.com/apple/foundationdb/pull/7934>`_
* Fixed bugs when GRV proxy returns an error. `(PR #7860) <https://github.com/apple/foundationdb/pull/7860>`_
7.1.19
======
* Same as 7.1.18 release with AVX enabled.
7.1.18
======
* Released with AVX disabled.
* Added knobs for the minimum and the maximum of the Ratekeeper's default priority. `(PR #7820) <https://github.com/apple/foundationdb/pull/7820>`_
* Fixed bugs in ``getRange`` of the special key space. `(PR #7778) <https://github.com/apple/foundationdb/pull/7778>`_, `(PR #7720) <https://github.com/apple/foundationdb/pull/7720>`_
* Added debug ID for secondary queries in index prefetching. `(PR #7755) <https://github.com/apple/foundationdb/pull/7755>`_
* Changed hostname resolving to prefer IPv6 addresses. `(PR #7750) <https://github.com/apple/foundationdb/pull/7750>`_
* Added more transaction debug events for prefetch queries. `(PR #7732) <https://github.com/apple/foundationdb/pull/7732>`_
7.1.17
======
* Same as 7.1.16 release with AVX enabled.
@ -15,7 +39,7 @@ Release Notes
* Fixed ScopeEventFieldTypeMismatch error for TLogMetrics. `(PR #7640) <https://github.com/apple/foundationdb/pull/7640>`_
* Added getMappedRange latency metrics. `(PR #7632) <https://github.com/apple/foundationdb/pull/7632>`_
* Fixed a version vector performance bug due to not updating client side tag cache. `(PR #7616) <https://github.com/apple/foundationdb/pull/7616>`_
* Fixed DiskReadSeconds and DiskWriteSeconds calculaion in ProcessMetrics. `(PR #7609) <https://github.com/apple/foundationdb/pull/7609>`_
* Fixed DiskReadSeconds and DiskWriteSeconds calculation in ProcessMetrics. `(PR #7609) <https://github.com/apple/foundationdb/pull/7609>`_
* Added Rocksdb compression and data size stats. `(PR #7596) <https://github.com/apple/foundationdb/pull/7596>`_
7.1.15
@ -74,7 +98,7 @@ Release Notes
* Added support of the reboot command in go bindings. `(PR #7270) <https://github.com/apple/foundationdb/pull/7270>`_
* Fixed several issues in profiling special keys using GlobalConfig. `(PR #7120) <https://github.com/apple/foundationdb/pull/7120>`_
* Fixed a stuck transaction system bug due to inconsistent recovery transaction version. `(PR #7261) <https://github.com/apple/foundationdb/pull/7261>`_
* Fixed a unknown_error crash due to not resolving hostnames. `(PR #7254) <https://github.com/apple/foundationdb/pull/7254>`_
* Fixed an unknown_error crash due to not resolving hostnames. `(PR #7254) <https://github.com/apple/foundationdb/pull/7254>`_
* Fixed a heap-use-after-free bug. `(PR #7250) <https://github.com/apple/foundationdb/pull/7250>`_
* Fixed a performance issue that remote TLogs are sending too many pops to log routers. `(PR #7235) <https://github.com/apple/foundationdb/pull/7235>`_
* Fixed an issue that SharedTLogs are not displaced and leaking disk space. `(PR #7246) <https://github.com/apple/foundationdb/pull/7246>`_

View File

@ -22,6 +22,8 @@ Each special key that existed before api version 630 is its own module. These ar
#. ``\xff\xff/cluster_file_path`` - See :ref:`cluster file client access <cluster-file-client-access>`
#. ``\xff\xff/status/json`` - See :doc:`Machine-readable status <mr-status>`
#. ``\xff\xff/worker_interfaces`` - key as the worker's network address and value as the serialized ClientWorkerInterface, not transactional
Prior to api version 630, it was also possible to read a range starting at ``\xff\xff/worker_interfaces``. This is mostly an implementation detail of fdbcli,
but it's available in api version 630 as a module with prefix ``\xff\xff/worker_interfaces/``.
@ -210,6 +212,7 @@ that process, and wait for necessary data to be moved away.
#. ``\xff\xff/management/options/failed_locality/force`` Read/write. Setting this key disables safety checks for writes to ``\xff\xff/management/failed_locality/<locality>``. Setting this key only has an effect in the current transaction and is not persisted on commit.
#. ``\xff\xff/management/tenant/map/<tenant>`` Read/write. Setting a key in this range to any value will result in a tenant being created with name ``<tenant>``. Clearing a key in this range will delete the tenant with name ``<tenant>``. Reading all or a portion of this range will return the list of tenants currently present in the cluster, excluding any changes in this transaction. Values read in this range will be JSON objects containing the metadata for the associated tenants.
#. ``\xff\xff/management/tenant/rename/<tenant>`` Read/write. Setting a key in this range to an unused tenant name will result in the tenant with the name ``<tenant>`` being renamed to the value provided. If the rename operation is a transaction retried in a loop, it is possible for the rename to be applied twice, in which case ``tenant_not_found`` or ``tenant_already_exists`` errors may be returned. This can be avoided by checking for the tenant's existence first.
#. ``\xff\xff/management/options/worker_interfaces/verify`` Read/write. Setting this key will add a verification phase in reading ``\xff\xff/worker_interfaces``. Setting this key only has an effect in the current transaction and is not persisted on commit. Try to establish connections with every worker from the list returned by Cluster Controller and only return those workers that the client can connect to. This option is now only used in fdbcli commands ``kill``, ``suspend`` and ``expensive_data_check`` to populate the worker list.
An exclusion is syntactically either an ip address (e.g. ``127.0.0.1``), or
an ip address and port (e.g. ``127.0.0.1:4500``) or any locality (e.g ``locality_dcid:primary-satellite`` or

View File

@ -49,7 +49,7 @@ All operations performed within a tenant transaction will occur within the tenan
Raw access
----------
When operating in the tenant mode ``required_experimental``, transactions are not ordinarily permitted to run without using a tenant. In order to access the system keys or perform maintenance operations that span multiple tenants, it is required to use the ``RAW_ACCESS`` transaction option to access the global key-space. It is an error to specify ``RAW_ACCESS`` on a transaction that is configured to use a tenant.
When operating in the tenant mode ``required_experimental`` or using a metacluster, transactions are not ordinarily permitted to run without using a tenant. In order to access the system keys or perform maintenance operations that span multiple tenants, it is required to use the ``RAW_ACCESS`` transaction option to access the global key-space. It is an error to specify ``RAW_ACCESS`` on a transaction that is configured to use a tenant.
.. note :: Setting the ``READ_SYSTEM_KEYS`` or ``ACCESS_SYSTEM_KEYS`` options implies ``RAW_ACCESS`` for your transaction.

View File

@ -928,7 +928,7 @@ void parentWatcher(void* parentHandle) {
static void printVersion() {
printf("FoundationDB " FDB_VT_PACKAGE_NAME " (v" FDB_VT_VERSION ")\n");
printf("source version %s\n", getSourceVersion());
printf("protocol %llx\n", (long long)currentProtocolVersion.version());
printf("protocol %llx\n", (long long)currentProtocolVersion().version());
}
static void printBuildInformation() {

View File

@ -23,6 +23,7 @@
#include "fdbclient/FDBOptions.g.h"
#include "fdbclient/IClientApi.h"
#include "fdbclient/ManagementAPI.actor.h"
#include "fdbclient/NativeAPI.actor.h"
#include "flow/Arena.h"
#include "flow/FastRef.h"
@ -31,33 +32,6 @@
namespace {
// copy to standalones for krm
ACTOR Future<Void> setBlobRange(Database db, Key startKey, Key endKey, Value value) {
state Reference<ReadYourWritesTransaction> tr = makeReference<ReadYourWritesTransaction>(db);
loop {
try {
tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
tr->setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE);
// FIXME: check that the set range is currently inactive, and that a revoked range is currently its own
// range in the map and fully set.
tr->set(blobRangeChangeKey, deterministicRandom()->randomUniqueID().toString());
// This is not coalescing because we want to keep each range logically separate.
wait(krmSetRange(tr, blobRangeKeys.begin, KeyRange(KeyRangeRef(startKey, endKey)), value));
wait(tr->commit());
printf("Successfully updated blob range [%s - %s) to %s\n",
startKey.printable().c_str(),
endKey.printable().c_str(),
value.printable().c_str());
return Void();
} catch (Error& e) {
wait(tr->onError(e));
}
}
}
ACTOR Future<Version> getLatestReadVersion(Database db) {
state Transaction tr(db);
loop {
@ -78,7 +52,7 @@ ACTOR Future<Void> printAfterDelay(double delaySeconds, std::string message) {
return Void();
}
ACTOR Future<Void> doBlobPurge(Database db, Key startKey, Key endKey, Optional<Version> version) {
ACTOR Future<Void> doBlobPurge(Database db, Key startKey, Key endKey, Optional<Version> version, bool force) {
state Version purgeVersion;
if (version.present()) {
purgeVersion = version.get();
@ -86,7 +60,7 @@ ACTOR Future<Void> doBlobPurge(Database db, Key startKey, Key endKey, Optional<V
wait(store(purgeVersion, getLatestReadVersion(db)));
}
state Key purgeKey = wait(db->purgeBlobGranules(KeyRange(KeyRangeRef(startKey, endKey)), purgeVersion, {}));
state Key purgeKey = wait(db->purgeBlobGranules(KeyRange(KeyRangeRef(startKey, endKey)), purgeVersion, {}, force));
fmt::print("Blob purge registered for [{0} - {1}) @ {2}\n", startKey.printable(), endKey.printable(), purgeVersion);
@ -99,65 +73,10 @@ ACTOR Future<Void> doBlobPurge(Database db, Key startKey, Key endKey, Optional<V
return Void();
}
ACTOR Future<Version> checkBlobSubrange(Database db, KeyRange keyRange, Optional<Version> version) {
state Transaction tr(db);
state Version readVersionOut = invalidVersion;
loop {
try {
wait(success(tr.readBlobGranules(keyRange, 0, version, &readVersionOut)));
return readVersionOut;
} catch (Error& e) {
wait(tr.onError(e));
}
}
}
ACTOR Future<Void> doBlobCheck(Database db, Key startKey, Key endKey, Optional<Version> version) {
state Transaction tr(db);
state Version readVersionOut = invalidVersion;
state double elapsed = -timer_monotonic();
state KeyRange range = KeyRange(KeyRangeRef(startKey, endKey));
state Standalone<VectorRef<KeyRangeRef>> allRanges;
loop {
try {
wait(store(allRanges, tr.getBlobGranuleRanges(range)));
break;
} catch (Error& e) {
wait(tr.onError(e));
}
}
if (allRanges.empty()) {
fmt::print("ERROR: No blob ranges for [{0} - {1})\n", startKey.printable(), endKey.printable());
return Void();
}
fmt::print("Loaded {0} blob ranges to check\n", allRanges.size());
state std::vector<Future<Version>> checkParts;
// chunk up to smaller ranges than max
int maxChunkSize = 1000;
KeyRange currentChunk;
int currentChunkSize = 0;
for (auto& it : allRanges) {
if (currentChunkSize == maxChunkSize) {
checkParts.push_back(checkBlobSubrange(db, currentChunk, version));
currentChunkSize = 0;
}
if (currentChunkSize == 0) {
currentChunk = it;
} else if (it.begin != currentChunk.end) {
fmt::print("ERROR: Blobrange check failed, gap in blob ranges from [{0} - {1})\n",
currentChunk.end.printable(),
it.begin.printable());
return Void();
} else {
currentChunk = KeyRangeRef(currentChunk.begin, it.end);
}
currentChunkSize++;
}
checkParts.push_back(checkBlobSubrange(db, currentChunk, version));
wait(waitForAll(checkParts));
readVersionOut = checkParts.back().get();
state Version readVersionOut = wait(db->verifyBlobRange(KeyRangeRef(startKey, endKey), version));
elapsed += timer_monotonic();
@ -201,7 +120,7 @@ ACTOR Future<bool> blobRangeCommandActor(Database localDb,
fmt::print("Invalid blob range [{0} - {1})\n", tokens[2].printable(), tokens[3].printable());
} else {
if (tokencmp(tokens[1], "start") || tokencmp(tokens[1], "stop")) {
bool starting = tokencmp(tokens[1], "start");
state bool starting = tokencmp(tokens[1], "start");
if (tokens.size() > 4) {
printUsage(tokens[0]);
return false;
@ -210,9 +129,22 @@ ACTOR Future<bool> blobRangeCommandActor(Database localDb,
starting ? "Starting" : "Stopping",
tokens[2].printable().c_str(),
tokens[3].printable().c_str());
wait(setBlobRange(localDb, begin, end, starting ? LiteralStringRef("1") : StringRef()));
} else if (tokencmp(tokens[1], "purge") || tokencmp(tokens[1], "check")) {
bool purge = tokencmp(tokens[1], "purge");
state bool success = false;
if (starting) {
wait(store(success, localDb->blobbifyRange(KeyRangeRef(begin, end))));
} else {
wait(store(success, localDb->unblobbifyRange(KeyRangeRef(begin, end))));
}
if (!success) {
fmt::print("{0} blobbify range for [{1} - {2}) failed\n",
starting ? "Starting" : "Stopping",
tokens[2].printable().c_str(),
tokens[3].printable().c_str());
}
return success;
} else if (tokencmp(tokens[1], "purge") || tokencmp(tokens[1], "forcepurge") || tokencmp(tokens[1], "check")) {
bool purge = tokencmp(tokens[1], "purge") || tokencmp(tokens[1], "forcepurge");
bool forcePurge = tokencmp(tokens[1], "forcepurge");
Optional<Version> version;
if (tokens.size() > 4) {
@ -225,17 +157,18 @@ ACTOR Future<bool> blobRangeCommandActor(Database localDb,
version = v;
}
fmt::print("{0} blob range [{1} - {2})",
fmt::print("{0} blob range [{1} - {2}){3}",
purge ? "Purging" : "Checking",
tokens[2].printable(),
tokens[3].printable());
tokens[3].printable(),
forcePurge ? " (force)" : "");
if (version.present()) {
fmt::print(" @ {0}", version.get());
}
fmt::print("\n");
if (purge) {
wait(doBlobPurge(localDb, begin, end, version));
wait(doBlobPurge(localDb, begin, end, version, forcePurge));
} else {
wait(doBlobCheck(localDb, begin, end, version));
}
@ -247,8 +180,7 @@ ACTOR Future<bool> blobRangeCommandActor(Database localDb,
return true;
}
CommandFactory blobRangeFactory("blobrange",
CommandHelp("blobrange <start|stop|purge|check> <startkey> <endkey> [version]",
"",
""));
CommandFactory blobRangeFactory(
"blobrange",
CommandHelp("blobrange <start|stop|check|purge|forcepurge> <startkey> <endkey> [version]", "", ""));
} // namespace fdb_cli

View File

@ -272,6 +272,10 @@ ACTOR Future<bool> configureCommandActor(Reference<IDatabase> db,
stderr,
"WARN: Sharded RocksDB storage engine type is still in experimental stage, not yet production tested.\n");
break;
case ConfigurationResult::DATABASE_IS_REGISTERED:
fprintf(stderr, "ERROR: A cluster cannot change its tenant mode while part of a metacluster.\n");
ret = false;
break;
default:
ASSERT(false);
ret = false;

View File

@ -46,7 +46,7 @@ ACTOR Future<bool> expensiveDataCheckCommandActor(
if (tokens.size() == 1) {
// initialize worker interfaces
address_interface->clear();
wait(getWorkerInterfaces(tr, address_interface));
wait(getWorkerInterfaces(tr, address_interface, true));
}
if (tokens.size() == 1 || tokencmp(tokens[1], "list")) {
if (address_interface->size() == 0) {

View File

@ -44,7 +44,7 @@ ACTOR Future<bool> killCommandActor(Reference<IDatabase> db,
if (tokens.size() == 1) {
// initialize worker interfaces
address_interface->clear();
wait(getWorkerInterfaces(tr, address_interface));
wait(getWorkerInterfaces(tr, address_interface, true));
}
if (tokens.size() == 1 || tokencmp(tokens[1], "list")) {
if (address_interface->size() == 0) {

View File

@ -0,0 +1,432 @@
/*
* MetaclusterCommands.actor.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "fdbcli/fdbcli.actor.h"
#include "fdbclient/FDBOptions.g.h"
#include "fdbclient/IClientApi.h"
#include "fdbclient/Knobs.h"
#include "fdbclient/MetaclusterManagement.actor.h"
#include "fdbclient/Schemas.h"
#include "flow/Arena.h"
#include "flow/FastRef.h"
#include "flow/ThreadHelper.actor.h"
#include "flow/actorcompiler.h" // This must be the last #include.
namespace fdb_cli {
Optional<std::pair<Optional<ClusterConnectionString>, Optional<DataClusterEntry>>>
parseClusterConfiguration(std::vector<StringRef> const& tokens, DataClusterEntry const& defaults, int startIndex) {
Optional<DataClusterEntry> entry;
Optional<ClusterConnectionString> connectionString;
std::set<std::string> usedParams;
for (int tokenNum = startIndex; tokenNum < tokens.size(); ++tokenNum) {
StringRef token = tokens[tokenNum];
bool foundEquals;
StringRef param = token.eat("=", &foundEquals);
if (!foundEquals) {
fmt::print(stderr,
"ERROR: invalid configuration string `{}'. String must specify a value using `='.\n",
param.toString().c_str());
return {};
}
std::string value = token.toString();
if (!usedParams.insert(value).second) {
fmt::print(
stderr, "ERROR: configuration parameter `{}' specified more than once.\n", param.toString().c_str());
return {};
}
if (tokencmp(param, "max_tenant_groups")) {
entry = defaults;
int n;
if (sscanf(value.c_str(), "%d%n", &entry.get().capacity.numTenantGroups, &n) != 1 || n != value.size() ||
entry.get().capacity.numTenantGroups < 0) {
fmt::print(stderr, "ERROR: invalid number of tenant groups `{}'.\n", value.c_str());
return {};
}
} else if (tokencmp(param, "connection_string")) {
connectionString = ClusterConnectionString(value);
} else {
fmt::print(stderr, "ERROR: unrecognized configuration parameter `{}'.\n", param.toString().c_str());
return {};
}
}
return std::make_pair(connectionString, entry);
}
void printMetaclusterConfigureOptionsUsage() {
fmt::print("max_tenant_groups sets the maximum number of tenant groups that can be assigned\n"
"to the named data cluster.\n");
fmt::print("connection_string sets the connection string for the named data cluster.\n");
}
// metacluster create command
ACTOR Future<bool> metaclusterCreateCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() != 3) {
fmt::print("Usage: metacluster create_experimental <NAME>\n\n");
fmt::print("Configures the cluster to be a management cluster in a metacluster.\n");
fmt::print("NAME is an identifier used to distinguish this metacluster from other metaclusters.\n");
return false;
}
Optional<std::string> errorStr = wait(MetaclusterAPI::createMetacluster(db, tokens[2]));
if (errorStr.present()) {
fmt::print("ERROR: {}.\n", errorStr.get());
} else {
fmt::print("The cluster has been configured as a metacluster.\n");
}
return true;
}
// metacluster decommission command
ACTOR Future<bool> metaclusterDecommissionCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() != 2) {
fmt::print("Usage: metacluster decommission\n\n");
fmt::print("Converts the current cluster from a metacluster management cluster back into an\n");
fmt::print("ordinary cluster. It must be called on a cluster with no registered data clusters.\n");
return false;
}
wait(MetaclusterAPI::decommissionMetacluster(db));
fmt::print("The cluster is no longer a metacluster.\n");
return true;
}
// metacluster register command
ACTOR Future<bool> metaclusterRegisterCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() < 4) {
fmt::print("Usage: metacluster register <NAME> connection_string=<CONNECTION_STRING>\n"
"[max_tenant_groups=<NUM_GROUPS>]\n\n");
fmt::print("Adds a data cluster to a metacluster.\n");
fmt::print("NAME is used to identify the cluster in future commands.\n");
printMetaclusterConfigureOptionsUsage();
return false;
}
DataClusterEntry defaultEntry;
auto config = parseClusterConfiguration(tokens, defaultEntry, 3);
if (!config.present()) {
return false;
} else if (!config.get().first.present()) {
fmt::print(stderr, "ERROR: connection_string must be configured when registering a cluster.\n");
return false;
}
wait(MetaclusterAPI::registerCluster(
db, tokens[2], config.get().first.get(), config.get().second.orDefault(defaultEntry)));
fmt::print("The cluster `{}' has been added\n", printable(tokens[2]).c_str());
return true;
}
// metacluster remove command
ACTOR Future<bool> metaclusterRemoveCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() < 3 || tokens.size() > 4 || (tokens.size() == 4 && tokens[2] != "FORCE"_sr)) {
fmt::print("Usage: metacluster remove [FORCE] <NAME> \n\n");
fmt::print("Removes the specified data cluster from a metacluster.\n");
fmt::print("If FORCE is specified, then the cluster will be detached even if it has\n"
"tenants assigned to it.\n");
return false;
}
state ClusterNameRef clusterName = tokens[tokens.size() - 1];
wait(MetaclusterAPI::removeCluster(db, clusterName, tokens.size() == 4));
fmt::print("The cluster `{}' has been removed\n", printable(clusterName).c_str());
return true;
}
// metacluster configure command
ACTOR Future<bool> metaclusterConfigureCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() < 4) {
fmt::print("Usage: metacluster configure <NAME> <max_tenant_groups=<NUM_GROUPS>|\n"
"connection_string=<CONNECTION_STRING>> ...\n\n");
fmt::print("Updates the configuration of the metacluster.\n");
printMetaclusterConfigureOptionsUsage();
return false;
}
state Reference<ITransaction> tr = db->createTransaction();
loop {
try {
tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
Optional<DataClusterMetadata> metadata = wait(MetaclusterAPI::tryGetClusterTransaction(tr, tokens[2]));
if (!metadata.present()) {
throw cluster_not_found();
}
auto config = parseClusterConfiguration(tokens, metadata.get().entry, 3);
if (!config.present()) {
return false;
}
MetaclusterAPI::updateClusterMetadata(
tr, tokens[2], metadata.get(), config.get().first, config.get().second);
wait(safeThreadFutureToFuture(tr->commit()));
break;
} catch (Error& e) {
wait(safeThreadFutureToFuture(tr->onError(e)));
}
}
return true;
}
// metacluster list command
ACTOR Future<bool> metaclusterListCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() > 5) {
fmt::print("Usage: metacluster list [BEGIN] [END] [LIMIT]\n\n");
fmt::print("Lists the data clusters in a metacluster.\n");
fmt::print("Only cluster names in the range BEGIN - END will be printed.\n");
fmt::print("An optional LIMIT can be specified to limit the number of results (default 100).\n");
return false;
}
state ClusterNameRef begin = tokens.size() > 2 ? tokens[2] : ""_sr;
state ClusterNameRef end = tokens.size() > 3 ? tokens[3] : "\xff"_sr;
int limit = 100;
if (tokens.size() > 4) {
int n = 0;
if (sscanf(tokens[4].toString().c_str(), "%d%n", &limit, &n) != 1 || n != tokens[4].size() || limit < 0) {
fmt::print(stderr, "ERROR: invalid limit {}\n", tokens[4].toString().c_str());
return false;
}
}
std::map<ClusterName, DataClusterMetadata> clusters = wait(MetaclusterAPI::listClusters(db, begin, end, limit));
if (clusters.empty()) {
if (tokens.size() == 2) {
fmt::print("The metacluster has no registered data clusters\n");
} else {
fmt::print("The metacluster has no registered data clusters in the specified range\n");
}
}
int index = 0;
for (auto cluster : clusters) {
fmt::print(" {}. {}\n", ++index, printable(cluster.first).c_str());
}
return true;
}
// metacluster get command
ACTOR Future<bool> metaclusterGetCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() > 4 || (tokens.size() == 4 && tokens[3] != "JSON"_sr)) {
fmt::print("Usage: metacluster get <NAME> [JSON]\n\n");
fmt::print("Prints metadata associated with the given data cluster.\n");
fmt::print("If JSON is specified, then the output will be in JSON format.\n");
return false;
}
state bool useJson = tokens.size() == 4;
try {
DataClusterMetadata metadata = wait(MetaclusterAPI::getCluster(db, tokens[2]));
if (useJson) {
json_spirit::mObject obj;
obj["type"] = "success";
obj["cluster"] = metadata.toJson();
fmt::print("{}\n", json_spirit::write_string(json_spirit::mValue(obj), json_spirit::pretty_print).c_str());
} else {
fmt::print(" connection string: {}\n", metadata.connectionString.toString().c_str());
fmt::print(" cluster state: {}\n", DataClusterEntry::clusterStateToString(metadata.entry.clusterState));
fmt::print(" tenant group capacity: {}\n", metadata.entry.capacity.numTenantGroups);
fmt::print(" allocated tenant groups: {}\n", metadata.entry.allocated.numTenantGroups);
}
} catch (Error& e) {
if (useJson) {
json_spirit::mObject obj;
obj["type"] = "error";
obj["error"] = e.what();
fmt::print("{}\n", json_spirit::write_string(json_spirit::mValue(obj), json_spirit::pretty_print).c_str());
return false;
} else {
throw;
}
}
return true;
}
// metacluster status command
ACTOR Future<bool> metaclusterStatusCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() < 2 || tokens.size() > 3) {
fmt::print("Usage: metacluster status [JSON]\n\n");
fmt::print("Prints metacluster metadata.\n");
fmt::print("If JSON is specified, then the output will be in JSON format.\n");
return false;
}
state bool useJson = tokens.size() == 3;
try {
std::map<ClusterName, DataClusterMetadata> clusters =
wait(MetaclusterAPI::listClusters(db, ""_sr, "\xff"_sr, CLIENT_KNOBS->MAX_DATA_CLUSTERS));
ClusterUsage totalCapacity;
ClusterUsage totalAllocated;
for (auto cluster : clusters) {
totalCapacity.numTenantGroups +=
std::max(cluster.second.entry.capacity.numTenantGroups, cluster.second.entry.allocated.numTenantGroups);
totalAllocated.numTenantGroups += cluster.second.entry.allocated.numTenantGroups;
}
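// For example, a data cluster with a configured capacity of 5 tenant groups but 7 allocated contributes 7 to
// totalCapacity, so the reported capacity never falls below the reported allocation.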
if (useJson) {
json_spirit::mObject obj;
obj["type"] = "success";
json_spirit::mObject metaclusterObj;
metaclusterObj["data_clusters"] = (int)clusters.size();
metaclusterObj["capacity"] = totalCapacity.toJson();
metaclusterObj["allocated"] = totalAllocated.toJson();
obj["metacluster"] = metaclusterObj;
fmt::print("{}\n", json_spirit::write_string(json_spirit::mValue(obj), json_spirit::pretty_print).c_str());
} else {
fmt::print(" number of data clusters: {}\n", clusters.size());
fmt::print(" tenant group capacity: {}\n", totalCapacity.numTenantGroups);
fmt::print(" allocated tenant groups: {}\n", totalAllocated.numTenantGroups);
}
return true;
} catch (Error& e) {
if (useJson) {
json_spirit::mObject obj;
obj["type"] = "error";
obj["error"] = e.what();
fmt::print("{}\n", json_spirit::write_string(json_spirit::mValue(obj), json_spirit::pretty_print).c_str());
return false;
} else {
throw;
}
}
}
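// A sketch of the JSON shape produced by `metacluster status JSON` above (values are illustrative, and the
// contents of "capacity"/"allocated" depend on ClusterUsage::toJson(), which is not shown here):
//   { "type": "success", "metacluster": { "data_clusters": 2, "capacity": { ... }, "allocated": { ... } } }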
// metacluster command
Future<bool> metaclusterCommand(Reference<IDatabase> db, std::vector<StringRef> tokens) {
if (tokens.size() == 1) {
printUsage(tokens[0]);
return true;
} else if (tokencmp(tokens[1], "create_experimental")) {
return metaclusterCreateCommand(db, tokens);
} else if (tokencmp(tokens[1], "decommission")) {
return metaclusterDecommissionCommand(db, tokens);
} else if (tokencmp(tokens[1], "register")) {
return metaclusterRegisterCommand(db, tokens);
} else if (tokencmp(tokens[1], "remove")) {
return metaclusterRemoveCommand(db, tokens);
} else if (tokencmp(tokens[1], "configure")) {
return metaclusterConfigureCommand(db, tokens);
} else if (tokencmp(tokens[1], "list")) {
return metaclusterListCommand(db, tokens);
} else if (tokencmp(tokens[1], "get")) {
return metaclusterGetCommand(db, tokens);
} else if (tokencmp(tokens[1], "status")) {
return metaclusterStatusCommand(db, tokens);
} else {
printUsage(tokens[0]);
return true;
}
}
void metaclusterGenerator(const char* text,
const char* line,
std::vector<std::string>& lc,
std::vector<StringRef> const& tokens) {
if (tokens.size() == 1) {
const char* opts[] = {
"create_experimental", "decommission", "register", "remove", "configure", "list", "get", "status", nullptr
};
arrayGenerator(text, line, opts, lc);
} else if (tokens.size() > 1 && (tokencmp(tokens[1], "register") || tokencmp(tokens[1], "configure"))) {
const char* opts[] = { "max_tenant_groups=", "connection_string=", nullptr };
arrayGenerator(text, line, opts, lc);
} else if ((tokens.size() == 2 && tokencmp(tokens[1], "status")) ||
(tokens.size() == 3 && tokencmp(tokens[1], "get"))) {
const char* opts[] = { "JSON", nullptr };
arrayGenerator(text, line, opts, lc);
}
}
std::vector<const char*> metaclusterHintGenerator(std::vector<StringRef> const& tokens, bool inArgument) {
if (tokens.size() == 1) {
return { "<create_experimental|decommission|register|remove|configure|list|get|status>", "[ARGS]" };
} else if (tokencmp(tokens[1], "create_experimental")) {
return { "<NAME>" };
} else if (tokencmp(tokens[1], "decommission")) {
return {};
} else if (tokencmp(tokens[1], "register") && tokens.size() < 5) {
static std::vector<const char*> opts = { "<NAME>",
"connection_string=<CONNECTION_STRING>",
"[max_tenant_groups=<NUM_GROUPS>]" };
return std::vector<const char*>(opts.begin() + tokens.size() - 2, opts.end());
} else if (tokencmp(tokens[1], "remove") && tokens.size() < 4) {
static std::vector<const char*> opts = { "[FORCE]", "<NAME>" };
if (tokens.size() == 2) {
return opts;
} else if (tokens.size() == 3 && (inArgument || tokens[2].size() == "FORCE"_sr.size()) &&
"FORCE"_sr.startsWith(tokens[2])) {
return std::vector<const char*>(opts.begin() + tokens.size() - 2, opts.end());
} else {
return {};
}
} else if (tokencmp(tokens[1], "configure")) {
static std::vector<const char*> opts = {
"<NAME>", "<max_tenant_groups=<NUM_GROUPS>|connection_string=<CONNECTION_STRING>>"
};
return std::vector<const char*>(opts.begin() + std::min<int>(1, tokens.size() - 2), opts.end());
} else if (tokencmp(tokens[1], "list") && tokens.size() < 5) {
static std::vector<const char*> opts = { "[BEGIN]", "[END]", "[LIMIT]" };
return std::vector<const char*>(opts.begin() + tokens.size() - 2, opts.end());
} else if (tokencmp(tokens[1], "get") && tokens.size() < 4) {
static std::vector<const char*> opts = { "<NAME>", "[JSON]" };
return std::vector<const char*>(opts.begin() + tokens.size() - 2, opts.end());
} else if (tokencmp(tokens[1], "status") && tokens.size() == 2) {
return { "[JSON]" };
} else {
return {};
}
}
CommandFactory metaclusterRegisterFactory(
"metacluster",
CommandHelp("metacluster <create_experimental|decommission|register|remove|configure|list|get|status> [ARGS]",
"view and manage a metacluster",
"`create_experimental' and `decommission' set up or deconfigure a metacluster.\n"
"`register' and `remove' add and remove data clusters from the metacluster.\n"
"`configure' updates the configuration of a data cluster.\n"
"`list' prints a list of data clusters in the metacluster.\n"
"`get' prints the metadata for a particular data cluster.\n"
"`status' prints metacluster metadata.\n"),
&metaclusterGenerator,
&metaclusterHintGenerator);
} // namespace fdb_cli

View File

@ -411,6 +411,7 @@ void printStatus(StatusObjectReader statusObj,
outputString += "\nConfiguration:";
std::string outputStringCache = outputString;
bool isOldMemory = false;
bool blobGranuleEnabled{ false };
try {
// Configuration section
// FIXME: Should we suppress this if there are cluster messages implying that the database has no
@ -434,7 +435,6 @@ void printStatus(StatusObjectReader statusObj,
outputString += "unknown";
int intVal = 0;
bool blobGranuleEnabled{ false };
if (statusObjConfig.get("blob_granules_enabled", intVal) && intVal) {
blobGranuleEnabled = true;
}
@ -1110,6 +1110,15 @@ void printStatus(StatusObjectReader statusObj,
outputString += "\n\nCoordination servers:";
outputString += getCoordinatorsInfoString(statusObj);
}
if (blobGranuleEnabled) {
outputString += "\n\nBlob Granules:";
StatusObjectReader statusObjBlobGranules = statusObjCluster["blob_granules"];
auto numWorkers = statusObjBlobGranules["number_of_blob_workers"].get_int();
outputString += "\n Number of Workers - " + format("%d", numWorkers);
auto numKeyRanges = statusObjBlobGranules["number_of_key_ranges"].get_int();
outputString += "\n Number of Key Ranges - " + format("%d", numKeyRanges);
}
}
// client time

View File

@ -43,7 +43,7 @@ ACTOR Future<bool> suspendCommandActor(Reference<IDatabase> db,
if (tokens.size() == 1) {
// initialize worker interfaces
address_interface->clear();
wait(getWorkerInterfaces(tr, address_interface));
wait(getWorkerInterfaces(tr, address_interface, true));
if (address_interface->size() == 0) {
printf("\nNo addresses can be suspended.\n");
} else if (address_interface->size() == 1) {

View File

@ -25,6 +25,7 @@
#include "fdbclient/IClientApi.h"
#include "fdbclient/Knobs.h"
#include "fdbclient/ManagementAPI.actor.h"
#include "fdbclient/MetaclusterManagement.actor.h"
#include "fdbclient/TenantManagement.actor.h"
#include "fdbclient/Schemas.h"
@ -100,9 +101,9 @@ Key makeConfigKey(TenantNameRef tenantName, StringRef configName) {
return tenantConfigSpecialKeyRange.begin.withSuffix(Tuple().append(tenantName).append(configName).pack());
}
void applyConfiguration(Reference<ITransaction> tr,
TenantNameRef tenantName,
std::map<Standalone<StringRef>, Optional<Value>> configuration) {
void applyConfigurationToSpecialKeys(Reference<ITransaction> tr,
TenantNameRef tenantName,
std::map<Standalone<StringRef>, Optional<Value>> configuration) {
for (auto [configName, value] : configuration) {
if (value.present()) {
tr->set(makeConfigKey(tenantName, configName), value.get());
@ -136,21 +137,32 @@ ACTOR Future<bool> createTenantCommandActor(Reference<IDatabase> db, std::vector
}
loop {
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
try {
if (!doneExistenceCheck) {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> existingTenantFuture = tr->get(tenantNameKey);
Optional<Value> existingTenant = wait(safeThreadFutureToFuture(existingTenantFuture));
if (existingTenant.present()) {
throw tenant_already_exists();
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
state ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
TenantMapEntry tenantEntry;
for (auto const& [name, value] : configuration.get()) {
tenantEntry.configure(name, value);
}
doneExistenceCheck = true;
wait(MetaclusterAPI::createTenant(db, tokens[1], tenantEntry));
} else {
if (!doneExistenceCheck) {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> existingTenantFuture = tr->get(tenantNameKey);
Optional<Value> existingTenant = wait(safeThreadFutureToFuture(existingTenantFuture));
if (existingTenant.present()) {
throw tenant_already_exists();
}
doneExistenceCheck = true;
}
tr->set(tenantNameKey, ValueRef());
applyConfigurationToSpecialKeys(tr, tokens[1], configuration.get());
wait(safeThreadFutureToFuture(tr->commit()));
}
tr->set(tenantNameKey, ValueRef());
applyConfiguration(tr, tokens[1], configuration.get());
wait(safeThreadFutureToFuture(tr->commit()));
break;
} catch (Error& e) {
state Error err(e);
@ -167,10 +179,12 @@ ACTOR Future<bool> createTenantCommandActor(Reference<IDatabase> db, std::vector
return true;
}
CommandFactory createTenantFactory("createtenant",
CommandHelp("createtenant <TENANT_NAME> [tenant_group=<TENANT_GROUP>]",
"creates a new tenant in the cluster",
"Creates a new tenant in the cluster with the specified name."));
CommandFactory createTenantFactory(
"createtenant",
CommandHelp("createtenant <TENANT_NAME> [tenant_group=<TENANT_GROUP>]",
"creates a new tenant in the cluster",
"Creates a new tenant in the cluster with the specified name. An optional group can be specified"
"that will require this tenant to be placed on the same cluster as other tenants in the same group."));
// deletetenant command
ACTOR Future<bool> deleteTenantCommandActor(Reference<IDatabase> db, std::vector<StringRef> tokens, int apiVersion) {
@ -184,20 +198,27 @@ ACTOR Future<bool> deleteTenantCommandActor(Reference<IDatabase> db, std::vector
state bool doneExistenceCheck = false;
loop {
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
try {
if (!doneExistenceCheck) {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> existingTenantFuture = tr->get(tenantNameKey);
Optional<Value> existingTenant = wait(safeThreadFutureToFuture(existingTenantFuture));
if (!existingTenant.present()) {
throw tenant_not_found();
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
state ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
wait(MetaclusterAPI::deleteTenant(db, tokens[1]));
} else {
if (!doneExistenceCheck) {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> existingTenantFuture = tr->get(tenantNameKey);
Optional<Value> existingTenant = wait(safeThreadFutureToFuture(existingTenantFuture));
if (!existingTenant.present()) {
throw tenant_not_found();
}
doneExistenceCheck = true;
}
doneExistenceCheck = true;
tr->clear(tenantNameKey);
wait(safeThreadFutureToFuture(tr->commit()));
}
tr->clear(tenantNameKey);
wait(safeThreadFutureToFuture(tr->commit()));
break;
} catch (Error& e) {
state Error err(e);
@ -228,8 +249,8 @@ ACTOR Future<bool> listTenantsCommandActor(Reference<IDatabase> db, std::vector<
return false;
}
StringRef beginTenant = ""_sr;
StringRef endTenant = "\xff\xff"_sr;
state StringRef beginTenant = ""_sr;
state StringRef endTenant = "\xff\xff"_sr;
state int limit = 100;
if (tokens.size() >= 2) {
@ -256,12 +277,26 @@ ACTOR Future<bool> listTenantsCommandActor(Reference<IDatabase> db, std::vector<
loop {
try {
// Hold the reference to the standalone's memory
state ThreadFuture<RangeResult> kvsFuture =
tr->getRange(firstGreaterOrEqual(beginTenantKey), firstGreaterOrEqual(endTenantKey), limit);
RangeResult tenants = wait(safeThreadFutureToFuture(kvsFuture));
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
state ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
state std::vector<TenantNameRef> tenantNames;
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
std::vector<std::pair<TenantName, TenantMapEntry>> tenants =
wait(MetaclusterAPI::listTenantsTransaction(tr, beginTenant, endTenant, limit));
for (auto tenant : tenants) {
tenantNames.push_back(tenant.first);
}
} else {
// Hold the reference to the standalone's memory
state ThreadFuture<RangeResult> kvsFuture =
tr->getRange(firstGreaterOrEqual(beginTenantKey), firstGreaterOrEqual(endTenantKey), limit);
RangeResult tenants = wait(safeThreadFutureToFuture(kvsFuture));
for (auto tenant : tenants) {
tenantNames.push_back(tenant.key.removePrefix(tenantMapSpecialKeyRange(apiVersion).begin));
}
}
if (tenants.empty()) {
if (tenantNames.empty()) {
if (tokens.size() == 1) {
fmt::print("The cluster has no tenants\n");
} else {
@ -270,10 +305,8 @@ ACTOR Future<bool> listTenantsCommandActor(Reference<IDatabase> db, std::vector<
}
int index = 0;
for (auto tenant : tenants) {
fmt::print(" {}. {}\n",
++index,
printable(tenant.key.removePrefix(tenantMapSpecialKeyRange(apiVersion).begin)).c_str());
for (auto tenantName : tenantNames) {
fmt::print(" {}. {}\n", ++index, printable(tenantName).c_str());
}
return true;
@ -309,15 +342,24 @@ ACTOR Future<bool> getTenantCommandActor(Reference<IDatabase> db, std::vector<St
loop {
try {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> tenantFuture = tr->get(tenantNameKey);
Optional<Value> tenant = wait(safeThreadFutureToFuture(tenantFuture));
if (!tenant.present()) {
throw tenant_not_found();
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
state ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
state std::string tenantJson;
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
TenantMapEntry entry = wait(MetaclusterAPI::getTenantTransaction(tr, tokens[1]));
tenantJson = entry.toJson(apiVersion);
} else {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> tenantFuture = tr->get(tenantNameKey);
Optional<Value> tenant = wait(safeThreadFutureToFuture(tenantFuture));
if (!tenant.present()) {
throw tenant_not_found();
}
tenantJson = tenant.get().toString();
}
json_spirit::mValue jsonObject;
json_spirit::read_string(tenant.get().toString(), jsonObject);
json_spirit::read_string(tenantJson, jsonObject);
if (useJson) {
json_spirit::mObject resultObj;
@ -333,6 +375,7 @@ ACTOR Future<bool> getTenantCommandActor(Reference<IDatabase> db, std::vector<St
std::string prefix;
std::string tenantState;
std::string tenantGroup;
std::string assignedCluster;
doc.get("id", id);
@ -344,6 +387,7 @@ ACTOR Future<bool> getTenantCommandActor(Reference<IDatabase> db, std::vector<St
doc.get("tenant_state", tenantState);
bool hasTenantGroup = doc.tryGet("tenant_group.printable", tenantGroup);
bool hasAssignedCluster = doc.tryGet("assigned_cluster", assignedCluster);
fmt::print(" id: {}\n", id);
fmt::print(" prefix: {}\n", printable(prefix).c_str());
@ -351,8 +395,10 @@ ACTOR Future<bool> getTenantCommandActor(Reference<IDatabase> db, std::vector<St
if (hasTenantGroup) {
fmt::print(" tenant group: {}\n", tenantGroup.c_str());
}
if (hasAssignedCluster) {
fmt::print(" assigned cluster: {}\n", printable(assignedCluster).c_str());
}
}
return true;
} catch (Error& e) {
try {
@ -408,10 +454,17 @@ ACTOR Future<bool> configureTenantCommandActor(Reference<IDatabase> db, std::vec
state Reference<ITransaction> tr = db->createTransaction();
loop {
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
try {
applyConfiguration(tr, tokens[1], configuration.get());
wait(safeThreadFutureToFuture(tr->commit()));
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
TenantMapEntry tenantEntry;
wait(MetaclusterAPI::configureTenant(db, tokens[1], configuration.get()));
} else {
applyConfigurationToSpecialKeys(tr, tokens[1], configuration.get());
wait(safeThreadFutureToFuture(tr->commit()));
}
break;
} catch (Error& e) {
state Error err(e);
@ -456,50 +509,56 @@ ACTOR Future<bool> renameTenantCommandActor(Reference<IDatabase> db, std::vector
state Key tenantOldNameKey = tenantMapSpecialKeyRange(apiVersion).begin.withSuffix(tokens[1]);
state Key tenantNewNameKey = tenantMapSpecialKeyRange(apiVersion).begin.withSuffix(tokens[2]);
state bool firstTry = true;
state int64_t id;
state int64_t id = -1;
loop {
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
try {
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> oldEntryFuture = tr->get(tenantOldNameKey);
state ThreadFuture<Optional<Value>> newEntryFuture = tr->get(tenantNewNameKey);
state Optional<Value> oldEntry = wait(safeThreadFutureToFuture(oldEntryFuture));
state Optional<Value> newEntry = wait(safeThreadFutureToFuture(newEntryFuture));
if (firstTry) {
if (!oldEntry.present()) {
throw tenant_not_found();
}
if (newEntry.present()) {
throw tenant_already_exists();
}
// Store the id we see when first reading this key
id = getTenantId(oldEntry.get());
firstTry = false;
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
state ClusterType clusterType = wait(TenantAPI::getClusterType(tr));
if (clusterType == ClusterType::METACLUSTER_MANAGEMENT) {
wait(MetaclusterAPI::renameTenant(db, tokens[1], tokens[2]));
} else {
// If we got commit_unknown_result, the rename may have already occurred.
if (newEntry.present()) {
int64_t checkId = getTenantId(newEntry.get());
if (id == checkId) {
ASSERT(!oldEntry.present() || getTenantId(oldEntry.get()) != id);
return true;
// Hold the reference to the standalone's memory
state ThreadFuture<Optional<Value>> oldEntryFuture = tr->get(tenantOldNameKey);
state ThreadFuture<Optional<Value>> newEntryFuture = tr->get(tenantNewNameKey);
state Optional<Value> oldEntry = wait(safeThreadFutureToFuture(oldEntryFuture));
state Optional<Value> newEntry = wait(safeThreadFutureToFuture(newEntryFuture));
if (firstTry) {
if (!oldEntry.present()) {
throw tenant_not_found();
}
if (newEntry.present()) {
throw tenant_already_exists();
}
// Store the id we see when first reading this key
id = getTenantId(oldEntry.get());
firstTry = false;
} else {
// If we got commit_unknown_result, the rename may have already occurred.
if (newEntry.present()) {
int64_t checkId = getTenantId(newEntry.get());
if (id == checkId) {
ASSERT(!oldEntry.present() || getTenantId(oldEntry.get()) != id);
return true;
}
// If the new entry is present but does not match, then
// the rename should fail, so we throw an error.
throw tenant_already_exists();
}
if (!oldEntry.present()) {
throw tenant_not_found();
}
int64_t checkId = getTenantId(oldEntry.get());
// If the id has changed since we made our first attempt,
// then it's possible we've already moved the tenant. Don't move it again.
if (id != checkId) {
throw tenant_not_found();
}
// If the new entry is present but does not match, then
// the rename should fail, so we throw an error.
throw tenant_already_exists();
}
if (!oldEntry.present()) {
throw tenant_not_found();
}
int64_t checkId = getTenantId(oldEntry.get());
// If the id has changed since we made our first attempt,
// then it's possible we've already moved the tenant. Don't move it again.
if (id != checkId) {
throw tenant_not_found();
}
tr->set(tenantRenameKey, tokens[2]);
wait(safeThreadFutureToFuture(tr->commit()));
}
tr->set(tenantRenameKey, tokens[2]);
wait(safeThreadFutureToFuture(tr->commit()));
break;
} catch (Error& e) {
state Error err(e);

View File

@ -62,56 +62,52 @@ ACTOR Future<std::string> getSpecialKeysFailureErrorMessage(Reference<ITransacti
return valueObj["message"].get_str();
}
ACTOR Future<Void> verifyAndAddInterface(std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface,
Reference<FlowLock> connectLock,
KeyValue kv) {
wait(connectLock->take());
state FlowLock::Releaser releaser(*connectLock);
state ClientWorkerInterface workerInterf;
try {
// the interface is backward compatible, so if parsing fails, the CLI version needs to be upgraded
workerInterf = BinaryReader::fromStringRef<ClientWorkerInterface>(kv.value, IncludeVersion());
} catch (Error& e) {
fprintf(stderr, "Error: %s; CLI version is too old, please update to use a newer version\n", e.what());
return Void();
}
state ClientLeaderRegInterface leaderInterf(workerInterf.address());
choose {
when(Optional<LeaderInfo> rep =
wait(brokenPromiseToNever(leaderInterf.getLeader.getReply(GetLeaderRequest())))) {
StringRef ip_port =
(kv.key.endsWith(LiteralStringRef(":tls")) ? kv.key.removeSuffix(LiteralStringRef(":tls")) : kv.key)
.removePrefix(LiteralStringRef("\xff\xff/worker_interfaces/"));
(*address_interface)[ip_port] = std::make_pair(kv.value, leaderInterf);
if (workerInterf.reboot.getEndpoint().addresses.secondaryAddress.present()) {
Key full_ip_port2 =
StringRef(workerInterf.reboot.getEndpoint().addresses.secondaryAddress.get().toString());
StringRef ip_port2 = full_ip_port2.endsWith(LiteralStringRef(":tls"))
? full_ip_port2.removeSuffix(LiteralStringRef(":tls"))
: full_ip_port2;
(*address_interface)[ip_port2] = std::make_pair(kv.value, leaderInterf);
}
void addInterfacesFromKVs(RangeResult& kvs,
std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface) {
for (const auto& kv : kvs) {
ClientWorkerInterface workerInterf;
try {
// the interface is backward compatible, so if parsing fails, the CLI version needs to be upgraded
workerInterf = BinaryReader::fromStringRef<ClientWorkerInterface>(kv.value, IncludeVersion());
} catch (Error& e) {
fprintf(stderr, "Error: %s; CLI version is too old, please update to use a newer version\n", e.what());
return;
}
ClientLeaderRegInterface leaderInterf(workerInterf.address());
StringRef ip_port =
(kv.key.endsWith(LiteralStringRef(":tls")) ? kv.key.removeSuffix(LiteralStringRef(":tls")) : kv.key)
.removePrefix(LiteralStringRef("\xff\xff/worker_interfaces/"));
(*address_interface)[ip_port] = std::make_pair(kv.value, leaderInterf);
if (workerInterf.reboot.getEndpoint().addresses.secondaryAddress.present()) {
Key full_ip_port2 =
StringRef(workerInterf.reboot.getEndpoint().addresses.secondaryAddress.get().toString());
StringRef ip_port2 = full_ip_port2.endsWith(LiteralStringRef(":tls"))
? full_ip_port2.removeSuffix(LiteralStringRef(":tls"))
: full_ip_port2;
(*address_interface)[ip_port2] = std::make_pair(kv.value, leaderInterf);
}
when(wait(delay(CLIENT_KNOBS->CLI_CONNECT_TIMEOUT))) {}
}
return Void();
}
ACTOR Future<Void> getWorkerInterfaces(Reference<ITransaction> tr,
std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface) {
std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface,
bool verify) {
if (verify) {
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->set(workerInterfacesVerifyOptionSpecialKey, ValueRef());
}
// Hold the reference to the standalone's memory
state ThreadFuture<RangeResult> kvsFuture = tr->getRange(
KeyRangeRef(LiteralStringRef("\xff\xff/worker_interfaces/"), LiteralStringRef("\xff\xff/worker_interfaces0")),
CLIENT_KNOBS->TOO_MANY);
RangeResult kvs = wait(safeThreadFutureToFuture(kvsFuture));
state RangeResult kvs = wait(safeThreadFutureToFuture(kvsFuture));
ASSERT(!kvs.more);
auto connectLock = makeReference<FlowLock>(CLIENT_KNOBS->CLI_CONNECT_PARALLELISM);
std::vector<Future<Void>> addInterfs;
for (auto it : kvs) {
addInterfs.push_back(verifyAndAddInterface(address_interface, connectLock, it));
if (verify) {
// remove the option if set
tr->clear(workerInterfacesVerifyOptionSpecialKey);
}
wait(waitForAll(addInterfs));
addInterfacesFromKVs(kvs, address_interface);
return Void();
}

View File

@ -103,6 +103,7 @@ enum {
OPT_DEBUG_TLS,
OPT_API_VERSION,
OPT_MEMORY,
OPT_USE_FUTURE_PROTOCOL_VERSION
};
CSimpleOpt::SOption g_rgOptions[] = { { OPT_CONNFILE, "-C", SO_REQ_SEP },
@ -127,6 +128,7 @@ CSimpleOpt::SOption g_rgOptions[] = { { OPT_CONNFILE, "-C", SO_REQ_SEP },
{ OPT_DEBUG_TLS, "--debug-tls", SO_NONE },
{ OPT_API_VERSION, "--api-version", SO_REQ_SEP },
{ OPT_MEMORY, "--memory", SO_REQ_SEP },
{ OPT_USE_FUTURE_PROTOCOL_VERSION, "--use-future-protocol-version", SO_NONE },
TLS_OPTION_FLAGS,
SO_END_OF_OPTIONS };
@ -475,6 +477,9 @@ static void printProgramUsage(const char* name) {
" Useful in reporting and diagnosing TLS issues.\n"
" --build-flags Print build information and exit.\n"
" --memory Resident memory limit of the CLI (defaults to 8GiB).\n"
" --use-future-protocol-version\n"
" Use the simulated future protocol version to connect to the cluster.\n"
" This option can be used testing purposes only!\n"
" -v, --version Print FoundationDB CLI version information and exit.\n"
" -h, --help Display this help and exit.\n");
}
@ -578,7 +583,7 @@ void initHelp() {
void printVersion() {
printf("FoundationDB CLI " FDB_VT_PACKAGE_NAME " (v" FDB_VT_VERSION ")\n");
printf("source version %s\n", getSourceVersion());
printf("protocol %" PRIx64 "\n", currentProtocolVersion.version());
printf("protocol %" PRIx64 "\n", currentProtocolVersion().version());
}
void printBuildInformation() {
@ -872,6 +877,7 @@ struct CLIOptions {
Optional<std::string> exec;
bool initialStatusCheck = true;
bool cliHints = true;
bool useFutureProtocolVersion = false;
bool debugTLS = false;
std::string tlsCertPath;
std::string tlsKeyPath;
@ -973,6 +979,10 @@ struct CLIOptions {
break;
case OPT_NO_HINTS:
cliHints = false;
break;
case OPT_USE_FUTURE_PROTOCOL_VERSION:
useFutureProtocolVersion = true;
break;
// TLS Options
case TLSConfig::OPT_TLS_PLUGIN:
@ -1040,36 +1050,6 @@ Future<T> stopNetworkAfter(Future<T> what) {
}
}
ACTOR Future<Void> addInterface(std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface,
Reference<FlowLock> connectLock,
KeyValue kv) {
wait(connectLock->take());
state FlowLock::Releaser releaser(*connectLock);
state ClientWorkerInterface workerInterf =
BinaryReader::fromStringRef<ClientWorkerInterface>(kv.value, IncludeVersion());
state ClientLeaderRegInterface leaderInterf(workerInterf.address());
choose {
when(Optional<LeaderInfo> rep =
wait(brokenPromiseToNever(leaderInterf.getLeader.getReply(GetLeaderRequest())))) {
StringRef ip_port =
(kv.key.endsWith(LiteralStringRef(":tls")) ? kv.key.removeSuffix(LiteralStringRef(":tls")) : kv.key)
.removePrefix(LiteralStringRef("\xff\xff/worker_interfaces/"));
(*address_interface)[ip_port] = std::make_pair(kv.value, leaderInterf);
if (workerInterf.reboot.getEndpoint().addresses.secondaryAddress.present()) {
Key full_ip_port2 =
StringRef(workerInterf.reboot.getEndpoint().addresses.secondaryAddress.get().toString());
StringRef ip_port2 = full_ip_port2.endsWith(LiteralStringRef(":tls"))
? full_ip_port2.removeSuffix(LiteralStringRef(":tls"))
: full_ip_port2;
(*address_interface)[ip_port2] = std::make_pair(kv.value, leaderInterf);
}
}
when(wait(delay(CLIENT_KNOBS->CLI_CONNECT_TIMEOUT))) {}
}
return Void();
}
ACTOR Future<int> cli(CLIOptions opt, LineNoise* plinenoise) {
state LineNoise& linenoise = *plinenoise;
state bool intrans = false;
@ -1967,6 +1947,13 @@ ACTOR Future<int> cli(CLIOptions opt, LineNoise* plinenoise) {
continue;
}
if (tokencmp(tokens[0], "metacluster")) {
bool _result = wait(makeInterruptable(metaclusterCommand(db, tokens)));
if (!_result)
is_error = true;
continue;
}
fprintf(stderr, "ERROR: Unknown command `%s'. Try `help'?\n", formatStringRef(tokens[0]).c_str());
is_error = true;
}
@ -2192,6 +2179,9 @@ int main(int argc, char** argv) {
try {
API->selectApiVersion(opt.apiVersion);
if (opt.useFutureProtocolVersion) {
API->useFutureProtocolVersion();
}
API->setupNetwork();
opt.setupKnobs();
if (opt.exit_code != -1) {

View File

@ -120,6 +120,7 @@ extern const KeyRangeRef processClassSourceSpecialKeyRange;
extern const KeyRangeRef processClassTypeSpecialKeyRange;
// Other special keys
inline const KeyRef errorMsgSpecialKey = LiteralStringRef("\xff\xff/error_message");
inline const KeyRef workerInterfacesVerifyOptionSpecialKey = "\xff\xff/management/options/worker_interfaces/verify"_sr;
// help functions (Copied from fdbcli.actor.cpp)
// get all workers' info
@ -132,13 +133,14 @@ void printUsage(StringRef command);
// Pre: tr failed with special_keys_api_failure error
// Read the error message special key and return the message
ACTOR Future<std::string> getSpecialKeysFailureErrorMessage(Reference<ITransaction> tr);
// Using \xff\xff/worker_interfaces/ special key, get all worker interfaces
// Using \xff\xff/worker_interfaces/ special key, get all worker interfaces.
// A worker list will be returned from CC.
// If verify is set, we will try to establish connections to all workers returned.
// In particular, it will deserialize \xff\xff/worker_interfaces/<address>:=<ClientInterface> kv pairs and issue RPC
// calls, then only return the interfaces (kv pairs) the client can talk to.
ACTOR Future<Void> getWorkerInterfaces(Reference<ITransaction> tr,
std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface);
// Deserialize \xff\xff/worker_interfaces/<address>:=<ClientInterface> k-v pair and verify by a RPC call
ACTOR Future<Void> verifyAndAddInterface(std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface,
Reference<FlowLock> connectLock,
KeyValue kv);
std::map<Key, std::pair<Value, ClientLeaderRegInterface>>* address_interface,
bool verify = false);
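// A minimal caller sketch, assuming a transaction created from the connected database (it mirrors the suspend
// command shown earlier, which passes verify=true):
//   state std::map<Key, std::pair<Value, ClientLeaderRegInterface>> address_interface;
//   wait(getWorkerInterfaces(tr, &address_interface, true));
//   // address_interface now holds only the workers the client could reach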
// print cluster status info
void printStatus(StatusObjectReader statusObj,
StatusClient::StatusLevel level,
@ -200,6 +202,10 @@ ACTOR Future<bool> listTenantsCommandActor(Reference<IDatabase> db, std::vector<
// lock/unlock command
ACTOR Future<bool> lockCommandActor(Reference<IDatabase> db, std::vector<StringRef> tokens);
ACTOR Future<bool> unlockDatabaseActor(Reference<IDatabase> db, UID uid);
// metacluster command
Future<bool> metaclusterCommand(Reference<IDatabase> db, std::vector<StringRef> tokens);
// changefeed command
ACTOR Future<bool> changeFeedCommandActor(Database localDb,
Optional<TenantMapEntry> tenantEntry,

View File

@ -288,11 +288,46 @@ Reference<IBackupContainer> IBackupContainer::openContainer(const std::string& u
#ifdef BUILD_AZURE_BACKUP
else if (u.startsWith("azure://"_sr)) {
u.eat("azure://"_sr);
auto accountName = u.eat("@"_sr).toString();
auto endpoint = u.eat("/"_sr).toString();
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endpoint, accountName, containerName, encryptionKeyFileName);
auto address = u.eat("/"_sr);
if (address.endsWith(std::string(azure::storage_lite::constants::default_endpoint_suffix))) {
CODE_PROBE(true, "Azure backup url with standard azure storage account endpoint");
// <account>.<service>.core.windows.net/<resource_path>
auto endPoint = address.toString();
auto accountName = address.eat("."_sr).toString();
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endPoint, accountName, containerName, encryptionKeyFileName);
} else {
// resolve the network address if necessary
std::string endpoint(address.toString());
Optional<NetworkAddress> parsedAddress = NetworkAddress::parseOptional(endpoint);
if (!parsedAddress.present()) {
try {
auto hostname = Hostname::parse(endpoint);
auto resolvedAddress = hostname.resolveBlocking();
if (resolvedAddress.present()) {
CODE_PROBE(true, "Azure backup url with hostname in the endpoint");
parsedAddress = resolvedAddress.get();
}
} catch (Error& e) {
TraceEvent(SevError, "InvalidAzureBackupUrl").error(e).detail("Endpoint", endpoint);
throw backup_invalid_url();
}
}
if (!parsedAddress.present()) {
TraceEvent(SevError, "InvalidAzureBackupUrl").detail("Endpoint", endpoint);
throw backup_invalid_url();
}
auto accountName = u.eat("/"_sr).toString();
// Avoid including ":tls" and "(fromHostname)"
// note: the endpoint needs to contain the account name
// so either "<account_name>.blob.core.windows.net" or "<ip>:<port>/<account_name>"
endpoint =
fmt::format("{}/{}", formatIpPort(parsedAddress.get().ip, parsedAddress.get().port), accountName);
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endpoint, accountName, containerName, encryptionKeyFileName);
}
}
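// Illustrative URL shapes handled by the two branches above (account, host, and container names are placeholders,
// and the exact endpoint suffix comes from azure::storage_lite, so it may differ):
//   azure://myaccount.blob.core.windows.net/mycontainer/...   (standard storage-account endpoint)
//   azure://10.0.0.1:10000/myaccount/mycontainer/...          (explicit <ip>:<port>/<account> endpoint)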
#endif
else {

View File

@ -1523,11 +1523,46 @@ Reference<BackupContainerFileSystem> BackupContainerFileSystem::openContainerFS(
#ifdef BUILD_AZURE_BACKUP
else if (u.startsWith("azure://"_sr)) {
u.eat("azure://"_sr);
auto accountName = u.eat("@"_sr).toString();
auto endpoint = u.eat("/"_sr).toString();
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endpoint, accountName, containerName, encryptionKeyFileName);
auto address = u.eat("/"_sr);
if (address.endsWith(std::string(azure::storage_lite::constants::default_endpoint_suffix))) {
CODE_PROBE(true, "Azure backup url with standard azure storage account endpoint");
// <account>.<service>.core.windows.net/<resource_path>
auto endPoint = address.toString();
auto accountName = address.eat("."_sr).toString();
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endPoint, accountName, containerName, encryptionKeyFileName);
} else {
// resolve the network address if necessary
std::string endpoint(address.toString());
Optional<NetworkAddress> parsedAddress = NetworkAddress::parseOptional(endpoint);
if (!parsedAddress.present()) {
try {
auto hostname = Hostname::parse(endpoint);
auto resolvedAddress = hostname.resolveBlocking();
if (resolvedAddress.present()) {
CODE_PROBE(true, "Azure backup url with hostname in the endpoint");
parsedAddress = resolvedAddress.get();
}
} catch (Error& e) {
TraceEvent(SevError, "InvalidAzureBackupUrl").error(e).detail("Endpoint", endpoint);
throw backup_invalid_url();
}
}
if (!parsedAddress.present()) {
TraceEvent(SevError, "InvalidAzureBackupUrl").detail("Endpoint", endpoint);
throw backup_invalid_url();
}
auto accountName = u.eat("/"_sr).toString();
// Avoid including ":tls" and "(fromHostname)"
// note: the endpoint needs to contain the account name
// so either "<account_name>.blob.core.windows.net" or "<ip>:<port>/<account_name>"
endpoint =
fmt::format("{}/{}", formatIpPort(parsedAddress.get().ip, parsedAddress.get().port), accountName);
auto containerName = u.eat("/"_sr).toString();
r = makeReference<BackupContainerAzureBlobStore>(
endpoint, accountName, containerName, encryptionKeyFileName);
}
}
#endif
else {

View File

@ -0,0 +1,45 @@
/*
* BlobGranuleCommon.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "fdbclient/BlobGranuleCommon.h"
BlobGranuleSummaryRef summarizeGranuleChunk(Arena& ar, const BlobGranuleChunkRef& chunk) {
BlobGranuleSummaryRef summary;
ASSERT(chunk.snapshotFile.present());
ASSERT(chunk.snapshotVersion != invalidVersion);
ASSERT(chunk.includedVersion >= chunk.snapshotVersion);
ASSERT(chunk.newDeltas.empty());
if (chunk.tenantPrefix.present()) {
summary.keyRange = KeyRangeRef(ar, chunk.keyRange.removePrefix(chunk.tenantPrefix.get()));
} else {
summary.keyRange = KeyRangeRef(ar, chunk.keyRange);
}
summary.snapshotVersion = chunk.snapshotVersion;
summary.snapshotSize = chunk.snapshotFile.get().length;
summary.deltaVersion = chunk.includedVersion;
summary.deltaSize = 0;
for (auto& it : chunk.deltaFiles) {
summary.deltaSize += it.length;
}
return summary;
}

View File

@ -40,6 +40,7 @@
#include <cstring>
#include <fstream> // for perf microbenchmark
#include <limits>
#include <vector>
#define BG_READ_DEBUG false
@ -209,16 +210,21 @@ namespace {
BlobGranuleFileEncryptionKeys getEncryptBlobCipherKey(const BlobGranuleCipherKeysCtx cipherKeysCtx) {
BlobGranuleFileEncryptionKeys eKeys;
// The reconstructed cipher key is 'never' inserted into the BlobCipherKey cache, so choose 'neverExpire'
eKeys.textCipherKey = makeReference<BlobCipherKey>(cipherKeysCtx.textCipherKey.encryptDomainId,
cipherKeysCtx.textCipherKey.baseCipherId,
cipherKeysCtx.textCipherKey.baseCipher.begin(),
cipherKeysCtx.textCipherKey.baseCipher.size(),
cipherKeysCtx.textCipherKey.salt);
cipherKeysCtx.textCipherKey.salt,
std::numeric_limits<int64_t>::max(),
std::numeric_limits<int64_t>::max());
eKeys.headerCipherKey = makeReference<BlobCipherKey>(cipherKeysCtx.headerCipherKey.encryptDomainId,
cipherKeysCtx.headerCipherKey.baseCipherId,
cipherKeysCtx.headerCipherKey.baseCipher.begin(),
cipherKeysCtx.headerCipherKey.baseCipher.size(),
cipherKeysCtx.headerCipherKey.salt);
cipherKeysCtx.headerCipherKey.salt,
std::numeric_limits<int64_t>::max(),
std::numeric_limits<int64_t>::max());
return eKeys;
}
@ -346,7 +352,9 @@ struct IndexBlockRef {
decrypt(cipherKeysCtx.get(), *this, arena);
} else {
TraceEvent("IndexBlockSize").detail("Sz", buffer.size());
if (BG_ENCRYPT_COMPRESS_DEBUG) {
TraceEvent("IndexBlockSize").detail("Sz", buffer.size());
}
ObjectReader dataReader(buffer.begin(), IncludeVersion());
dataReader.deserialize(FileIdentifierFor<IndexBlock>::value, block, arena);
@ -368,7 +376,11 @@ struct IndexBlockRef {
arena, ObjectWriter::toValue(block, IncludeVersion(ProtocolVersion::withBlobGranuleFile())).contents());
}
TraceEvent(SevDebug, "IndexBlockSize").detail("Sz", buffer.size()).detail("Encrypted", cipherKeysCtx.present());
if (BG_ENCRYPT_COMPRESS_DEBUG) {
TraceEvent(SevDebug, "IndexBlockSize")
.detail("Sz", buffer.size())
.detail("Encrypted", cipherKeysCtx.present());
}
}
template <class Ar>
@ -804,10 +816,6 @@ static Standalone<VectorRef<ParsedDeltaBoundaryRef>> loadSnapshotFile(
ASSERT(file.indexBlockRef.block.children.size() >= 2);
// TODO: refactor this out of delta tree
// int commonPrefixLen = commonPrefixLength(index.dataBlockOffsets.front().first,
// index.dataBlockOffsets.back().first);
// find range of blocks needed to read
ChildBlockPointerRef* currentBlock = file.findStartBlock(keyRange.begin);
@ -1157,10 +1165,6 @@ Standalone<VectorRef<ParsedDeltaBoundaryRef>> loadChunkedDeltaFile(const Standal
ASSERT(file.indexBlockRef.block.children.size() >= 2);
// TODO: refactor this out of delta tree
// int commonPrefixLen = commonPrefixLength(index.dataBlockOffsets.front().first,
// index.dataBlockOffsets.back().first);
// find range of blocks needed to read
ChildBlockPointerRef* currentBlock = file.findStartBlock(keyRange.begin);
@ -1169,7 +1173,8 @@ Standalone<VectorRef<ParsedDeltaBoundaryRef>> loadChunkedDeltaFile(const Standal
return deltas;
}
// TODO: could cpu optimize first block a bit more by seeking right to start
// FIXME: shared prefix for key comparison
// FIXME: could cpu optimize first block a bit more by seeking right to start
bool lastBlock = false;
bool prevClearAfter = false;
while (!lastBlock) {
@ -1553,12 +1558,23 @@ RangeResult materializeBlobGranule(const BlobGranuleChunkRef& chunk,
return mergeDeltaStreams(chunk, streams, startClears);
}
struct GranuleLoadFreeHandle : NonCopyable, ReferenceCounted<GranuleLoadFreeHandle> {
const ReadBlobGranuleContext* granuleContext;
int64_t loadId;
GranuleLoadFreeHandle(const ReadBlobGranuleContext* granuleContext, int64_t loadId)
: granuleContext(granuleContext), loadId(loadId) {}
~GranuleLoadFreeHandle() { granuleContext->free_load_f(loadId, granuleContext->userContext); }
};
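// GranuleLoadFreeHandle ties each started load to a reference-counted handle: when the last reference is dropped
// (by clearing freeHandles after each chunk, or by unwinding on error), the destructor calls free_load_f, so the
// corresponding loads are always released.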
struct GranuleLoadIds {
Optional<int64_t> snapshotId;
std::vector<int64_t> deltaIds;
std::vector<Reference<GranuleLoadFreeHandle>> freeHandles;
};
static void startLoad(const ReadBlobGranuleContext granuleContext,
static void startLoad(const ReadBlobGranuleContext* granuleContext,
const BlobGranuleChunkRef& chunk,
GranuleLoadIds& loadIds) {
@ -1568,12 +1584,13 @@ static void startLoad(const ReadBlobGranuleContext granuleContext,
// FIXME: remove when we implement file multiplexing
ASSERT(chunk.snapshotFile.get().offset == 0);
ASSERT(chunk.snapshotFile.get().length == chunk.snapshotFile.get().fullFileLength);
loadIds.snapshotId = granuleContext.start_load_f(snapshotFname.c_str(),
snapshotFname.size(),
chunk.snapshotFile.get().offset,
chunk.snapshotFile.get().length,
chunk.snapshotFile.get().fullFileLength,
granuleContext.userContext);
loadIds.snapshotId = granuleContext->start_load_f(snapshotFname.c_str(),
snapshotFname.size(),
chunk.snapshotFile.get().offset,
chunk.snapshotFile.get().length,
chunk.snapshotFile.get().fullFileLength,
granuleContext->userContext);
loadIds.freeHandles.push_back(makeReference<GranuleLoadFreeHandle>(granuleContext, loadIds.snapshotId.get()));
}
loadIds.deltaIds.reserve(chunk.deltaFiles.size());
for (int deltaFileIdx = 0; deltaFileIdx < chunk.deltaFiles.size(); deltaFileIdx++) {
@ -1581,13 +1598,14 @@ static void startLoad(const ReadBlobGranuleContext granuleContext,
// FIXME: remove when we implement file multiplexing
ASSERT(chunk.deltaFiles[deltaFileIdx].offset == 0);
ASSERT(chunk.deltaFiles[deltaFileIdx].length == chunk.deltaFiles[deltaFileIdx].fullFileLength);
int64_t deltaLoadId = granuleContext.start_load_f(deltaFName.c_str(),
deltaFName.size(),
chunk.deltaFiles[deltaFileIdx].offset,
chunk.deltaFiles[deltaFileIdx].length,
chunk.deltaFiles[deltaFileIdx].fullFileLength,
granuleContext.userContext);
int64_t deltaLoadId = granuleContext->start_load_f(deltaFName.c_str(),
deltaFName.size(),
chunk.deltaFiles[deltaFileIdx].offset,
chunk.deltaFiles[deltaFileIdx].length,
chunk.deltaFiles[deltaFileIdx].fullFileLength,
granuleContext->userContext);
loadIds.deltaIds.push_back(deltaLoadId);
loadIds.freeHandles.push_back(makeReference<GranuleLoadFreeHandle>(granuleContext, deltaLoadId));
}
}
@ -1606,17 +1624,16 @@ ErrorOr<RangeResult> loadAndMaterializeBlobGranules(const Standalone<VectorRef<B
GranuleLoadIds loadIds[files.size()];
// Kick off first file reads if parallelism > 1
for (int i = 0; i < parallelism - 1 && i < files.size(); i++) {
startLoad(granuleContext, files[i], loadIds[i]);
}
try {
// Kick off first file reads if parallelism > 1
for (int i = 0; i < parallelism - 1 && i < files.size(); i++) {
startLoad(&granuleContext, files[i], loadIds[i]);
}
RangeResult results;
for (int chunkIdx = 0; chunkIdx < files.size(); chunkIdx++) {
// Kick off files for this granule if parallelism == 1, or future granule if parallelism > 1
if (chunkIdx + parallelism - 1 < files.size()) {
startLoad(granuleContext, files[chunkIdx + parallelism - 1], loadIds[chunkIdx + parallelism - 1]);
startLoad(&granuleContext, files[chunkIdx + parallelism - 1], loadIds[chunkIdx + parallelism - 1]);
}
RangeResult chunkRows;
@ -1632,7 +1649,8 @@ ErrorOr<RangeResult> loadAndMaterializeBlobGranules(const Standalone<VectorRef<B
}
}
StringRef deltaData[files[chunkIdx].deltaFiles.size()];
// +1 to avoid a UBSAN error for a variable-length array of size zero
StringRef deltaData[files[chunkIdx].deltaFiles.size() + 1];
for (int i = 0; i < files[chunkIdx].deltaFiles.size(); i++) {
deltaData[i] =
StringRef(granuleContext.get_load_f(loadIds[chunkIdx].deltaIds[i], granuleContext.userContext),
@ -1650,12 +1668,8 @@ ErrorOr<RangeResult> loadAndMaterializeBlobGranules(const Standalone<VectorRef<B
results.arena().dependsOn(chunkRows.arena());
results.append(results.arena(), chunkRows.begin(), chunkRows.size());
if (loadIds[chunkIdx].snapshotId.present()) {
granuleContext.free_load_f(loadIds[chunkIdx].snapshotId.get(), granuleContext.userContext);
}
for (int i = 0; i < loadIds[chunkIdx].deltaIds.size(); i++) {
granuleContext.free_load_f(loadIds[chunkIdx].deltaIds[i], granuleContext.userContext);
}
// free once done by forcing FreeHandles to trigger
loadIds[chunkIdx].freeHandles.clear();
}
return ErrorOr<RangeResult>(results);
} catch (Error& e) {
@ -2372,7 +2386,6 @@ void checkDeltaRead(const KeyValueGen& kvGen,
std::string filename = randomBGFilename(
deterministicRandom()->randomUniqueID(), deterministicRandom()->randomUniqueID(), readVersion, ".delta");
Standalone<BlobGranuleChunkRef> chunk;
// TODO need to add cipher keys meta
chunk.deltaFiles.emplace_back_deep(
chunk.arena(), filename, 0, serialized->size(), serialized->size(), kvGen.cipherKeys);
chunk.keyRange = kvGen.allRange;
@ -2429,7 +2442,6 @@ static std::tuple<KeyRange, Version, Version> randomizeKeyAndVersions(const KeyV
}
}
// TODO randomize begin and read version to sometimes +/- 1 and readRange begin and end to keyAfter sometimes
return { readRange, beginVersion, readVersion };
}
@ -2653,7 +2665,11 @@ TEST_CASE("/blobgranule/files/granuleReadUnitTest") {
serializedDeltaFiles,
inMemoryDeltas);
for (int i = 0; i < std::min(100, 5 + snapshotData.size() * deltaData.size()); i++) {
// prevent overflow by doing min before multiply
int maxRuns = 100;
int snapshotAndDeltaSize = 5 + std::min(maxRuns, snapshotData.size()) * std::min(maxRuns, deltaData.size());
int lim = std::min(maxRuns, snapshotAndDeltaSize);
for (int i = 0; i < lim; i++) {
auto params = randomizeKeyAndVersions(kvGen, deltaData);
fmt::print("Partial test {0}: [{1} - {2}) @ {3} - {4}\n",
i,

View File

@ -31,13 +31,6 @@
#include "fdbclient/FDBTypes.h"
#include "flow/actorcompiler.h" // This must be the last #include.
// TODO more efficient data structure besides std::map? PTree is unnecessary since this isn't versioned, but some other
// sorted thing could work. And if it used arenas it'd probably be more efficient with allocations, since everything
// else is in 1 arena and discarded at the end.
// TODO could refactor the file reading code from here and the delta file function into another actor,
// then this part would also be testable? but meh
ACTOR Future<Standalone<StringRef>> readFile(Reference<BlobConnectionProvider> bstoreProvider, BlobFilePointerRef f) {
try {
state Arena arena;
@ -140,3 +133,66 @@ ACTOR Future<Void> readBlobGranules(BlobGranuleFileRequest request,
return Void();
}
// Return true if a given range is fully covered by blob chunks
bool isRangeFullyCovered(KeyRange range, Standalone<VectorRef<BlobGranuleChunkRef>> blobChunks) {
std::vector<KeyRangeRef> blobRanges;
for (const BlobGranuleChunkRef& chunk : blobChunks) {
blobRanges.push_back(chunk.keyRange);
}
return range.isCovered(blobRanges);
}
void testAddChunkRange(KeyRef begin, KeyRef end, Standalone<VectorRef<BlobGranuleChunkRef>>& chunks) {
BlobGranuleChunkRef chunk;
chunk.keyRange = KeyRangeRef(begin, end);
chunks.push_back(chunks.arena(), chunk);
}
TEST_CASE("/fdbserver/blobgranule/isRangeCoveredByBlob") {
Standalone<VectorRef<BlobGranuleChunkRef>> chunks;
// chunk1 key_a1 - key_a9
testAddChunkRange("key_a1"_sr, "key_a9"_sr, chunks);
// chunk2 key_b1 - key_b9
testAddChunkRange("key_b1"_sr, "key_b9"_sr, chunks);
// check empty range. not covered
{ ASSERT(isRangeFullyCovered(KeyRangeRef(), chunks) == false); }
// check empty chunks. not covered
{
Standalone<VectorRef<BlobGranuleChunkRef>> emptyChunks;
ASSERT(isRangeFullyCovered(KeyRangeRef(), emptyChunks) == false);
}
// check '' to \xff
{ ASSERT(isRangeFullyCovered(KeyRangeRef(LiteralStringRef(""), LiteralStringRef("\xff")), chunks) == false); }
// check {key_a1, key_a9}
{ ASSERT(isRangeFullyCovered(KeyRangeRef("key_a1"_sr, "key_a9"_sr), chunks)); }
// check {key_a1, key_a3}
{ ASSERT(isRangeFullyCovered(KeyRangeRef("key_a1"_sr, "key_a3"_sr), chunks)); }
// check {key_a0, key_a3}
{ ASSERT(isRangeFullyCovered(KeyRangeRef("key_a0"_sr, "key_a3"_sr), chunks) == false); }
// check {key_a5, key_b5}
{
auto range = KeyRangeRef("key_a5"_sr, "key_b5"_sr);
ASSERT(isRangeFullyCovered(range, chunks) == false);
ASSERT(range.begin == "key_a5"_sr);
ASSERT(range.end == "key_b5"_sr);
}
// check continued chunks
{
Standalone<VectorRef<BlobGranuleChunkRef>> continuedChunks;
testAddChunkRange("key_a1"_sr, "key_a9"_sr, continuedChunks);
testAddChunkRange("key_a9"_sr, "key_b1"_sr, continuedChunks);
testAddChunkRange("key_b1"_sr, "key_b9"_sr, continuedChunks);
ASSERT(isRangeFullyCovered(KeyRangeRef("key_a1"_sr, "key_b9"_sr), continuedChunks) == false);
}
return Void();
}

View File

@ -90,8 +90,8 @@ add_flow_target(LINK_TEST NAME fdbclientlinktest SRCS LinkTest.cpp)
target_link_libraries(fdbclientlinktest PRIVATE fdbclient rapidxml) # re-link rapidxml due to private link interface
if(BUILD_AZURE_BACKUP)
target_link_libraries(fdbclient PRIVATE curl uuid azure-storage-lite)
target_link_libraries(fdbclient_sampling PRIVATE curl uuid azure-storage-lite)
target_link_libraries(fdbclient PRIVATE curl azure-storage-lite)
target_link_libraries(fdbclient_sampling PRIVATE curl azure-storage-lite)
endif()
if(BUILD_AWS_BACKUP)

View File

@ -42,10 +42,6 @@ void ClientKnobs::initialize(Randomize randomize) {
init( FAILURE_MAX_DELAY, 5.0 );
init( FAILURE_MIN_DELAY, 4.0 ); if( randomize && BUGGIFY ) FAILURE_MIN_DELAY = 1.0;
init( FAILURE_TIMEOUT_DELAY, FAILURE_MIN_DELAY );
init( CLIENT_FAILURE_TIMEOUT_DELAY, FAILURE_MIN_DELAY );
init( FAILURE_EMERGENCY_DELAY, 30.0 );
init( FAILURE_MAX_GENERATIONS, 10 );
init( RECOVERY_DELAY_START_GENERATION, 70 );
init( RECOVERY_DELAY_SECONDS_PER_GENERATION, 60.0 );
init( MAX_GENERATIONS, 100 );
@ -64,6 +60,7 @@ void ClientKnobs::initialize(Randomize randomize) {
init( WRONG_SHARD_SERVER_DELAY, .01 ); if( randomize && BUGGIFY ) WRONG_SHARD_SERVER_DELAY = deterministicRandom()->random01(); // FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY; // SOMEDAY: This delay can limit performance of retrieving data when the cache is mostly wrong (e.g. dumping the database after a test)
init( FUTURE_VERSION_RETRY_DELAY, .01 ); if( randomize && BUGGIFY ) FUTURE_VERSION_RETRY_DELAY = deterministicRandom()->random01();// FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY;
init( GRV_ERROR_RETRY_DELAY, 5.0 ); if( randomize && BUGGIFY ) GRV_ERROR_RETRY_DELAY = 0.01 + 5 * deterministicRandom()->random01();
init( UNKNOWN_TENANT_RETRY_DELAY, 0.0 ); if( randomize && BUGGIFY ) UNKNOWN_TENANT_RETRY_DELAY = deterministicRandom()->random01();
init( REPLY_BYTE_LIMIT, 80000 );
init( DEFAULT_BACKOFF, .01 ); if( randomize && BUGGIFY ) DEFAULT_BACKOFF = deterministicRandom()->random01();
@ -84,6 +81,7 @@ void ClientKnobs::initialize(Randomize randomize) {
init( CHANGE_FEED_CACHE_SIZE, 100000 ); if( randomize && BUGGIFY ) CHANGE_FEED_CACHE_SIZE = 1;
init( CHANGE_FEED_POP_TIMEOUT, 10.0 );
init( CHANGE_FEED_STREAM_MIN_BYTES, 1e4 ); if( randomize && BUGGIFY ) CHANGE_FEED_STREAM_MIN_BYTES = 1;
init( CHANGE_FEED_START_INTERVAL, 10.0 );
init( MAX_BATCH_SIZE, 1000 ); if( randomize && BUGGIFY ) MAX_BATCH_SIZE = 1;
init( GRV_BATCH_TIMEOUT, 0.005 ); if( randomize && BUGGIFY ) GRV_BATCH_TIMEOUT = 0.1;
@ -159,8 +157,6 @@ void ClientKnobs::initialize(Randomize randomize) {
init( BACKUP_AGGREGATE_POLL_RATE_UPDATE_INTERVAL, 60);
init( BACKUP_AGGREGATE_POLL_RATE, 2.0 ); // polls per second target for all agents on the cluster
init( BACKUP_LOG_WRITE_BATCH_MAX_SIZE, 1e6 ); //Must be much smaller than TRANSACTION_SIZE_LIMIT
init( BACKUP_LOG_ATOMIC_OPS_SIZE, 1000 );
init( BACKUP_OPERATION_COST_OVERHEAD, 50 );
init( BACKUP_MAX_LOG_RANGES, 21 ); if( randomize && BUGGIFY ) BACKUP_MAX_LOG_RANGES = 4;
init( BACKUP_SIM_COPY_LOG_RANGES, 100 );
init( BACKUP_VERSION_DELAY, 5*CORE_VERSIONSPERSECOND );
@ -279,18 +275,21 @@ void ClientKnobs::initialize(Randomize randomize) {
init( BUSYNESS_SPIKE_START_THRESHOLD, 0.100 );
init( BUSYNESS_SPIKE_SATURATED_THRESHOLD, 0.500 );
// multi-version client control
init( MVC_CLIENTLIB_CHUNK_SIZE, 8*1024 );
init( MVC_CLIENTLIB_CHUNKS_PER_TRANSACTION, 32 );
// Blob granules
init( BG_MAX_GRANULE_PARALLELISM, 10 );
init( BG_TOO_MANY_GRANULES, 10000 );
init( CHANGE_QUORUM_BAD_STATE_RETRY_TIMES, 3 );
init( CHANGE_QUORUM_BAD_STATE_RETRY_DELAY, 2.0 );
// Tenants and Metacluster
init( MAX_TENANTS_PER_CLUSTER, 1e6 ); if ( randomize && BUGGIFY ) MAX_TENANTS_PER_CLUSTER = deterministicRandom()->randomInt(20, 100);
init( MAX_TENANTS_PER_CLUSTER, 1e6 );
init( TENANT_TOMBSTONE_CLEANUP_INTERVAL, 60 ); if ( randomize && BUGGIFY ) TENANT_TOMBSTONE_CLEANUP_INTERVAL = deterministicRandom()->random01() * 30;
init( MAX_DATA_CLUSTERS, 1e5 );
init( REMOVE_CLUSTER_TENANT_BATCH_SIZE, 1e4 ); if ( randomize && BUGGIFY ) REMOVE_CLUSTER_TENANT_BATCH_SIZE = 1;
init( METACLUSTER_ASSIGNMENT_CLUSTERS_TO_CHECK, 5 ); if ( randomize && BUGGIFY ) METACLUSTER_ASSIGNMENT_CLUSTERS_TO_CHECK = 1;
init( METACLUSTER_ASSIGNMENT_FIRST_CHOICE_DELAY, 1.0 ); if ( randomize && BUGGIFY ) METACLUSTER_ASSIGNMENT_FIRST_CHOICE_DELAY = deterministicRandom()->random01() * 60;
init( METACLUSTER_ASSIGNMENT_AVAILABILITY_TIMEOUT, 10.0 ); if ( randomize && BUGGIFY ) METACLUSTER_ASSIGNMENT_AVAILABILITY_TIMEOUT = 1 + deterministicRandom()->random01() * 59;
// clang-format on
}
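The knob lines above all follow the same shape: init( KNOB, default ) followed by an optional BUGGIFY override in simulation. A minimal standalone sketch of that pattern is below; DemoKnobs, buggify(), and random01() are hypothetical stand-ins for ClientKnobs, BUGGIFY, and deterministicRandom()->random01().

#include <cstdio>
#include <random>

// Hypothetical stand-ins for FDB's BUGGIFY and deterministicRandom(); illustration only.
static std::mt19937 rng(42);
static bool buggify() { return rng() % 4 == 0; }
static double random01() { return std::uniform_real_distribution<double>(0.0, 1.0)(rng); }

struct DemoKnobs {
	double FUTURE_VERSION_RETRY_DELAY;
	double GRV_ERROR_RETRY_DELAY;

	void initialize(bool randomize) {
		// Default value first; optionally replaced with a randomized value in simulation,
		// mirroring the `init( KNOB, value ); if( randomize && BUGGIFY ) KNOB = ...;` lines above.
		FUTURE_VERSION_RETRY_DELAY = 0.01;
		if (randomize && buggify())
			FUTURE_VERSION_RETRY_DELAY = random01();

		GRV_ERROR_RETRY_DELAY = 5.0;
		if (randomize && buggify())
			GRV_ERROR_RETRY_DELAY = 0.01 + 5 * random01();
	}
};

int main() {
	DemoKnobs knobs;
	knobs.initialize(/*randomize=*/true);
	std::printf("FUTURE_VERSION_RETRY_DELAY=%f GRV_ERROR_RETRY_DELAY=%f\n",
	            knobs.FUTURE_VERSION_RETRY_DELAY,
	            knobs.GRV_ERROR_RETRY_DELAY);
	return 0;
}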

View File

@ -23,6 +23,7 @@
#include "fdbclient/CommitTransaction.h"
#include "fdbclient/FDBTypes.h"
#include "fdbclient/ReadYourWrites.h"
#include "flow/UnitTest.h"
#include "flow/actorcompiler.h" // has to be last include
void KeyRangeActorMap::getRangesAffectedByInsertion(const KeyRangeRef& keys, std::vector<KeyRange>& affectedRanges) {
@ -35,32 +36,54 @@ void KeyRangeActorMap::getRangesAffectedByInsertion(const KeyRangeRef& keys, std
affectedRanges.push_back(KeyRangeRef(keys.end, e.end()));
}
RangeResult krmDecodeRanges(KeyRef mapPrefix, KeyRange keys, RangeResult kv) {
RangeResult krmDecodeRanges(KeyRef mapPrefix, KeyRange keys, RangeResult kv, bool align) {
ASSERT(!kv.more || kv.size() > 1);
KeyRange withPrefix =
KeyRangeRef(mapPrefix.toString() + keys.begin.toString(), mapPrefix.toString() + keys.end.toString());
ValueRef beginValue, endValue;
if (kv.size() && kv[0].key.startsWith(mapPrefix))
beginValue = kv[0].value;
if (kv.size() && kv.end()[-1].key.startsWith(mapPrefix))
endValue = kv.end()[-1].value;
RangeResult result;
result.arena().dependsOn(kv.arena());
result.arena().dependsOn(keys.arena());
result.push_back(result.arena(), KeyValueRef(keys.begin, beginValue));
// Always push a kv pair whose key is <= keys.begin.
KeyRef beginKey = keys.begin;
if (!align && !kv.empty() && kv.front().key.startsWith(mapPrefix) && kv.front().key < withPrefix.begin) {
beginKey = kv[0].key.removePrefix(mapPrefix);
}
ValueRef beginValue;
if (!kv.empty() && kv.front().key.startsWith(mapPrefix) && kv.front().key <= withPrefix.begin) {
beginValue = kv.front().value;
}
result.push_back(result.arena(), KeyValueRef(beginKey, beginValue));
for (int i = 0; i < kv.size(); i++) {
if (kv[i].key > withPrefix.begin && kv[i].key < withPrefix.end) {
KeyRef k = kv[i].key.removePrefix(mapPrefix);
result.push_back(result.arena(), KeyValueRef(k, kv[i].value));
} else if (kv[i].key >= withPrefix.end)
} else if (kv[i].key >= withPrefix.end) {
kv.more = false;
// There should be at most 1 value past mapPrefix + keys.end.
ASSERT(i == kv.size() - 1);
break;
}
}
if (!kv.more)
result.push_back(result.arena(), KeyValueRef(keys.end, endValue));
if (!kv.more) {
KeyRef endKey = keys.end;
if (!align && !kv.empty() && kv.back().key.startsWith(mapPrefix) && kv.back().key >= withPrefix.end) {
endKey = kv.back().key.removePrefix(mapPrefix);
}
ValueRef endValue;
if (!kv.empty()) {
// In the aligned case, carry the last value to be the end value.
if (align && kv.back().key.startsWith(mapPrefix) && kv.back().key > withPrefix.end) {
endValue = result.back().value;
} else {
endValue = kv.back().value;
}
}
result.push_back(result.arena(), KeyValueRef(endKey, endValue));
}
result.more = kv.more;
return result;
@ -93,6 +116,37 @@ ACTOR Future<RangeResult> krmGetRanges(Reference<ReadYourWritesTransaction> tr,
return krmDecodeRanges(mapPrefix, keys, kv);
}
// Returns the transitional points in keys and their values. Unlike krmGetRanges, the first and last results
// are not clamped to keys.begin and keys.end; when the map's boundary entries fall outside the requested
// range, their actual keys are returned instead.
ACTOR Future<RangeResult> krmGetRangesUnaligned(Transaction* tr,
Key mapPrefix,
KeyRange keys,
int limit,
int limitBytes) {
KeyRange withPrefix =
KeyRangeRef(mapPrefix.toString() + keys.begin.toString(), mapPrefix.toString() + keys.end.toString());
state GetRangeLimits limits(limit, limitBytes);
limits.minRows = 2;
RangeResult kv = wait(tr->getRange(lastLessOrEqual(withPrefix.begin), firstGreaterThan(withPrefix.end), limits));
return krmDecodeRanges(mapPrefix, keys, kv, false);
}
ACTOR Future<RangeResult> krmGetRangesUnaligned(Reference<ReadYourWritesTransaction> tr,
Key mapPrefix,
KeyRange keys,
int limit,
int limitBytes) {
KeyRange withPrefix =
KeyRangeRef(mapPrefix.toString() + keys.begin.toString(), mapPrefix.toString() + keys.end.toString());
state GetRangeLimits limits(limit, limitBytes);
limits.minRows = 2;
RangeResult kv = wait(tr->getRange(lastLessOrEqual(withPrefix.begin), firstGreaterThan(withPrefix.end), limits));
return krmDecodeRanges(mapPrefix, keys, kv, false);
}
void krmSetPreviouslyEmptyRange(Transaction* tr,
const KeyRef& mapPrefix,
const KeyRangeRef& keys,
@ -254,3 +308,87 @@ Future<Void> krmSetRangeCoalescing(Reference<ReadYourWritesTransaction> const& t
Value const& value) {
return holdWhile(tr, krmSetRangeCoalescing_(tr.getPtr(), mapPrefix, range, maxRange, value));
}
TEST_CASE("/keyrangemap/decoderange/aligned") {
Arena arena;
Key prefix = LiteralStringRef("/prefix/");
StringRef fullKeyA = StringRef(arena, LiteralStringRef("/prefix/a"));
StringRef fullKeyB = StringRef(arena, LiteralStringRef("/prefix/b"));
StringRef fullKeyC = StringRef(arena, LiteralStringRef("/prefix/c"));
StringRef fullKeyD = StringRef(arena, LiteralStringRef("/prefix/d"));
StringRef keyA = StringRef(arena, LiteralStringRef("a"));
StringRef keyB = StringRef(arena, LiteralStringRef("b"));
StringRef keyC = StringRef(arena, LiteralStringRef("c"));
StringRef keyD = StringRef(arena, LiteralStringRef("d"));
StringRef keyE = StringRef(arena, LiteralStringRef("e"));
StringRef keyAB = StringRef(arena, LiteralStringRef("ab"));
StringRef keyCD = StringRef(arena, LiteralStringRef("cd"));
// Fake getRange() call.
RangeResult kv;
kv.push_back(arena, KeyValueRef(fullKeyA, keyA));
kv.push_back(arena, KeyValueRef(fullKeyB, keyB));
kv.push_back(arena, KeyValueRef(fullKeyC, keyC));
kv.push_back(arena, KeyValueRef(fullKeyD, keyD));
// [A, AB(start), B, C, CD(end), D]
RangeResult decodedRanges = krmDecodeRanges(prefix, KeyRangeRef(keyAB, keyCD), kv);
ASSERT(decodedRanges.size() == 4);
ASSERT(decodedRanges.front().key == keyAB);
ASSERT(decodedRanges.front().value == keyA);
ASSERT(decodedRanges.back().key == keyCD);
ASSERT(decodedRanges.back().value == keyC);
// [""(start), A, B, C, D, E(end)]
decodedRanges = krmDecodeRanges(prefix, KeyRangeRef(StringRef(), keyE), kv);
ASSERT(decodedRanges.size() == 6);
ASSERT(decodedRanges.front().key == StringRef());
ASSERT(decodedRanges.front().value == StringRef());
ASSERT(decodedRanges.back().key == keyE);
ASSERT(decodedRanges.back().value == keyD);
return Void();
}
TEST_CASE("/keyrangemap/decoderange/unaligned") {
Arena arena;
Key prefix = LiteralStringRef("/prefix/");
StringRef fullKeyA = StringRef(arena, LiteralStringRef("/prefix/a"));
StringRef fullKeyB = StringRef(arena, LiteralStringRef("/prefix/b"));
StringRef fullKeyC = StringRef(arena, LiteralStringRef("/prefix/c"));
StringRef fullKeyD = StringRef(arena, LiteralStringRef("/prefix/d"));
StringRef keyA = StringRef(arena, LiteralStringRef("a"));
StringRef keyB = StringRef(arena, LiteralStringRef("b"));
StringRef keyC = StringRef(arena, LiteralStringRef("c"));
StringRef keyD = StringRef(arena, LiteralStringRef("d"));
StringRef keyE = StringRef(arena, LiteralStringRef("e"));
StringRef keyAB = StringRef(arena, LiteralStringRef("ab"));
StringRef keyCD = StringRef(arena, LiteralStringRef("cd"));
// Fake getRange() call.
RangeResult kv;
kv.push_back(arena, KeyValueRef(fullKeyA, keyA));
kv.push_back(arena, KeyValueRef(fullKeyB, keyB));
kv.push_back(arena, KeyValueRef(fullKeyC, keyC));
kv.push_back(arena, KeyValueRef(fullKeyD, keyD));
// [A, AB(start), B, C, CD(end), D]
RangeResult decodedRanges = krmDecodeRanges(prefix, KeyRangeRef(keyAB, keyCD), kv, false);
ASSERT(decodedRanges.size() == 4);
ASSERT(decodedRanges.front().key == keyA);
ASSERT(decodedRanges.front().value == keyA);
ASSERT(decodedRanges.back().key == keyD);
ASSERT(decodedRanges.back().value == keyD);
// [""(start), A, B, C, D, E(end)]
decodedRanges = krmDecodeRanges(prefix, KeyRangeRef(StringRef(), keyE), kv, false);
ASSERT(decodedRanges.size() == 6);
ASSERT(decodedRanges.front().key == StringRef());
ASSERT(decodedRanges.front().value == StringRef());
ASSERT(decodedRanges.back().key == keyE);
ASSERT(decodedRanges.back().value == keyD);
return Void();
}
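The two test cases above pin down the difference between aligned and unaligned decoding: aligned results clamp the first and last keys to the requested boundaries, while unaligned results keep the map's actual boundary keys. A simplified standalone sketch of that boundary handling follows, using std::map and std::string in place of RangeResult and KeyRef; it is illustrative only, not the krmDecodeRanges implementation.

#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;

std::vector<KV> decodeRanges(const std::map<std::string, std::string>& kv,
                             const std::string& begin,
                             const std::string& end,
                             bool align) {
	std::vector<KV> result;
	// Boundary entry at or before `begin`: aligned output clamps the key to `begin`,
	// unaligned output keeps the stored key when it lies strictly before `begin`.
	auto it = kv.upper_bound(begin);
	std::string beginKey = begin, beginValue;
	if (it != kv.begin()) {
		--it;
		beginValue = it->second;
		if (!align && it->first < begin)
			beginKey = it->first;
		++it;
	}
	result.push_back({ beginKey, beginValue });
	// Interior transition points are emitted unchanged.
	for (; it != kv.end() && it->first < end; ++it)
		result.push_back({ it->first, it->second });
	// Boundary entry at or after `end`: clamped to `end` when aligned, kept as stored otherwise.
	std::string endKey = end, endValue = result.back().second;
	if (it != kv.end()) {
		if (!align)
			endKey = it->first;
		endValue = align ? result.back().second : it->second;
	}
	result.push_back({ endKey, endValue });
	return result;
}

int main() {
	std::map<std::string, std::string> kv = { { "a", "a" }, { "b", "b" }, { "c", "c" }, { "d", "d" } };

	auto aligned = decodeRanges(kv, "ab", "cd", /*align=*/true);
	assert(aligned.size() == 4 && aligned.front().first == "ab" && aligned.back().first == "cd");

	auto unaligned = decodeRanges(kv, "ab", "cd", /*align=*/false);
	assert(unaligned.size() == 4 && unaligned.front().first == "a" && unaligned.back().first == "d");
	return 0;
}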

View File

@ -2559,7 +2559,7 @@ TEST_CASE("/ManagementAPI/AutoQuorumChange/checkLocality") {
ProcessClass(ProcessClass::CoordinatorClass, ProcessClass::CommandLineSource),
"",
"",
currentProtocolVersion);
currentProtocolVersion());
}
workers.push_back(data);

fdbclient/Metacluster.cpp Normal file
View File

@ -0,0 +1,71 @@
/*
* Metacluster.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "fdbclient/Metacluster.h"
#include "fdbclient/MetaclusterManagement.actor.h"
FDB_DEFINE_BOOLEAN_PARAM(AddNewTenants);
FDB_DEFINE_BOOLEAN_PARAM(RemoveMissingTenants);
std::string DataClusterEntry::clusterStateToString(DataClusterState clusterState) {
switch (clusterState) {
case DataClusterState::READY:
return "ready";
case DataClusterState::REMOVING:
return "removing";
case DataClusterState::RESTORING:
return "restoring";
default:
UNREACHABLE();
}
}
DataClusterState DataClusterEntry::stringToClusterState(std::string stateStr) {
if (stateStr == "ready") {
return DataClusterState::READY;
} else if (stateStr == "removing") {
return DataClusterState::REMOVING;
} else if (stateStr == "restoring") {
return DataClusterState::RESTORING;
}
UNREACHABLE();
}
json_spirit::mObject DataClusterEntry::toJson() const {
json_spirit::mObject obj;
obj["capacity"] = capacity.toJson();
obj["allocated"] = allocated.toJson();
obj["cluster_state"] = DataClusterEntry::clusterStateToString(clusterState);
return obj;
}
json_spirit::mObject ClusterUsage::toJson() const {
json_spirit::mObject obj;
obj["num_tenant_groups"] = numTenantGroups;
return obj;
}
KeyBackedObjectProperty<MetaclusterRegistrationEntry, decltype(IncludeVersion())>&
MetaclusterMetadata::metaclusterRegistration() {
static KeyBackedObjectProperty<MetaclusterRegistrationEntry, decltype(IncludeVersion())> instance(
"\xff/metacluster/clusterRegistration"_sr, IncludeVersion());
return instance;
}

View File

@ -0,0 +1,67 @@
/*
* MetaclusterManagement.actor.cpp
*
* This source file is part of the FoundationDB open source project
*
* Copyright 2013-2022 Apple Inc. and the FoundationDB project authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "fdbclient/ClusterConnectionMemoryRecord.h"
#include "fdbclient/DatabaseContext.h"
#include "fdbclient/FDBTypes.h"
#include "fdbclient/MetaclusterManagement.actor.h"
#include "fdbclient/ThreadSafeTransaction.h"
#include "flow/actorcompiler.h" // has to be last include
namespace MetaclusterAPI {
ACTOR Future<Reference<IDatabase>> openDatabase(ClusterConnectionString connectionString) {
if (g_network->isSimulated()) {
Reference<IClusterConnectionRecord> clusterFile =
makeReference<ClusterConnectionMemoryRecord>(connectionString);
Database nativeDb = Database::createDatabase(clusterFile, -1);
Reference<IDatabase> threadSafeDb =
wait(unsafeThreadFutureToFuture(ThreadSafeDatabase::createFromExistingDatabase(nativeDb)));
return MultiVersionDatabase::debugCreateFromExistingDatabase(threadSafeDb);
} else {
return MultiVersionApi::api->createDatabaseFromConnectionString(connectionString.toString().c_str());
}
}
KeyBackedObjectMap<ClusterName, DataClusterEntry, decltype(IncludeVersion())>&
ManagementClusterMetadata::dataClusters() {
static KeyBackedObjectMap<ClusterName, DataClusterEntry, decltype(IncludeVersion())> instance(
"metacluster/dataCluster/metadata/"_sr, IncludeVersion());
return instance;
}
KeyBackedMap<ClusterName,
ClusterConnectionString,
TupleCodec<ClusterName>,
ManagementClusterMetadata::ConnectionStringCodec>
ManagementClusterMetadata::dataClusterConnectionRecords("metacluster/dataCluster/connectionString/"_sr);
KeyBackedSet<Tuple> ManagementClusterMetadata::clusterCapacityIndex("metacluster/clusterCapacityIndex/"_sr);
KeyBackedMap<ClusterName, int64_t, TupleCodec<ClusterName>, BinaryCodec<int64_t>>
ManagementClusterMetadata::clusterTenantCount("metacluster/clusterTenantCount/"_sr);
KeyBackedSet<Tuple> ManagementClusterMetadata::clusterTenantIndex("metacluster/dataCluster/tenantMap/"_sr);
KeyBackedSet<Tuple> ManagementClusterMetadata::clusterTenantGroupIndex("metacluster/dataCluster/tenantGroupMap/"_sr);
TenantMetadataSpecification& ManagementClusterMetadata::tenantMetadata() {
static TenantMetadataSpecification instance(""_sr);
return instance;
}
}; // namespace MetaclusterAPI
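The dataClusters() and tenantMetadata() accessors above, like metaclusterRegistration() in Metacluster.cpp, return function-local statics rather than namespace-scope globals. A minimal sketch of that construct-on-first-use idiom is below; KeyBackedRegistry and dataClusterRegistry are hypothetical stand-ins for the KeyBacked* metadata types and their accessors.

#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a KeyBacked* metadata object keyed by a subspace prefix.
struct KeyBackedRegistry {
	std::string prefix;
	explicit KeyBackedRegistry(std::string p) : prefix(std::move(p)) {}
};

// Construct-on-first-use: the object is built the first time the accessor runs, which
// sidesteps cross-translation-unit static initialization order problems and avoids
// constructing the object at all if it is never used.
KeyBackedRegistry& dataClusterRegistry() {
	static KeyBackedRegistry instance("metacluster/dataCluster/metadata/");
	return instance;
}

int main() {
	assert(&dataClusterRegistry() == &dataClusterRegistry()); // same instance on every call
	assert(dataClusterRegistry().prefix == "metacluster/dataCluster/metadata/");
	return 0;
}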

View File

@ -663,69 +663,43 @@ ACTOR Future<Void> asyncDeserializeClusterInterface(Reference<AsyncVar<Value>> s
}
}
struct ClientStatusStats {
int count;
std::vector<std::pair<NetworkAddress, Key>> examples;
namespace {
ClientStatusStats() : count(0) { examples.reserve(CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT); }
};
void tryInsertIntoSamples(OpenDatabaseRequest::Samples& samples,
const NetworkAddress& networkAddress,
const Key& traceLogGroup) {
++samples.count;
if (samples.samples.size() < static_cast<size_t>(CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT)) {
samples.samples.insert({ networkAddress, traceLogGroup });
}
}
} // namespace
OpenDatabaseRequest ClientData::getRequest() {
OpenDatabaseRequest req;
std::map<StringRef, ClientStatusStats> issueMap;
std::map<ClientVersionRef, ClientStatusStats> versionMap;
std::map<StringRef, ClientStatusStats> maxProtocolMap;
int clientCount = 0;
// SOMEDAY: add a yield in this loop
for (auto& ci : clientStatusInfoMap) {
for (auto& it : ci.second.issues) {
auto& entry = issueMap[it];
entry.count++;
if (entry.examples.size() < CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT) {
entry.examples.emplace_back(ci.first, ci.second.traceLogGroup);
}
}
if (ci.second.versions.size()) {
clientCount++;
StringRef maxProtocol;
for (auto& it : ci.second.versions) {
maxProtocol = std::max(maxProtocol, it.protocolVersion);
auto& entry = versionMap[it];
entry.count++;
if (entry.examples.size() < CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT) {
entry.examples.emplace_back(ci.first, ci.second.traceLogGroup);
}
}
auto& maxEntry = maxProtocolMap[maxProtocol];
maxEntry.count++;
if (maxEntry.examples.size() < CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT) {
maxEntry.examples.emplace_back(ci.first, ci.second.traceLogGroup);
}
} else {
auto& entry = versionMap[ClientVersionRef()];
entry.count++;
if (entry.examples.size() < CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT) {
entry.examples.emplace_back(ci.first, ci.second.traceLogGroup);
}
}
}
const auto& networkAddress = ci.first;
const auto& traceLogGroup = ci.second.traceLogGroup;
req.issues.reserve(issueMap.size());
for (auto& it : issueMap) {
req.issues.push_back(ItemWithExamples<Key>(it.first, it.second.count, it.second.examples));
for (auto& issue : ci.second.issues) {
tryInsertIntoSamples(req.issues[issue], networkAddress, traceLogGroup);
}
if (!ci.second.versions.size()) {
tryInsertIntoSamples(req.supportedVersions[ClientVersionRef()], networkAddress, traceLogGroup);
continue;
}
++req.clientCount;
StringRef maxProtocol;
for (auto& it : ci.second.versions) {
maxProtocol = std::max(maxProtocol, it.protocolVersion);
tryInsertIntoSamples(req.supportedVersions[it], networkAddress, traceLogGroup);
}
tryInsertIntoSamples(req.maxProtocolSupported[maxProtocol], networkAddress, traceLogGroup);
}
req.supportedVersions.reserve(versionMap.size());
for (auto& it : versionMap) {
req.supportedVersions.push_back(
ItemWithExamples<Standalone<ClientVersionRef>>(it.first, it.second.count, it.second.examples));
}
req.maxProtocolSupported.reserve(maxProtocolMap.size());
for (auto& it : maxProtocolMap) {
req.maxProtocolSupported.push_back(ItemWithExamples<Key>(it.first, it.second.count, it.second.examples));
}
req.clientCount = clientCount;
return req;
}
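The refactor above replaces the per-bucket ClientStatusStats maps with OpenDatabaseRequest::Samples populated by tryInsertIntoSamples, which counts every client but stores at most CLIENT_EXAMPLE_AMOUNT examples per bucket. Below is a standalone sketch of that bounded-examples counter, with std::string stand-ins for NetworkAddress and Key and a hypothetical kExampleLimit in place of the knob.

#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

constexpr std::size_t kExampleLimit = 2; // hypothetical stand-in for CLIENT_KNOBS->CLIENT_EXAMPLE_AMOUNT

struct Samples {
	int64_t count = 0;
	std::set<std::pair<std::string, std::string>> samples; // (network address, trace log group)
};

// Count every occurrence, but remember at most kExampleLimit distinct examples.
void tryInsertIntoSamples(Samples& s, const std::string& address, const std::string& traceLogGroup) {
	++s.count;
	if (s.samples.size() < kExampleLimit) {
		s.samples.insert({ address, traceLogGroup });
	}
}

int main() {
	Samples issueSamples;
	tryInsertIntoSamples(issueSamples, "10.0.0.1:4500", "groupA");
	tryInsertIntoSamples(issueSamples, "10.0.0.2:4500", "groupA");
	tryInsertIntoSamples(issueSamples, "10.0.0.3:4500", "groupB"); // counted, but not kept as an example
	assert(issueSamples.count == 3);
	assert(issueSamples.samples.size() == kExampleLimit);
	return 0;
}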

Some files were not shown because too many files have changed in this diff.