From 6d3a514fe23da1980eb69f384388cc6b4e62129f Mon Sep 17 00:00:00 2001
From: Bhaskar Muppana
Date: Thu, 22 Feb 2018 18:09:41 -0800
Subject: [PATCH 001/127] Backup design documentation.

---
 design/backup.md | 116 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100644 design/backup.md

diff --git a/design/backup.md b/design/backup.md
new file mode 100644
index 0000000000..6bcd22bb02
--- /dev/null
+++ b/design/backup.md
@@ -0,0 +1,116 @@
+### File Backup Design
+
+This implementation of `BackupAgentBase` implements a mechanism to back up a keyspace into files. The files can live on
+the local file system or in a remote blob store. The basic idea is that the backup task records KV ranges and mutation
+logs as multiple files in a container (a directory). Restore reads the files from a container and applies the KV
+ranges and then the mutation logs.
+
+Just like everything else in FoundationDB, backup and restore depend on monotonically increasing versions. The simplest
+way to describe a backup is: record KV ranges at a version v0 and record mutation logs from v1 to vn. That gives us the
+capability to restore to any version between v0 and vn. Restoring to version vk (v0 <= vk <= vn) is done by
+* loading the KV ranges recorded at v0.
+* if (vk == v0), nothing more is needed; otherwise replaying the mutation logs from v1 to vk.
+
+There is a small flaw in this simplistic design: versions in FoundationDB are short lived (expiring in 5 seconds).
+There is no way to take a snapshot, so there is no way to record KV ranges for the complete keyspace at a single
+version. For a keyspace a-z, it is not possible to record the KV range (a-z, v0) unless the keyspace a-z is small
+enough. Instead, we can record KV ranges {(a-b, v0), (c-d, v1), (e-f, v2) ... (y-z, v10)}. With the mutation log
+recorded all along, we can still use the simple backup-restore scheme described above on each sub keyspace separately.
+Assuming we recorded the mutation log from v0 to vn, that allows us to restore
+
+* Keyspace a-b to any version between v0 and vn
+* Keyspace c-d to any version between v1 and vn
+* Keyspace y-z to any version between v10 and vn
+
+But we are not interested in restoring sub keyspaces; we want to restore a-z. We can restore a-z to any version
+between v10 and vn by restoring the individual sub keyspaces separately.
+
+#### Key Value Ranges
+
+Backing up KV ranges and restoring them just uses the standard client API: tr->getRange() to read KV ranges and
+tr->set() to restore them.
+
+#### Mutation Logs
+
+For a backup-enabled cluster, the proxy keeps a copy of all mutations in the system keyspace for the backup task to
+pick up and record. During restore, the restore task puts the mutations into the system keyspace, then the proxy reads
+them and applies them to the database.
+
+#### Restore
+
+As discussed above, KV ranges are recorded per sub keyspace, whereas the mutation log covers the full keyspace. Taking
+the same example:
+
+* KV ranges => {(a-b, v0), (c-d, v1), (e-f, v2) ... (y-z, v10)}
+* Mutation log => [(a-z, v0), (a-z, v1), .. , (a-z, v10), .. , (a-z, vn)]
+  => (a-z, vx) represents all mutations on keys a-z at version vx.
+
+Restoring to version vk (v10 < vk <= vn) needs the KV ranges to be restored first and then the mutation logs replayed.
+For each KV range (k1-k2, vx) that is restored, we need to replay the mutation log [(k1-k2, vx+1), .., (k1-k2, vk)].
+But this requires scanning the complete mutation log to get the mutations for k1-k2, which is sub-optimal; for any
+decent sized database this will take forever.
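Before moving to the improved, version-ordered scheme described next, the basic "snapshot plus mutation log" idea from the paragraphs above can be sketched with a toy model. This is not FoundationDB code; the types, the in-memory `std::map` standing in for the database, and the function names are all illustrative assumptions.

```cpp
// Illustrative sketch only (not FoundationDB source): a toy in-memory model of the
// "KV snapshot at v0 + mutation log from v1 to vn" scheme described above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Version = int64_t;

struct Mutation {
    Version version;     // commit version of the mutation
    std::string key;
    std::string value;
    bool isClear;        // true means "clear this key"; real logs also have range clears
};

// KV pairs recorded at snapshot version v0 (a real backup stores these as range files).
using Snapshot = std::map<std::string, std::string>;

// Restore to version vk: load the snapshot, then replay every logged mutation
// with version <= vk, in version order. The log is assumed to start at v1 (> v0).
std::map<std::string, std::string> restoreToVersion(const Snapshot& snapshotAtV0,
                                                    const std::vector<Mutation>& log,
                                                    Version vk) {
    std::map<std::string, std::string> db = snapshotAtV0;  // database state as of v0
    for (const Mutation& m : log) {                        // log is sorted by version
        if (m.version > vk) break;                         // stop at the target version
        if (m.isClear) db.erase(m.key);
        else db[m.key] = m.value;
    }
    return db;
}
```

The improved scheme that follows changes only the order in which ranges and log entries are applied, not this basic picture.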
+
+Instead of looking at restore on the key space, we can replay events on the version space; that way we need to scan
+the mutation log only once. At each version vx:
+* Wait for all the KV ranges recorded before vx to be restored first.
+* Replay (a-z, vx), but ignore mutations for keys that have not yet been restored by KV ranges.
+
+For the above example, it would look like
+* KV ranges:
+  * Restore KV ranges in parallel
+* Mutation Logs:
+  * At v0:
+    * Wait for all KV ranges recorded before v0 to restore - nothing
+    * Nothing to replay as no keys have been restored by KV ranges yet.
+  * At v1:
+    * Wait for all KV ranges recorded before v1 to restore - {(a-b, v0)}
+    * Replay (a-b, v1)
+  * At v10:
+    * Wait for all KV ranges recorded before v10 to restore - everything except (y-z, v10)
+    * Replay (a-x, v10)
+  * At v11:
+    * Nothing to wait for
+    * Replay (a-z, v11)
+  * At vk:
+    * Replay (a-z, vk)
+
+Even though logging mutations is shown as a continuous task in the above description, it is actually divided into two
+logical parts. While KV ranges are being backed up, we run a mutation log backup in parallel. Until we complete the KV
+range backup and the parallel mutation log backup, the backup is not yet restorable. Once these two tasks are
+complete, the cluster can be restored to the last version. To be able to restore even after the KV range backup, we
+continue to back up mutation logs; this is called differential log backup. With differential backup, we can restore to
+any version since the KV range backup completed. It looks like the diagram below.
+
+```
+ Backup Too large                                    x                                   x
+ Diff backup                      ---------------------               ---------------------
+ Mutations              -----------                         -----------
+ KV Range               -----------                         -----------
+ Start               x                                   x
+ Version space       ----------------------------------------------------------------------
+                     a  b         c                   d  e  f         g                   h
+```
+
+To explain the sequence of events here:
+
+|Version|Event|
+|-----------|----------------------------------------------------------------------------------|
+| a | Asked to start backup|
+| b | Started backing up KV ranges and mutation logs|
+| c | Completed backing up KV ranges and mutation logs; also mark backup restorable|
+| c+1 | Started differential backup of logs|
+| d | Decided backup is already too large and discontinued differential backup|
+| e | Asked to start backup|
+| f | Started backing up KV ranges and mutation logs|
+| g | Completed backing up KV ranges and mutation logs; also mark backup restorable|
+| g+1 | Started differential backup of logs|
+| h | Decided backup is already too large and discontinued differential backup|
+
+With the above backup scenario, we could restore to any version from c to d or g to h. No other versions are
+restorable.
+
+#### Continuous Backup
+
+Instead of going through the pain of monitoring for the backup becoming too large and restarting it, we could just
+have the backup running continuously, with the KV range task restarting once in a while (with some policy, of course).
+Continuous backup is already committed; it is not discussed here yet.
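To make the single-pass, version-ordered replay above concrete, here is a small standalone sketch. It is not FoundationDB code and is deliberately simplified: only `set` mutations, range snapshots applied sequentially rather than in parallel, and hypothetical types standing in for the files in a backup container.

```cpp
// Illustrative sketch only (not FoundationDB source): replay the mutation log once, in
// version order, applying a mutation only after the KV range covering its key is restored.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Version = int64_t;

struct RangeSnapshot {
    std::string begin, end;                    // covers keys in [begin, end)
    Version version;                           // version at which the range was recorded
    std::map<std::string, std::string> data;   // stand-in for KV pairs a backup read with tr->getRange()
};

struct Mutation {
    Version version;
    std::string key, value;                    // only "set" mutations in this toy model
};

std::map<std::string, std::string> restore(std::vector<RangeSnapshot> ranges,      // sorted by version
                                           const std::vector<Mutation>& log,       // sorted by version
                                           Version targetVersion) {
    std::map<std::string, std::string> db;
    std::vector<RangeSnapshot> restored;       // range snapshots applied so far
    size_t nextRange = 0;
    auto covered = [&](const std::string& k) { // is key k inside any already-restored range?
        for (const RangeSnapshot& r : restored)
            if (k >= r.begin && k < r.end) return true;
        return false;
    };
    for (const Mutation& m : log) {
        if (m.version > targetVersion) break;
        // "Wait for all the KV ranges recorded before this version to restore first."
        while (nextRange < ranges.size() && ranges[nextRange].version < m.version) {
            for (const auto& kv : ranges[nextRange].data) db[kv.first] = kv.second;
            restored.push_back(ranges[nextRange++]);
        }
        // Replay the mutation, ignoring keys no restored range covers yet.
        if (covered(m.key)) db[m.key] = m.value;
    }
    // Apply any remaining range snapshots recorded at or before the target version.
    while (nextRange < ranges.size() && ranges[nextRange].version <= targetVersion) {
        for (const auto& kv : ranges[nextRange].data) db[kv.first] = kv.second;
        ++nextRange;
    }
    return db;
}
```

The point the sketch mirrors is that a mutation at version vx is applied only after every KV range recorded before vx has been restored, and only if some restored range covers its key, so the mutation log is scanned exactly once.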
From 470f5c01f3bb597392a8eec0bd9e5e55ad9dfe64 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Mon, 26 Feb 2018 17:09:09 -0800 Subject: [PATCH 002/127] changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases --- fdbclient/ManagementAPI.actor.cpp | 2 +- fdbserver/ClusterController.actor.cpp | 20 +++++++-------- fdbserver/DataDistribution.actor.cpp | 6 ++--- fdbserver/DataDistribution.h | 2 +- fdbserver/DatabaseConfiguration.cpp | 25 ++++++++++++++----- fdbserver/DatabaseConfiguration.h | 2 +- fdbserver/SimulatedCluster.actor.cpp | 7 +++--- fdbserver/masterserver.actor.cpp | 12 ++++----- .../workloads/ConsistencyCheck.actor.cpp | 4 +-- 9 files changed, 47 insertions(+), 33 deletions(-) diff --git a/fdbclient/ManagementAPI.actor.cpp b/fdbclient/ManagementAPI.actor.cpp index 0f5d55cb17..e6553e423f 100644 --- a/fdbclient/ManagementAPI.actor.cpp +++ b/fdbclient/ManagementAPI.actor.cpp @@ -69,7 +69,7 @@ std::map configForToken( std::string const& mode ) { out[p+key] = value; } - if( key == "primary_dc" || key == "remote_dc" || key == "primary_satellite_dcs" || key == "remote_satellite_dcs" ) { + if( key == "primary_dc" || key == "remote_dcs" || key == "primary_satellite_dcs" || key == "remote_satellite_dcs" ) { out[p+key] = value; } diff --git a/fdbserver/ClusterController.actor.cpp b/fdbserver/ClusterController.actor.cpp index b72b7e3537..5abf340c36 100644 --- a/fdbserver/ClusterController.actor.cpp +++ b/fdbserver/ClusterController.actor.cpp @@ -530,10 +530,10 @@ public: std::map< Optional>, int> id_used; id_used[masterProcessId]++; id_used[clusterControllerProcessId]++; - ASSERT(dcId == req.configuration.primaryDcId || dcId == req.configuration.remoteDcId); + ASSERT(req.configuration.remoteDcIds.size() == 1 && ( dcId == req.configuration.primaryDcId || dcId == req.configuration.remoteDcIds[0] ) ); std::set> primaryDC; - primaryDC.insert(dcId == req.configuration.primaryDcId ? req.configuration.primaryDcId : req.configuration.remoteDcId); - result.remoteDcId = dcId == req.configuration.primaryDcId ? req.configuration.remoteDcId : req.configuration.primaryDcId; + primaryDC.insert(dcId == req.configuration.primaryDcId ? req.configuration.primaryDcId : req.configuration.remoteDcIds[0]); + result.remoteDcId = dcId == req.configuration.primaryDcId ? 
req.configuration.remoteDcIds[0] : req.configuration.primaryDcId; if(req.recruitSeedServers) { auto primaryStorageServers = getWorkersForSeedServers( req.configuration, req.configuration.storagePolicy, dcId ); @@ -601,7 +601,7 @@ public: setPrimaryDesired = true; vector> dcPriority; dcPriority.push_back(req.configuration.primaryDcId); - dcPriority.push_back(req.configuration.remoteDcId); + dcPriority.push_back(req.configuration.remoteDcIds[0]); desiredDcIds.set(dcPriority); if(reply.isError()) { throw reply.getError(); @@ -614,16 +614,16 @@ public: throw; } TraceEvent(SevWarn, "AttemptingRecruitmentInRemoteDC", id).error(e); - auto reply = findWorkersForConfiguration(req, req.configuration.remoteDcId); + auto reply = findWorkersForConfiguration(req, req.configuration.remoteDcIds[0]); if(!setPrimaryDesired) { vector> dcPriority; - dcPriority.push_back(req.configuration.remoteDcId); + dcPriority.push_back(req.configuration.remoteDcIds[0]); dcPriority.push_back(req.configuration.primaryDcId); desiredDcIds.set(dcPriority); } if(reply.isError()) { throw reply.getError(); - } else if(req.configuration.remoteDcId == clusterControllerDcId) { + } else if(req.configuration.remoteDcIds[0] == clusterControllerDcId) { return reply.get(); } throw; @@ -735,7 +735,7 @@ public: vector> dcPriority; dcPriority.push_back(db.config.primaryDcId); - dcPriority.push_back(db.config.remoteDcId); + dcPriority.push_back(db.config.remoteDcIds[0]); desiredDcIds.set(dcPriority); } catch( Error &e ) { if(e.code() != error_code_no_more_servers) { @@ -835,8 +835,8 @@ public: std::set> satelliteDCs; std::set> remoteDC; if(db.config.primaryDcId.present()) { - primaryDC.insert(clusterControllerDcId == db.config.primaryDcId ? db.config.primaryDcId : db.config.remoteDcId); - remoteDC.insert(clusterControllerDcId == db.config.primaryDcId ? db.config.remoteDcId : db.config.primaryDcId); + primaryDC.insert(clusterControllerDcId == db.config.primaryDcId ? db.config.primaryDcId : db.config.remoteDcIds[0]); + remoteDC.insert(clusterControllerDcId == db.config.primaryDcId ? 
db.config.remoteDcIds[0] : db.config.primaryDcId); if(db.config.satelliteTLogReplicationFactor > 0) { if( clusterControllerDcId == db.config.primaryDcId ) { satelliteDCs.insert( db.config.primarySatelliteDcIds.begin(), db.config.primarySatelliteDcIds.end() ); diff --git a/fdbserver/DataDistribution.actor.cpp b/fdbserver/DataDistribution.actor.cpp index eac0ca1cf3..007bf23467 100644 --- a/fdbserver/DataDistribution.actor.cpp +++ b/fdbserver/DataDistribution.actor.cpp @@ -2058,7 +2058,7 @@ ACTOR Future dataDistribution( Reference logSystem, Version recoveryCommitVersion, std::vector> primaryDcId, - std::vector> remoteDcId, + std::vector> remoteDcIds, double* lastLimited) { state Database cx = openDBOnServer(db, TaskDataDistributionLaunch, true, true); @@ -2172,9 +2172,9 @@ ACTOR Future dataDistribution( actors.push_back( popOldTags( cx, logSystem, recoveryCommitVersion) ); actors.push_back( reportErrorsExcept( dataDistributionTracker( initData, cx, shardsAffectedByTeamFailure, output, getShardMetrics, getAverageShardBytes.getFuture(), readyToStart, anyZeroHealthyTeams, mi.id() ), "DDTracker", mi.id(), &normalDDQueueErrors() ) ); actors.push_back( reportErrorsExcept( dataDistributionQueue( cx, output, getShardMetrics, tcis, shardsAffectedByTeamFailure, lock, getAverageShardBytes, mi, storageTeamSize, configuration.durableStorageQuorum, lastLimited ), "DDQueue", mi.id(), &normalDDQueueErrors() ) ); - actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[0], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, primaryDcId, configuration.remoteTLogReplicationFactor > 0 ? remoteDcId : std::vector>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[0] ), "DDTeamCollectionPrimary", mi.id(), &normalDDQueueErrors() ) ); + actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[0], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, primaryDcId, configuration.remoteTLogReplicationFactor > 0 ? 
remoteDcIds : std::vector>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[0] ), "DDTeamCollectionPrimary", mi.id(), &normalDDQueueErrors() ) ); if (configuration.remoteTLogReplicationFactor > 0) { - actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[1], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, remoteDcId, Optional>>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[1] ), "DDTeamCollectionSecondary", mi.id(), &normalDDQueueErrors() ) ); + actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[1], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, remoteDcIds, Optional>>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[1] ), "DDTeamCollectionSecondary", mi.id(), &normalDDQueueErrors() ) ); } Void _ = wait( waitForAll( actors ) ); diff --git a/fdbserver/DataDistribution.h b/fdbserver/DataDistribution.h index 817f457375..ed2a35e7c1 100644 --- a/fdbserver/DataDistribution.h +++ b/fdbserver/DataDistribution.h @@ -174,7 +174,7 @@ Future dataDistribution( Reference const& logSystem, Version const& recoveryCommitVersion, std::vector> const& primaryDcId, - std::vector> const& remoteDcId, + std::vector> const& remoteDcIds, double* const& lastLimited); Future dataDistributionTracker( diff --git a/fdbserver/DatabaseConfiguration.cpp b/fdbserver/DatabaseConfiguration.cpp index 236d19f87b..851fab0397 100644 --- a/fdbserver/DatabaseConfiguration.cpp +++ b/fdbserver/DatabaseConfiguration.cpp @@ -34,7 +34,8 @@ void DatabaseConfiguration::resetInternal() { autoMasterProxyCount = CLIENT_KNOBS->DEFAULT_AUTO_PROXIES; autoResolverCount = CLIENT_KNOBS->DEFAULT_AUTO_RESOLVERS; autoDesiredTLogCount = CLIENT_KNOBS->DEFAULT_AUTO_LOGS; - primaryDcId = remoteDcId = Optional>(); + primaryDcId = Optional>(); + remoteDcIds.clear(); tLogPolicy = storagePolicy = remoteTLogPolicy = satelliteTLogPolicy = IRepPolicyRef(); remoteDesiredTLogCount = satelliteDesiredTLogCount = desiredLogRouterCount = -1; @@ -99,8 +100,9 @@ bool DatabaseConfiguration::isValid() const { getDesiredRemoteLogs() >= 1 && getDesiredLogRouters() >= 1 && remoteTLogReplicationFactor >= 0 && - ( remoteTLogReplicationFactor == 0 || ( remoteTLogPolicy && primaryDcId.present() && remoteDcId.present() && durableStorageQuorum == storageTeamSize ) ) && - primaryDcId.present() == remoteDcId.present() && + remoteDcIds.size() < 2 && + ( remoteTLogReplicationFactor == 0 || ( remoteTLogPolicy && primaryDcId.present() && remoteDcIds.size() && durableStorageQuorum == storageTeamSize ) ) && + primaryDcId.present() == remoteDcIds.size() && getDesiredSatelliteLogs() >= 1 && satelliteTLogReplicationFactor >= 0 && satelliteTLogWriteAntiQuorum >= 0 && @@ -146,8 +148,19 @@ std::map DatabaseConfiguration::toMap() const { if(primaryDcId.present()) { result["primary_dc"] = printable(primaryDcId.get()); } - if(remoteDcId.present()) { - result["remote_dc"] = printable(remoteDcId.get()); + if(remoteDcIds.size()) { + std::string remoteDcStr = ""; + bool first = true; + for(auto& it : remoteDcIds) { + if(it.present()) { + if(!first) { + remoteDcStr += ","; + first = false; + } + remoteDcStr += printable(it.get()); + } + } + result["remote_dcs"] = remoteDcStr; } if(primarySatelliteDcIds.size()) { std::string primaryDcStr = ""; @@ -265,7 +278,7 @@ bool DatabaseConfiguration::setInternal(KeyRef key, ValueRef value) { else if (ck == LiteralStringRef("satellite_anti_quorum")) parse(&satelliteTLogWriteAntiQuorum, value); else if (ck 
== LiteralStringRef("satellite_usable_dcs")) parse(&satelliteTLogUsableDcs, value); else if (ck == LiteralStringRef("primary_dc")) primaryDcId = value; - else if (ck == LiteralStringRef("remote_dc")) remoteDcId = value; + else if (ck == LiteralStringRef("remote_dcs")) parse(&remoteDcIds, value); else if (ck == LiteralStringRef("primary_satellite_dcs")) parse(&primarySatelliteDcIds, value); else if (ck == LiteralStringRef("remote_satellite_dcs")) parse(&remoteSatelliteDcIds, value); else if (ck == LiteralStringRef("log_routers")) parse(&desiredLogRouterCount, value); diff --git a/fdbserver/DatabaseConfiguration.h b/fdbserver/DatabaseConfiguration.h index db30e8098b..c1c560f279 100644 --- a/fdbserver/DatabaseConfiguration.h +++ b/fdbserver/DatabaseConfiguration.h @@ -98,7 +98,7 @@ struct DatabaseConfiguration { int32_t remoteTLogReplicationFactor; int32_t desiredLogRouterCount; IRepPolicyRef remoteTLogPolicy; - Optional> remoteDcId; + std::vector>> remoteDcIds; // Satellite TLogs IRepPolicyRef satelliteTLogPolicy; diff --git a/fdbserver/SimulatedCluster.actor.cpp b/fdbserver/SimulatedCluster.actor.cpp index d6809ff714..d96fb7988e 100644 --- a/fdbserver/SimulatedCluster.actor.cpp +++ b/fdbserver/SimulatedCluster.actor.cpp @@ -745,7 +745,8 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { if(generateFearless || (datacenters == 2 && g_random->random01() < 0.5)) { db.primaryDcId = LiteralStringRef("0"); - db.remoteDcId = LiteralStringRef("1"); + db.remoteDcIds.resize(1); + db.remoteDcIds[0] = LiteralStringRef("1"); } if(generateFearless) { @@ -836,7 +837,7 @@ std::string SimulationConfig::toString() { if(db.primaryDcId.present()) { config << " primary_dc=" << db.primaryDcId.get().printable(); - config << " remote_dc=" << db.remoteDcId.get().printable(); + config << " remote_dcs=" << db.remoteDcIds[0].get().printable(); } if(db.primarySatelliteDcIds.size()) { @@ -868,7 +869,7 @@ void setupSimulatedSystem( vector> *systemActors, std::string baseF g_simulator.primaryDcId = simconfig.db.primaryDcId; g_simulator.hasRemoteReplication = simconfig.db.remoteTLogReplicationFactor > 0; g_simulator.remoteTLogPolicy = simconfig.db.remoteTLogPolicy; - g_simulator.remoteDcId = simconfig.db.remoteDcId; + if(simconfig.db.remoteDcIds.size()) g_simulator.remoteDcId = simconfig.db.remoteDcIds[0]; g_simulator.hasSatelliteReplication = simconfig.db.satelliteTLogReplicationFactor > 0; g_simulator.satelliteTLogPolicy = simconfig.db.satelliteTLogPolicy; g_simulator.satelliteTLogWriteAntiQuorum = simconfig.db.satelliteTLogWriteAntiQuorum; diff --git a/fdbserver/masterserver.actor.cpp b/fdbserver/masterserver.actor.cpp index 5c16f0544c..3e7fa69bc3 100644 --- a/fdbserver/masterserver.actor.cpp +++ b/fdbserver/masterserver.actor.cpp @@ -173,7 +173,7 @@ struct MasterData : NonCopyable, ReferenceCounted { DatabaseConfiguration originalConfiguration; DatabaseConfiguration configuration; std::vector> primaryDcId; - std::vector> remoteDcId; + std::vector> remoteDcIds; bool hasConfiguration; ServerCoordinators coordinators; @@ -290,7 +290,7 @@ ACTOR Future newResolvers( Reference self, RecruitFromConfigur ACTOR Future newTLogServers( Reference self, RecruitFromConfigurationReply recr, Reference oldLogSystem, vector>* initialConfChanges ) { if(self->configuration.remoteTLogReplicationFactor > 0) { - state Optional primaryDcId = recr.remoteDcId == self->configuration.remoteDcId ? 
self->configuration.primaryDcId : self->configuration.remoteDcId; + state Optional primaryDcId = recr.remoteDcId == self->configuration.remoteDcIds[0] ? self->configuration.primaryDcId : self->configuration.remoteDcIds[0]; if( !self->dcId_locality.count(primaryDcId) ) { TraceEvent(SevWarnAlways, "UnknownPrimaryDCID", self->dbgid).detail("found", self->dcId_locality.count(primaryDcId)).detail("primaryId", printable(primaryDcId)); int8_t loc = self->getNextLocality(); @@ -551,10 +551,10 @@ ACTOR Future recruitEverything( Reference self, vectorconfiguration, self->lastEpochEnd==0 ) ) ) ); self->primaryDcId.clear(); - self->remoteDcId.clear(); + self->remoteDcIds.clear(); if(recruits.remoteDcId.present()) { - self->primaryDcId.push_back(recruits.remoteDcId == self->configuration.remoteDcId ? self->configuration.primaryDcId : self->configuration.remoteDcId); - self->remoteDcId.push_back(recruits.remoteDcId); + self->primaryDcId.push_back(recruits.remoteDcId == self->configuration.remoteDcIds[0] ? self->configuration.primaryDcId : self->configuration.remoteDcIds[0]); + self->remoteDcIds.push_back(recruits.remoteDcId); } TraceEvent("MasterRecoveryState", self->dbgid) @@ -1253,7 +1253,7 @@ ACTOR Future masterCore( Reference self ) { { PromiseStream< std::pair> > ddStorageServerChanges; state double lastLimited = 0; - self->addActor.send( reportErrorsExcept( dataDistribution( self->dbInfo, self->myInterface, self->configuration, ddStorageServerChanges, self->logSystem, self->recoveryTransactionVersion, self->primaryDcId, self->remoteDcId, &lastLimited ), "DataDistribution", self->dbgid, &normalMasterErrors() ) ); + self->addActor.send( reportErrorsExcept( dataDistribution( self->dbInfo, self->myInterface, self->configuration, ddStorageServerChanges, self->logSystem, self->recoveryTransactionVersion, self->primaryDcId, self->remoteDcIds, &lastLimited ), "DataDistribution", self->dbgid, &normalMasterErrors() ) ); self->addActor.send( reportErrors( rateKeeper( self->dbInfo, ddStorageServerChanges, self->myInterface.getRateInfo.getFuture(), self->dbName, self->configuration, &lastLimited ), "Ratekeeper", self->dbgid) ); } diff --git a/fdbserver/workloads/ConsistencyCheck.actor.cpp b/fdbserver/workloads/ConsistencyCheck.actor.cpp index 11669f2dd9..2535c02eac 100644 --- a/fdbserver/workloads/ConsistencyCheck.actor.cpp +++ b/fdbserver/workloads/ConsistencyCheck.actor.cpp @@ -1069,8 +1069,8 @@ struct ConsistencyCheckWorkload : TestWorkload } if((!configuration.primaryDcId.present() && missingStorage.size()) || - (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.primaryDcId) && missingStorage.count(configuration.remoteDcId)) || - (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.primaryDcId) || missingStorage.count(configuration.remoteDcId)))) { + (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.primaryDcId) && missingStorage.count(configuration.remoteDcIds[0])) || + (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.primaryDcId) || missingStorage.count(configuration.remoteDcIds[0])))) { self->testFailure("No storage server on worker"); return false; } From 47dc6de55d632642793064b9d774b285f85885fe Mon Sep 17 00:00:00 2001 From: Alvin Moore Date: Thu, 1 Mar 2018 11:17:20 -0800 Subject: [PATCH 003/127] Fixed suffix 
for build dependency --- build/packages.mk | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/build/packages.mk b/build/packages.mk index b91e994611..36f10dcc9d 100644 --- a/build/packages.mk +++ b/build/packages.mk @@ -26,7 +26,7 @@ PACKAGE_CONTENTS := $(addprefix bin/, $(PACKAGE_BINARIES)) $(addprefix bin/, $(a packages: TGZ FDBSERVERAPI -TGZ: $(PACKAGE_CONTENTS) versions.target lib/libfdb_java.so +TGZ: $(PACKAGE_CONTENTS) versions.target lib/libfdb_java.$(DLEXT) @echo "Archiving tgz" @mkdir -p packages @rm -f packages/FoundationDB-$(PLATFORM)-*.tar.gz From 9463cad2140f03ca49d69e423f7503c61cceea9d Mon Sep 17 00:00:00 2001 From: AlvinMooreSr <36203359+AlvinMooreSr@users.noreply.github.com> Date: Mon, 5 Mar 2018 14:21:56 -0800 Subject: [PATCH 004/127] Updated Version to 6.0 --- versions.target | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/versions.target b/versions.target index 68447f3feb..2d9f589a33 100644 --- a/versions.target +++ b/versions.target @@ -1,7 +1,7 @@ - 5.2.0 - 5.2 + 6.0.0 + 6.0 From 1194e3a3619efea4856adf7da899b7487d28af89 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Mon, 5 Mar 2018 19:27:46 -0800 Subject: [PATCH 005/127] added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed. --- fdbclient/ManagementAPI.actor.cpp | 63 +--- fdbserver/ClusterController.actor.cpp | 121 +++--- fdbserver/ClusterRecruitmentInterface.h | 4 +- fdbserver/DatabaseConfiguration.cpp | 350 +++++++++++------- fdbserver/DatabaseConfiguration.h | 107 ++++-- fdbserver/SimulatedCluster.actor.cpp | 224 ++++++----- fdbserver/Status.actor.cpp | 46 +-- fdbserver/TagPartitionedLogSystem.actor.cpp | 12 +- fdbserver/masterserver.actor.cpp | 30 +- fdbserver/tester.actor.cpp | 4 +- .../workloads/ConsistencyCheck.actor.cpp | 6 +- 11 files changed, 535 insertions(+), 432 deletions(-) diff --git a/fdbclient/ManagementAPI.actor.cpp b/fdbclient/ManagementAPI.actor.cpp index e6553e423f..2688a58cfc 100644 --- a/fdbclient/ManagementAPI.actor.cpp +++ b/fdbclient/ManagementAPI.actor.cpp @@ -69,8 +69,13 @@ std::map configForToken( std::string const& mode ) { out[p+key] = value; } - if( key == "primary_dc" || key == "remote_dcs" || key == "primary_satellite_dcs" || key == "remote_satellite_dcs" ) { - out[p+key] = value; + if( key == "regions" ) { + json_spirit::mValue mv; + json_spirit::read_string( value, mv ); + + StatusObject regionObj; + regionObj["regions"] = mv; + out[p+key] = BinaryWriter::toValue(regionObj, IncludeVersion()).toString(); } return out; @@ -173,7 +178,7 @@ std::map configForToken( std::string const& mode ) { remote_redundancy="3"; remote_log_replicas="3"; remoteTLogPolicy = IRepPolicyRef(new PolicyAcross(3, "zoneid", IRepPolicyRef(new PolicyOne()))); - } else if(mode == "remote_three_data_hall") { + } else if(mode == "remote_three_data_hall") { //FIXME: not tested in simulation remote_redundancy="3"; remote_log_replicas="4"; remoteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "data_hall", @@ -182,8 +187,6 @@ std::map configForToken( std::string const& mode ) { } else remoteRedundancySpecified = false; if (remoteRedundancySpecified) { - out[p+"remote_storage_replicas"] = - out[p+"remote_storage_quorum"] = remote_redundancy; out[p+"remote_log_replicas"] = remote_log_replicas; BinaryWriter policyWriter(IncludeVersion()); @@ -192,56 +195,6 @@ std::map configForToken( std::string const& mode ) { return out; } - std::string satellite_log_replicas, satellite_anti_quorum, 
satellite_usable_dcs; - IRepPolicyRef satelliteTLogPolicy; - bool satelliteRedundancySpecified = true; - if (mode == "satellite_none") { - satellite_anti_quorum="0"; - satellite_usable_dcs="0"; - satellite_log_replicas="0"; - satelliteTLogPolicy = IRepPolicyRef(); - } else if (mode == "one_satellite_single") { - satellite_anti_quorum="0"; - satellite_usable_dcs="1"; - satellite_log_replicas="1"; - satelliteTLogPolicy = IRepPolicyRef(new PolicyOne()); - } else if(mode == "one_satellite_double") { - satellite_anti_quorum="0"; - satellite_usable_dcs="1"; - satellite_log_replicas="2"; - satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))); - } else if(mode == "one_satellite_triple") { - satellite_anti_quorum="0"; - satellite_usable_dcs="1"; - satellite_log_replicas="3"; - satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(3, "zoneid", IRepPolicyRef(new PolicyOne()))); - } else if(mode == "two_satellite_safe") { - satellite_anti_quorum="0"; - satellite_usable_dcs="2"; - satellite_log_replicas="4"; - satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "dcid", - IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))) - )); - } else if(mode == "two_satellite_fast") { - satellite_anti_quorum="2"; - satellite_usable_dcs="2"; - satellite_log_replicas="4"; - satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "dcid", - IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))) - )); - } else - satelliteRedundancySpecified = false; - if (satelliteRedundancySpecified) { - out[p+"satellite_anti_quorum"] = satellite_anti_quorum; - out[p+"satellite_usable_dcs"] = satellite_usable_dcs; - out[p+"satellite_log_replicas"] = satellite_log_replicas; - - BinaryWriter policyWriter(IncludeVersion()); - serializeReplicationPolicy(policyWriter, satelliteTLogPolicy); - out[p+"satellite_log_policy"] = policyWriter.toStringRef().toString(); - return out; - } - return out; } diff --git a/fdbserver/ClusterController.actor.cpp b/fdbserver/ClusterController.actor.cpp index 5abf340c36..f50129a153 100644 --- a/fdbserver/ClusterController.actor.cpp +++ b/fdbserver/ClusterController.actor.cpp @@ -530,18 +530,32 @@ public: std::map< Optional>, int> id_used; id_used[masterProcessId]++; id_used[clusterControllerProcessId]++; - ASSERT(req.configuration.remoteDcIds.size() == 1 && ( dcId == req.configuration.primaryDcId || dcId == req.configuration.remoteDcIds[0] ) ); - std::set> primaryDC; - primaryDC.insert(dcId == req.configuration.primaryDcId ? req.configuration.primaryDcId : req.configuration.remoteDcIds[0]); - result.remoteDcId = dcId == req.configuration.primaryDcId ? 
req.configuration.remoteDcIds[0] : req.configuration.primaryDcId; + ASSERT(dcId.present()); + + std::set> primaryDC; + primaryDC.insert(dcId); + result.dcId = dcId; + + Optional remoteDcId; + RegionInfo region; + for(auto& r : req.configuration.regions) { + if(r.dcId != dcId.get()) { + ASSERT(!remoteDcId.present()); + remoteDcId = r.dcId; + } else { + ASSERT(region.dcId == StringRef()); + region = r; + } + } + if(req.recruitSeedServers) { auto primaryStorageServers = getWorkersForSeedServers( req.configuration, req.configuration.storagePolicy, dcId ); for(int i = 0; i < primaryStorageServers.size(); i++) result.storageServers.push_back(primaryStorageServers[i].first); if(req.configuration.remoteTLogReplicationFactor > 0) { - auto remoteStorageServers = getWorkersForSeedServers( req.configuration, req.configuration.storagePolicy, result.remoteDcId ); + auto remoteStorageServers = getWorkersForSeedServers( req.configuration, req.configuration.storagePolicy, remoteDcId ); for(int i = 0; i < remoteStorageServers.size(); i++) result.storageServers.push_back(remoteStorageServers[i].first); } @@ -553,15 +567,13 @@ public: } std::vector> satelliteLogs; - if(req.configuration.satelliteTLogReplicationFactor > 0) { + if(region.satelliteTLogReplicationFactor > 0) { std::set> satelliteDCs; - if( dcId == req.configuration.primaryDcId ) { - satelliteDCs.insert( req.configuration.primarySatelliteDcIds.begin(), req.configuration.primarySatelliteDcIds.end() ); - } else { - satelliteDCs.insert( req.configuration.remoteSatelliteDcIds.begin(), req.configuration.remoteSatelliteDcIds.end() ); + for(auto& s : region.satellites) { + satelliteDCs.insert(s.dcId); } //FIXME: recruitment does not respect usable_dcs, a.k.a if usable_dcs is 1 we should recruit all tlogs in one data center - satelliteLogs = getWorkersForTlogs( req.configuration, req.configuration.satelliteTLogReplicationFactor, req.configuration.getDesiredSatelliteLogs(), req.configuration.satelliteTLogPolicy, id_used, false, satelliteDCs ); + satelliteLogs = getWorkersForTlogs( req.configuration, region.satelliteTLogReplicationFactor, req.configuration.getDesiredSatelliteLogs(dcId), region.satelliteTLogPolicy, id_used, false, satelliteDCs ); for(int i = 0; i < satelliteLogs.size(); i++) { result.satelliteTLogs.push_back(satelliteLogs[i].first); @@ -584,7 +596,7 @@ public: if( now() - startTime < SERVER_KNOBS->WAIT_FOR_GOOD_RECRUITMENT_DELAY && ( RoleFitness(tlogs, ProcessClass::TLog) > RoleFitness(SERVER_KNOBS->EXPECTED_TLOG_FITNESS, req.configuration.getDesiredLogs()) || - ( req.configuration.satelliteTLogReplicationFactor > 0 && RoleFitness(satelliteLogs, ProcessClass::TLog) > RoleFitness(SERVER_KNOBS->EXPECTED_TLOG_FITNESS, req.configuration.getDesiredSatelliteLogs()) ) || + ( region.satelliteTLogReplicationFactor > 0 && RoleFitness(satelliteLogs, ProcessClass::TLog) > RoleFitness(SERVER_KNOBS->EXPECTED_TLOG_FITNESS, req.configuration.getDesiredSatelliteLogs(dcId)) ) || RoleFitness(proxies, ProcessClass::Proxy) > RoleFitness(SERVER_KNOBS->EXPECTED_PROXY_FITNESS, req.configuration.getDesiredProxies()) || RoleFitness(resolvers, ProcessClass::Resolver) > RoleFitness(SERVER_KNOBS->EXPECTED_RESOLVER_FITNESS, req.configuration.getDesiredResolvers()) ) ) { return operation_failed(); @@ -594,18 +606,18 @@ public: } RecruitFromConfigurationReply findWorkersForConfiguration( RecruitFromConfigurationRequest const& req ) { - if(req.configuration.primaryDcId.present()) { + if(req.configuration.regions.size() > 1) { bool setPrimaryDesired = false; try { - auto 
reply = findWorkersForConfiguration(req, req.configuration.primaryDcId); + auto reply = findWorkersForConfiguration(req, req.configuration.regions[0].dcId); setPrimaryDesired = true; vector> dcPriority; - dcPriority.push_back(req.configuration.primaryDcId); - dcPriority.push_back(req.configuration.remoteDcIds[0]); + dcPriority.push_back(req.configuration.regions[0].dcId); + dcPriority.push_back(req.configuration.regions[1].dcId); desiredDcIds.set(dcPriority); if(reply.isError()) { throw reply.getError(); - } else if(req.configuration.primaryDcId == clusterControllerDcId) { + } else if(clusterControllerDcId.present() && req.configuration.regions[0].dcId == clusterControllerDcId.get()) { return reply.get(); } throw no_more_servers(); @@ -614,16 +626,16 @@ public: throw; } TraceEvent(SevWarn, "AttemptingRecruitmentInRemoteDC", id).error(e); - auto reply = findWorkersForConfiguration(req, req.configuration.remoteDcIds[0]); + auto reply = findWorkersForConfiguration(req, req.configuration.regions[1].dcId); if(!setPrimaryDesired) { vector> dcPriority; - dcPriority.push_back(req.configuration.remoteDcIds[0]); - dcPriority.push_back(req.configuration.primaryDcId); + dcPriority.push_back(req.configuration.regions[1].dcId); + dcPriority.push_back(req.configuration.regions[0].dcId); desiredDcIds.set(dcPriority); } if(reply.isError()) { throw reply.getError(); - } else if(req.configuration.remoteDcIds[0] == clusterControllerDcId) { + } else if(clusterControllerDcId.present() && req.configuration.regions[1].dcId == clusterControllerDcId.get()) { return reply.get(); } throw; @@ -714,28 +726,30 @@ public: } void checkPrimaryDC() { - if(db.config.primaryDcId.present() && db.config.primaryDcId != clusterControllerDcId) { + if(db.config.regions.size() && clusterControllerDcId.present() && db.config.regions[0].dcId != clusterControllerDcId.get()) { try { std::map< Optional>, int> id_used; - getWorkerForRoleInDatacenter(db.config.primaryDcId, ProcessClass::ClusterController, ProcessClass::ExcludeFit, db.config, id_used, true); - getWorkerForRoleInDatacenter(db.config.primaryDcId, ProcessClass::Master, ProcessClass::ExcludeFit, db.config, id_used, true); + getWorkerForRoleInDatacenter(db.config.regions[0].dcId, ProcessClass::ClusterController, ProcessClass::ExcludeFit, db.config, id_used, true); + getWorkerForRoleInDatacenter(db.config.regions[0].dcId, ProcessClass::Master, ProcessClass::ExcludeFit, db.config, id_used, true); std::set> primaryDC; - primaryDC.insert(db.config.primaryDcId); + primaryDC.insert(db.config.regions[0].dcId); getWorkersForTlogs(db.config, db.config.tLogReplicationFactor, db.config.desiredTLogCount, db.config.tLogPolicy, id_used, true, primaryDC); - if(db.config.satelliteTLogReplicationFactor > 0) { + if(db.config.regions[0].satelliteTLogReplicationFactor > 0) { std::set> satelliteDCs; - satelliteDCs.insert( db.config.primarySatelliteDcIds.begin(), db.config.primarySatelliteDcIds.end() ); - getWorkersForTlogs(db.config, db.config.satelliteTLogReplicationFactor, db.config.getDesiredSatelliteLogs(), db.config.satelliteTLogPolicy, id_used, true, satelliteDCs); + for(auto &s : db.config.regions[0].satellites) { + satelliteDCs.insert(s.dcId); + } + getWorkersForTlogs(db.config, db.config.regions[0].satelliteTLogReplicationFactor, db.config.getDesiredSatelliteLogs(db.config.regions[0].dcId), db.config.regions[0].satelliteTLogPolicy, id_used, true, satelliteDCs); } - getWorkerForRoleInDatacenter( db.config.primaryDcId, ProcessClass::Resolver, ProcessClass::ExcludeFit, db.config, id_used, true 
); - getWorkerForRoleInDatacenter( db.config.primaryDcId, ProcessClass::Proxy, ProcessClass::ExcludeFit, db.config, id_used, true ); + getWorkerForRoleInDatacenter( db.config.regions[0].dcId, ProcessClass::Resolver, ProcessClass::ExcludeFit, db.config, id_used, true ); + getWorkerForRoleInDatacenter( db.config.regions[0].dcId, ProcessClass::Proxy, ProcessClass::ExcludeFit, db.config, id_used, true ); vector> dcPriority; - dcPriority.push_back(db.config.primaryDcId); - dcPriority.push_back(db.config.remoteDcIds[0]); + dcPriority.push_back(db.config.regions[0].dcId); + dcPriority.push_back(db.config.regions[1].dcId); desiredDcIds.set(dcPriority); } catch( Error &e ) { if(e.code() != error_code_no_more_servers) { @@ -834,33 +848,42 @@ public: std::set> primaryDC; std::set> satelliteDCs; std::set> remoteDC; - if(db.config.primaryDcId.present()) { - primaryDC.insert(clusterControllerDcId == db.config.primaryDcId ? db.config.primaryDcId : db.config.remoteDcIds[0]); - remoteDC.insert(clusterControllerDcId == db.config.primaryDcId ? db.config.remoteDcIds[0] : db.config.primaryDcId); - if(db.config.satelliteTLogReplicationFactor > 0) { - if( clusterControllerDcId == db.config.primaryDcId ) { - satelliteDCs.insert( db.config.primarySatelliteDcIds.begin(), db.config.primarySatelliteDcIds.end() ); + + RegionInfo region; + if(db.config.regions.size() > 1 && clusterControllerDcId.present()) { + primaryDC.insert(clusterControllerDcId); + for(auto& r : db.config.regions) { + if(r.dcId != clusterControllerDcId.get()) { + ASSERT(remoteDC.empty()); + remoteDC.insert(r.dcId); } else { - satelliteDCs.insert( db.config.remoteSatelliteDcIds.begin(), db.config.remoteSatelliteDcIds.end() ); + ASSERT(region.dcId == StringRef()); + region = r; + } + } + + if(region.satelliteTLogReplicationFactor > 0) { + for(auto &s : region.satellites) { + satelliteDCs.insert(s.dcId); } } } // Check tLog fitness RoleFitness oldTLogFit(tlogs, ProcessClass::TLog); - RoleFitness newTLotFit(getWorkersForTlogs(db.config, db.config.tLogReplicationFactor, db.config.desiredTLogCount, db.config.tLogPolicy, id_used, true, primaryDC), ProcessClass::TLog); + RoleFitness newTLogFit(getWorkersForTlogs(db.config, db.config.tLogReplicationFactor, db.config.desiredTLogCount, db.config.tLogPolicy, id_used, true, primaryDC), ProcessClass::TLog); - if(oldTLogFit < newTLotFit) return false; + if(oldTLogFit < newTLogFit) return false; RoleFitness oldSatelliteTLogFit(satellite_tlogs, ProcessClass::TLog); - RoleFitness newSatelliteTLotFit(db.config.satelliteTLogReplicationFactor > 0 ? getWorkersForTlogs(db.config, db.config.satelliteTLogReplicationFactor, db.config.getDesiredSatelliteLogs(), db.config.satelliteTLogPolicy, id_used, true, satelliteDCs) : satellite_tlogs, ProcessClass::TLog); + RoleFitness newSatelliteTLogFit(region.satelliteTLogReplicationFactor > 0 ? getWorkersForTlogs(db.config, region.satelliteTLogReplicationFactor, db.config.getDesiredSatelliteLogs(clusterControllerDcId), region.satelliteTLogPolicy, id_used, true, satelliteDCs) : satellite_tlogs, ProcessClass::TLog); - if(oldSatelliteTLogFit < newSatelliteTLotFit) return false; + if(oldSatelliteTLogFit < newSatelliteTLogFit) return false; RoleFitness oldRemoteTLogFit(remote_tlogs, ProcessClass::TLog); - RoleFitness newRemoteTLotFit((db.config.remoteTLogReplicationFactor > 0 && dbi.recoveryState == RecoveryState::REMOTE_RECOVERED) ? 
getWorkersForTlogs(db.config, db.config.remoteTLogReplicationFactor, db.config.getDesiredRemoteLogs(), db.config.remoteTLogPolicy, id_used, true, remoteDC) : remote_tlogs, ProcessClass::TLog); + RoleFitness newRemoteTLogFit((db.config.remoteTLogReplicationFactor > 0 && dbi.recoveryState == RecoveryState::REMOTE_RECOVERED) ? getWorkersForTlogs(db.config, db.config.remoteTLogReplicationFactor, db.config.getDesiredRemoteLogs(), db.config.remoteTLogPolicy, id_used, true, remoteDC) : remote_tlogs, ProcessClass::TLog); - if(oldRemoteTLogFit < newRemoteTLotFit) return false; + if(oldRemoteTLogFit < newRemoteTLogFit) return false; RoleFitness oldLogRoutersFit(log_routers, ProcessClass::LogRouter); RoleFitness newLogRoutersFit((db.config.remoteTLogReplicationFactor > 0 && dbi.recoveryState == RecoveryState::REMOTE_RECOVERED) ? getWorkersForRoleInDatacenter( *remoteDC.begin(), ProcessClass::LogRouter, db.config.getDesiredLogRouters(), db.config, id_used, Optional(), true ) : log_routers, ProcessClass::LogRouter); @@ -882,11 +905,11 @@ public: if(oldInFit.betterFitness(newInFit)) return false; - if(oldTLogFit > newTLotFit || oldInFit > newInFit || oldSatelliteTLogFit > newSatelliteTLotFit || oldRemoteTLogFit > newRemoteTLotFit || oldLogRoutersFit > newLogRoutersFit) { + if(oldTLogFit > newTLogFit || oldInFit > newInFit || oldSatelliteTLogFit > newSatelliteTLogFit || oldRemoteTLogFit > newRemoteTLogFit || oldLogRoutersFit > newLogRoutersFit) { TraceEvent("BetterMasterExists", id).detail("oldMasterFit", oldMasterFit).detail("newMasterFit", mworker.fitness) - .detail("oldTLogFitC", oldTLogFit.count).detail("newTLotFitC", newTLotFit.count) - .detail("oldTLogWorstFitT", oldTLogFit.worstFit).detail("newTLotWorstFitT", newTLotFit.worstFit) - .detail("oldTLogBestFitT", oldTLogFit.bestFit).detail("newTLotBestFitT", newTLotFit.bestFit) + .detail("oldTLogFitC", oldTLogFit.count).detail("newTLogFitC", newTLogFit.count) + .detail("oldTLogWorstFitT", oldTLogFit.worstFit).detail("newTLogWorstFitT", newTLogFit.worstFit) + .detail("oldTLogBestFitT", oldTLogFit.bestFit).detail("newTLogBestFitT", newTLogFit.bestFit) .detail("oldInFitW", oldInFit.worstFit).detail("newInFitW", newInFit.worstFit) .detail("oldInFitB", oldInFit.bestFit).detail("newInFitB", newInFit.bestFit) .detail("oldInFitC", oldInFit.count).detail("newInFitC", newInFit.count); diff --git a/fdbserver/ClusterRecruitmentInterface.h b/fdbserver/ClusterRecruitmentInterface.h index e6f3182b63..b3f23527ac 100644 --- a/fdbserver/ClusterRecruitmentInterface.h +++ b/fdbserver/ClusterRecruitmentInterface.h @@ -85,11 +85,11 @@ struct RecruitFromConfigurationReply { vector proxies; vector resolvers; vector storageServers; - Optional remoteDcId; + Optional dcId; template void serialize( Ar& ar ) { - ar & tLogs & satelliteTLogs & proxies & resolvers & storageServers & remoteDcId; + ar & tLogs & satelliteTLogs & proxies & resolvers & storageServers & dcId; } }; diff --git a/fdbserver/DatabaseConfiguration.cpp b/fdbserver/DatabaseConfiguration.cpp index 851fab0397..c40f49a19d 100644 --- a/fdbserver/DatabaseConfiguration.cpp +++ b/fdbserver/DatabaseConfiguration.cpp @@ -34,25 +34,11 @@ void DatabaseConfiguration::resetInternal() { autoMasterProxyCount = CLIENT_KNOBS->DEFAULT_AUTO_PROXIES; autoResolverCount = CLIENT_KNOBS->DEFAULT_AUTO_RESOLVERS; autoDesiredTLogCount = CLIENT_KNOBS->DEFAULT_AUTO_LOGS; - primaryDcId = Optional>(); - remoteDcIds.clear(); - tLogPolicy = storagePolicy = remoteTLogPolicy = satelliteTLogPolicy = IRepPolicyRef(); + regions.clear(); + tLogPolicy = 
storagePolicy = remoteTLogPolicy = IRepPolicyRef(); - remoteDesiredTLogCount = satelliteDesiredTLogCount = desiredLogRouterCount = -1; - remoteTLogReplicationFactor = satelliteTLogReplicationFactor = satelliteTLogWriteAntiQuorum = satelliteTLogUsableDcs = 0; - primarySatelliteDcIds.clear(); - remoteSatelliteDcIds.clear(); -} - -void parse( std::vector>>* dcs, ValueRef const& v ) { - int lastBegin = 0; - for(int i = 0; i < v.size(); i++) { - if(v[i] == ',') { - dcs->push_back(v.substr(lastBegin,i)); - lastBegin = i + 1; - } - } - dcs->push_back(v.substr(lastBegin)); + remoteDesiredTLogCount = desiredLogRouterCount = -1; + remoteTLogReplicationFactor = 0; } void parse( int* i, ValueRef const& v ) { @@ -65,6 +51,73 @@ void parseReplicationPolicy(IRepPolicyRef* policy, ValueRef const& v) { serializeReplicationPolicy(reader, *policy); } +void parse( std::vector* regions, ValueRef const& v ) { + try { + StatusObject statusObj = BinaryReader::fromStringRef(v, IncludeVersion()); + StatusArray regionArray = statusObj["regions"].get_array(); + regions->clear(); + for (StatusObjectReader dc : regionArray) { + RegionInfo info; + std::string idStr; + dc.get("id", idStr); + info.dcId = idStr; + dc.get("priority", info.priority); + dc.tryGet("satellite_logs", info.satelliteDesiredTLogCount); + std::string satelliteReplication; + if(dc.tryGet("satellite_redundancy_mode", satelliteReplication)) { + if(satelliteReplication == "one_satellite_single") { + info.satelliteTLogReplicationFactor = 1; + info.satelliteTLogUsableDcs = 1; + info.satelliteTLogWriteAntiQuorum = 0; + info.satelliteTLogPolicy = IRepPolicyRef(new PolicyOne()); + } else if(satelliteReplication == "one_satellite_double") { + info.satelliteTLogReplicationFactor = 2; + info.satelliteTLogUsableDcs = 1; + info.satelliteTLogWriteAntiQuorum = 0; + info.satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))); + } else if(satelliteReplication == "one_satellite_triple") { + info.satelliteTLogReplicationFactor = 3; + info.satelliteTLogUsableDcs = 1; + info.satelliteTLogWriteAntiQuorum = 0; + info.satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(3, "zoneid", IRepPolicyRef(new PolicyOne()))); + } else if(satelliteReplication == "two_satellite_safe") { + info.satelliteTLogReplicationFactor = 4; + info.satelliteTLogUsableDcs = 2; + info.satelliteTLogWriteAntiQuorum = 0; + info.satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "dcid", IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))))); + } else if(satelliteReplication == "two_satellite_fast") { + info.satelliteTLogReplicationFactor = 4; + info.satelliteTLogUsableDcs = 2; + info.satelliteTLogWriteAntiQuorum = 2; + info.satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(2, "dcid", IRepPolicyRef(new PolicyAcross(2, "zoneid", IRepPolicyRef(new PolicyOne()))))); + } else { + throw invalid_option(); + } + } + dc.tryGet("satellite_log_replicas", info.satelliteTLogReplicationFactor); + dc.tryGet("satellite_usable_dcs", info.satelliteTLogUsableDcs); + dc.tryGet("satellite_anti_quorum", info.satelliteTLogWriteAntiQuorum); + json_spirit::mArray satellites; + if( dc.tryGet("satellites", satellites) ) { + for (StatusObjectReader s : satellites) { + SatelliteInfo satInfo; + std::string sidStr; + s.get("id", sidStr); + satInfo.dcId = sidStr; + s.get("priority", satInfo.priority); + info.satellites.push_back(satInfo); + } + std::sort(info.satellites.begin(), info.satellites.end(), SatelliteInfo::sort_by_priority() ); + } + 
regions->push_back(info); + } + std::sort(regions->begin(), regions->end(), RegionInfo::sort_by_priority() ); + } catch( Error &e ) { + regions->clear(); + return; + } +} + void DatabaseConfiguration::setDefaultReplicationPolicy() { if(!storagePolicy) { storagePolicy = IRepPolicyRef(new PolicyAcross(storageTeamSize, "zoneid", IRepPolicyRef(new PolicyOne()))); @@ -75,13 +128,15 @@ void DatabaseConfiguration::setDefaultReplicationPolicy() { if(remoteTLogReplicationFactor > 0 && !remoteTLogPolicy) { remoteTLogPolicy = IRepPolicyRef(new PolicyAcross(remoteTLogReplicationFactor, "zoneid", IRepPolicyRef(new PolicyOne()))); } - if(satelliteTLogReplicationFactor > 0 && !satelliteTLogPolicy) { - satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(satelliteTLogReplicationFactor, "zoneid", IRepPolicyRef(new PolicyOne()))); + for(auto r : regions) { + if(r.satelliteTLogReplicationFactor > 0 && !r.satelliteTLogPolicy) { + r.satelliteTLogPolicy = IRepPolicyRef(new PolicyAcross(r.satelliteTLogReplicationFactor, "zoneid", IRepPolicyRef(new PolicyOne()))); + } } } bool DatabaseConfiguration::isValid() const { - return initialized && + if( !(initialized && tLogWriteAntiQuorum >= 0 && tLogReplicationFactor >= 1 && durableStorageQuorum >= 1 && @@ -100,154 +155,169 @@ bool DatabaseConfiguration::isValid() const { getDesiredRemoteLogs() >= 1 && getDesiredLogRouters() >= 1 && remoteTLogReplicationFactor >= 0 && - remoteDcIds.size() < 2 && - ( remoteTLogReplicationFactor == 0 || ( remoteTLogPolicy && primaryDcId.present() && remoteDcIds.size() && durableStorageQuorum == storageTeamSize ) ) && - primaryDcId.present() == remoteDcIds.size() && - getDesiredSatelliteLogs() >= 1 && - satelliteTLogReplicationFactor >= 0 && - satelliteTLogWriteAntiQuorum >= 0 && - satelliteTLogUsableDcs >= 0 && - ( satelliteTLogReplicationFactor == 0 || ( satelliteTLogPolicy && primarySatelliteDcIds.size() && remoteSatelliteDcIds.size() && remoteTLogReplicationFactor > 0 ) ) && - primarySatelliteDcIds.size() == remoteSatelliteDcIds.size(); + (regions.size() == 0 || regions.size() == 2) && + ( remoteTLogReplicationFactor == 0 || ( remoteTLogPolicy && regions.size() == 2 && durableStorageQuorum == storageTeamSize ) ) ) ) { + return false; + } + + std::set dcIds; + std::set priorities; + dcIds.insert(Key()); + for(auto r : regions) { + if( !(!dcIds.count(r.dcId) && + !priorities.count(r.priority) && + r.satelliteTLogReplicationFactor >= 0 && + r.satelliteTLogWriteAntiQuorum >= 0 && + r.satelliteTLogUsableDcs >= 0 && + ( r.satelliteTLogReplicationFactor == 0 || ( r.satelliteTLogPolicy && r.satellites.size() ) ) ) ) { + return false; + } + dcIds.insert(r.dcId); + priorities.insert(r.priority); + for(auto s : r.satellites) { + if(dcIds.count(s.dcId)) { + return false; + } + dcIds.insert(s.dcId); + } + } + + return true; } -std::map DatabaseConfiguration::toMap() const { - std::map result; +StatusObject DatabaseConfiguration::toJSON(bool noPolicies) const { + StatusObject result; if( initialized ) { std::string tlogInfo = tLogPolicy->info(); std::string storageInfo = storagePolicy->info(); - if( durableStorageQuorum == storageTeamSize && - tLogWriteAntiQuorum == 0 ) { - if( tLogReplicationFactor == 1 && durableStorageQuorum == 1 ) + bool customRedundancy = false; + if( durableStorageQuorum == storageTeamSize && tLogWriteAntiQuorum == 0 ) { + if( tLogReplicationFactor == 1 && durableStorageQuorum == 1 ) { result["redundancy_mode"] = "single"; - else if( tLogReplicationFactor == 2 && durableStorageQuorum == 2 ) + } else if( tLogReplicationFactor 
== 2 && durableStorageQuorum == 2 ) { result["redundancy_mode"] = "double"; - else if( tLogReplicationFactor == 3 && durableStorageQuorum == 3 && tlogInfo == "((dcid^3 x 1) & (zoneid^3 x 1))" && storageInfo == "((dcid^3 x 1) & (zoneid^3 x 1))" ) + } else if( tLogReplicationFactor == 3 && durableStorageQuorum == 3 && tlogInfo == "((dcid^3 x 1) & (zoneid^3 x 1))" && storageInfo == "((dcid^3 x 1) & (zoneid^3 x 1))" ) { result["redundancy_mode"] = "three_datacenter"; - else if( tLogReplicationFactor == 3 && durableStorageQuorum == 3 ) + } else if( tLogReplicationFactor == 3 && durableStorageQuorum == 3 ) { result["redundancy_mode"] = "triple"; - else if( tLogReplicationFactor == 4 && durableStorageQuorum == 3 && tlogInfo == "data_hall^2 x zoneid^2 x 1" && storageInfo == "data_hall^3 x 1" ) + } else if( tLogReplicationFactor == 4 && durableStorageQuorum == 3 && tlogInfo == "data_hall^2 x zoneid^2 x 1" && storageInfo == "data_hall^3 x 1" ) { result["redundancy_mode"] = "three_data_hall"; - else if( tLogReplicationFactor == 4 && durableStorageQuorum == 6 && tlogInfo == "dcid^2 x zoneid^2 x 1" && storageInfo == "dcid^3 x zoneid^2 x 1" ) + } else if( tLogReplicationFactor == 4 && durableStorageQuorum == 6 && tlogInfo == "dcid^2 x zoneid^2 x 1" && storageInfo == "dcid^3 x zoneid^2 x 1" ) { result["redundancy_mode"] = "multi_dc"; - else - result["redundancy_mode"] = "custom"; - } else - result["redundancy_mode"] = "custom"; - - if( tLogDataStoreType == KeyValueStoreType::SSD_BTREE_V1 && storageServerStoreType == KeyValueStoreType::SSD_BTREE_V1) - result["storage_engine"] = "ssd-1"; - else if (tLogDataStoreType == KeyValueStoreType::SSD_BTREE_V2 && storageServerStoreType == KeyValueStoreType::SSD_BTREE_V2) - result["storage_engine"] = "ssd-2"; - else if( tLogDataStoreType == KeyValueStoreType::MEMORY && storageServerStoreType == KeyValueStoreType::MEMORY ) - result["storage_engine"] = "memory"; - else - result["storage_engine"] = "custom"; - - if(primaryDcId.present()) { - result["primary_dc"] = printable(primaryDcId.get()); - } - if(remoteDcIds.size()) { - std::string remoteDcStr = ""; - bool first = true; - for(auto& it : remoteDcIds) { - if(it.present()) { - if(!first) { - remoteDcStr += ","; - first = false; - } - remoteDcStr += printable(it.get()); - } + } else { + customRedundancy = true; } - result["remote_dcs"] = remoteDcStr; - } - if(primarySatelliteDcIds.size()) { - std::string primaryDcStr = ""; - bool first = true; - for(auto& it : primarySatelliteDcIds) { - if(it.present()) { - if(!first) { - primaryDcStr += ","; - first = false; - } - primaryDcStr += printable(it.get()); - } - } - result["primary_satellite_dcs"] = primaryDcStr; - } - if(remoteSatelliteDcIds.size()) { - std::string remoteDcStr = ""; - bool first = true; - for(auto& it : remoteSatelliteDcIds) { - if(it.present()) { - if(!first) { - remoteDcStr += ","; - first = false; - } - remoteDcStr += printable(it.get()); - } - } - result["remote_satellite_dcs"] = remoteDcStr; - } - - if(satelliteTLogReplicationFactor == 1 && satelliteTLogUsableDcs == 1 && satelliteTLogWriteAntiQuorum == 0) { - result["satellite_redundancy_mode"] = "one_satellite_single"; - } else if(satelliteTLogReplicationFactor == 2 && satelliteTLogUsableDcs == 1 && satelliteTLogWriteAntiQuorum == 0) { - result["satellite_redundancy_mode"] = "one_satellite_double"; - } else if(satelliteTLogReplicationFactor == 3 && satelliteTLogUsableDcs == 1 && satelliteTLogWriteAntiQuorum == 0) { - result["satellite_redundancy_mode"] = "one_satellite_triple"; - } else 
if(satelliteTLogReplicationFactor == 4 && satelliteTLogUsableDcs == 2 && satelliteTLogWriteAntiQuorum == 0) { - result["satellite_redundancy_mode"] = "two_satellite_safe"; - } else if(satelliteTLogReplicationFactor == 4 && satelliteTLogUsableDcs == 2 && satelliteTLogWriteAntiQuorum == 2) { - result["satellite_redundancy_mode"] = "two_satellite_fast"; - } else if(satelliteTLogReplicationFactor == 0) { - result["satellite_redundancy_mode"] = "none"; } else { - result["satellite_redundancy_mode"] = "custom"; + customRedundancy = true; } - if( remoteTLogReplicationFactor == 1 ) { + if(customRedundancy) { + result["storage_replicas"] = storageTeamSize; + result["storage_quorum"] = durableStorageQuorum; + result["log_replicas"] = tLogReplicationFactor; + result["log_anti_quorum"] = tLogWriteAntiQuorum; + if(!noPolicies) result["storage_replication_policy"] = storagePolicy->info(); + if(!noPolicies) result["log_replication_policy"] = tLogPolicy->info(); + } + + if( tLogDataStoreType == KeyValueStoreType::SSD_BTREE_V1 && storageServerStoreType == KeyValueStoreType::SSD_BTREE_V1) { + result["storage_engine"] = "ssd-1"; + } else if (tLogDataStoreType == KeyValueStoreType::SSD_BTREE_V2 && storageServerStoreType == KeyValueStoreType::SSD_BTREE_V2) { + result["storage_engine"] = "ssd-2"; + } else if( tLogDataStoreType == KeyValueStoreType::MEMORY && storageServerStoreType == KeyValueStoreType::MEMORY ) { + result["storage_engine"] = "memory"; + } + + if( remoteTLogReplicationFactor == 0 ) { + result["remote_redundancy_mode"] = "remote_none"; + } else if( remoteTLogReplicationFactor == 1 ) { result["remote_redundancy_mode"] = "remote_single"; } else if( remoteTLogReplicationFactor == 2 ) { result["remote_redundancy_mode"] = "remote_double"; } else if( remoteTLogReplicationFactor == 3 ) { result["remote_redundancy_mode"] = "remote_triple"; - } else if(remoteTLogReplicationFactor == 0) { - result["remote_redundancy_mode"] = "none"; } else { - result["remote_redundancy_mode"] = "custom"; + result["remote_log_replicas"] = remoteTLogReplicationFactor; + if(noPolicies && remoteTLogPolicy) result["remote_log_policy"] = remoteTLogPolicy->info(); } - if( desiredTLogCount != -1 ) - result["logs"] = format("%d", desiredTLogCount); + if(regions.size()) { + StatusArray regionArr; + for( auto r : regions) { + StatusObject dcObj; + dcObj["id"] = r.dcId.toString(); + dcObj["priority"] = r.priority; - if( desiredTLogCount != -1 ) - result["remote_logs"] = format("%d", remoteDesiredTLogCount); + if(r.satelliteTLogReplicationFactor == 1 && r.satelliteTLogUsableDcs == 1 && r.satelliteTLogWriteAntiQuorum == 0) { + dcObj["satellite_redundancy_mode"] = "one_satellite_single"; + } else if(r.satelliteTLogReplicationFactor == 2 && r.satelliteTLogUsableDcs == 1 && r.satelliteTLogWriteAntiQuorum == 0) { + dcObj["satellite_redundancy_mode"] = "one_satellite_double"; + } else if(r.satelliteTLogReplicationFactor == 3 && r.satelliteTLogUsableDcs == 1 && r.satelliteTLogWriteAntiQuorum == 0) { + dcObj["satellite_redundancy_mode"] = "one_satellite_triple"; + } else if(r.satelliteTLogReplicationFactor == 4 && r.satelliteTLogUsableDcs == 2 && r.satelliteTLogWriteAntiQuorum == 0) { + dcObj["satellite_redundancy_mode"] = "two_satellite_safe"; + } else if(r.satelliteTLogReplicationFactor == 4 && r.satelliteTLogUsableDcs == 2 && r.satelliteTLogWriteAntiQuorum == 2) { + dcObj["satellite_redundancy_mode"] = "two_satellite_fast"; + } else if(r.satelliteTLogReplicationFactor != 0) { + dcObj["satellite_log_replicas"] = 
r.satelliteTLogReplicationFactor; + dcObj["satellite_usable_dcs"] = r.satelliteTLogUsableDcs; + dcObj["satellite_anti_quorum"] = r.satelliteTLogWriteAntiQuorum; + if(r.satelliteTLogPolicy) dcObj["satellite_log_policy"] = r.satelliteTLogPolicy->info(); + } - if( desiredTLogCount != -1 ) - result["satellite_logs"] = format("%d", satelliteDesiredTLogCount); + if( r.satelliteDesiredTLogCount != -1 ) { + dcObj["satellite_logs"] = r.satelliteDesiredTLogCount; + } - if( masterProxyCount != -1 ) - result["proxies"] = format("%d", masterProxyCount); + if(r.satellites.size()) { + StatusArray satellitesArr; + for(auto s : r.satellites) { + StatusObject satObj; + satObj["id"] = s.dcId.toString(); + satObj["priority"] = s.priority; - if( resolverCount != -1 ) - result["resolvers"] = format("%d", resolverCount); + satellitesArr.push_back(satObj); + } + dcObj["satellites"] = satellitesArr; + } + + regionArr.push_back(dcObj); + } + result["regions"] = regionArr; + } + + if( desiredTLogCount != -1 ) { + result["logs"] = desiredTLogCount; + } + if( masterProxyCount != -1 ) { + result["proxies"] = masterProxyCount; + } + if( resolverCount != -1 ) { + result["resolvers"] = resolverCount; + } + if( remoteDesiredTLogCount != -1 ) { + result["remote_logs"] = remoteDesiredTLogCount; + } + if( desiredLogRouterCount != -1 ) { + result["log_routers"] = desiredLogRouterCount; + } + if( autoMasterProxyCount != CLIENT_KNOBS->DEFAULT_AUTO_PROXIES ) { + result["auto_proxies"] = autoMasterProxyCount; + } + if (autoResolverCount != CLIENT_KNOBS->DEFAULT_AUTO_RESOLVERS) { + result["auto_resolvers"] = autoResolverCount; + } + if (autoDesiredTLogCount != CLIENT_KNOBS->DEFAULT_AUTO_LOGS) { + result["auto_logs"] = autoDesiredTLogCount; + } } return result; } std::string DatabaseConfiguration::toString() const { - std::string result; - std::map config = toMap(); - - for(auto itr : config) { - result += itr.first + "=" + itr.second; - result += ";"; - } - - return result.substr(0, result.length()-1); + return json_spirit::write_string(json_spirit::mValue(toJSON()), json_spirit::Output_options::none); } bool DatabaseConfiguration::setInternal(KeyRef key, ValueRef value) { @@ -272,16 +342,8 @@ bool DatabaseConfiguration::setInternal(KeyRef key, ValueRef value) { else if (ck == LiteralStringRef("remote_logs")) parse(&remoteDesiredTLogCount, value); else if (ck == LiteralStringRef("remote_log_replicas")) parse(&remoteTLogReplicationFactor, value); else if (ck == LiteralStringRef("remote_log_policy")) parseReplicationPolicy(&remoteTLogPolicy, value); - else if (ck == LiteralStringRef("satellite_log_policy")) parseReplicationPolicy(&satelliteTLogPolicy, value); - else if (ck == LiteralStringRef("satellite_logs")) parse(&satelliteDesiredTLogCount, value); - else if (ck == LiteralStringRef("satellite_log_replicas")) parse(&satelliteTLogReplicationFactor, value); - else if (ck == LiteralStringRef("satellite_anti_quorum")) parse(&satelliteTLogWriteAntiQuorum, value); - else if (ck == LiteralStringRef("satellite_usable_dcs")) parse(&satelliteTLogUsableDcs, value); - else if (ck == LiteralStringRef("primary_dc")) primaryDcId = value; - else if (ck == LiteralStringRef("remote_dcs")) parse(&remoteDcIds, value); - else if (ck == LiteralStringRef("primary_satellite_dcs")) parse(&primarySatelliteDcIds, value); - else if (ck == LiteralStringRef("remote_satellite_dcs")) parse(&remoteSatelliteDcIds, value); else if (ck == LiteralStringRef("log_routers")) parse(&desiredLogRouterCount, value); + else if (ck == LiteralStringRef("regions")) parse(®ions, 
value); else return false; return true; // All of the above options currently require recovery to take effect } diff --git a/fdbserver/DatabaseConfiguration.h b/fdbserver/DatabaseConfiguration.h index c1c560f279..00f84c378d 100644 --- a/fdbserver/DatabaseConfiguration.h +++ b/fdbserver/DatabaseConfiguration.h @@ -25,9 +25,50 @@ #include "fdbclient/FDBTypes.h" #include "fdbclient/CommitTransaction.h" #include "fdbrpc/ReplicationPolicy.h" +#include "fdbclient/Status.h" // SOMEDAY: Buggify DatabaseConfiguration +struct SatelliteInfo { + Key dcId; + int32_t priority; + + SatelliteInfo() : priority(0) {} + + struct sort_by_priority { + bool operator ()(SatelliteInfo const&a, SatelliteInfo const& b) const { return a.priority > b.priority; } + }; + + template + void serialize(Ar& ar) { + ar & dcId & priority; + } +}; + +struct RegionInfo { + Key dcId; + int32_t priority; + + IRepPolicyRef satelliteTLogPolicy; + int32_t satelliteDesiredTLogCount; + int32_t satelliteTLogReplicationFactor; + int32_t satelliteTLogWriteAntiQuorum; + int32_t satelliteTLogUsableDcs; + + std::vector satellites; + + RegionInfo() : priority(0), satelliteDesiredTLogCount(-1), satelliteTLogReplicationFactor(0), satelliteTLogWriteAntiQuorum(0), satelliteTLogUsableDcs(0) {} + + struct sort_by_priority { + bool operator ()(RegionInfo const&a, RegionInfo const& b) const { return a.priority > b.priority; } + }; + + template + void serialize(Ar& ar) { + ar & dcId & priority & satelliteTLogPolicy & satelliteDesiredTLogCount & satelliteTLogReplicationFactor & satelliteTLogWriteAntiQuorum & & satelliteTLogUsableDcs & satellites; + } +}; + struct DatabaseConfiguration { DatabaseConfiguration(); @@ -41,13 +82,27 @@ struct DatabaseConfiguration { bool initialized; std::string toString() const; - std::map toMap() const; - int expectedLogSets() { + StatusObject toJSON(bool noPolicies = false) const; + + RegionInfo getRegion( Optional dcId ) const { + if(!dcId.present()) { + return RegionInfo(); + } + for(auto& r : regions) { + if(r.dcId == dcId.get()) { + return r; + } + } + return RegionInfo(); + } + + int expectedLogSets( Optional dcId ) const { int result = 1; - if( satelliteTLogReplicationFactor > 0) { + if(dcId.present() && getRegion(dcId.get()).satelliteTLogReplicationFactor > 0) { result++; } - if( remoteTLogReplicationFactor > 0) { + + if(remoteTLogReplicationFactor > 0) { result++; } return result; @@ -55,17 +110,30 @@ struct DatabaseConfiguration { // SOMEDAY: think about changing storageTeamSize to durableStorageQuorum int32_t minDatacentersRequired() const { - if(!primaryDcId.present()) return 1; - return 2 + primarySatelliteDcIds.size() + remoteSatelliteDcIds.size(); + int minRequired = 0; + for(auto r : regions) { + minRequired += 1 + r.satellites.size(); + } + return minRequired; + } + int32_t minMachinesRequiredPerDatacenter() const { + int minRequired = std::max( remoteTLogReplicationFactor, std::max(tLogReplicationFactor, storageTeamSize) ); + for(auto r : regions) { + minRequired = std::max( minRequired, r.satelliteTLogReplicationFactor/std::max(1, r.satelliteTLogUsableDcs) ); + } + return minRequired; } - int32_t minMachinesRequiredPerDatacenter() const { return std::max( satelliteTLogReplicationFactor/std::max(1,satelliteTLogUsableDcs), std::max( remoteTLogReplicationFactor, std::max(tLogReplicationFactor, storageTeamSize) ) ); } //Killing an entire datacenter counts as killing one machine in modes that support it int32_t maxMachineFailuresTolerated() const { - if(remoteTLogReplicationFactor > 0 && 
satelliteTLogReplicationFactor > 0) { - return 1 + std::min(std::max(tLogReplicationFactor - 1 - tLogWriteAntiQuorum, satelliteTLogReplicationFactor - 1 - satelliteTLogWriteAntiQuorum), durableStorageQuorum - 1); - } else if(satelliteTLogReplicationFactor > 0) { - return std::min(tLogReplicationFactor + satelliteTLogReplicationFactor - 2 - tLogWriteAntiQuorum - satelliteTLogWriteAntiQuorum, durableStorageQuorum - 1); + int worstSatellite = regions.size() ? std::numeric_limits::max() : 0; + for(auto r : regions) { + worstSatellite = std::min(worstSatellite, r.satelliteTLogReplicationFactor - r.satelliteTLogWriteAntiQuorum); + } + if(remoteTLogReplicationFactor > 0 && worstSatellite > 0) { + return 1 + std::min(std::max(tLogReplicationFactor - 1 - tLogWriteAntiQuorum, worstSatellite - 1), durableStorageQuorum - 1); + } else if(worstSatellite > 0) { + return std::min(tLogReplicationFactor + worstSatellite - 2 - tLogWriteAntiQuorum, durableStorageQuorum - 1); } return std::min(tLogReplicationFactor - 1 - tLogWriteAntiQuorum, durableStorageQuorum - 1); } @@ -85,7 +153,6 @@ struct DatabaseConfiguration { int32_t tLogWriteAntiQuorum; int32_t tLogReplicationFactor; KeyValueStoreType tLogDataStoreType; - Optional> primaryDcId; // Storage Servers IRepPolicyRef storagePolicy; @@ -98,16 +165,9 @@ struct DatabaseConfiguration { int32_t remoteTLogReplicationFactor; int32_t desiredLogRouterCount; IRepPolicyRef remoteTLogPolicy; - std::vector>> remoteDcIds; - // Satellite TLogs - IRepPolicyRef satelliteTLogPolicy; - int32_t satelliteDesiredTLogCount; - int32_t satelliteTLogReplicationFactor; - int32_t satelliteTLogWriteAntiQuorum; - int32_t satelliteTLogUsableDcs; - std::vector>> primarySatelliteDcIds; - std::vector>> remoteSatelliteDcIds; + //Data centers + std::vector regions; // Excluded servers (no state should be here) bool isExcludedServer( NetworkAddress ) const; @@ -116,9 +176,12 @@ struct DatabaseConfiguration { int32_t getDesiredProxies() const { if(masterProxyCount == -1) return autoMasterProxyCount; return masterProxyCount; } int32_t getDesiredResolvers() const { if(resolverCount == -1) return autoResolverCount; return resolverCount; } int32_t getDesiredLogs() const { if(desiredTLogCount == -1) return autoDesiredTLogCount; return desiredTLogCount; } - int32_t getDesiredSatelliteLogs() const { if(satelliteDesiredTLogCount == -1) return autoDesiredTLogCount; return satelliteDesiredTLogCount; } int32_t getDesiredRemoteLogs() const { if(remoteDesiredTLogCount == -1) return autoDesiredTLogCount; return remoteDesiredTLogCount; } int32_t getDesiredLogRouters() const { if(desiredLogRouterCount == -1) return getDesiredRemoteLogs(); return desiredLogRouterCount; } + int32_t getDesiredSatelliteLogs( Optional dcId ) const { + auto desired = getRegion(dcId).satelliteDesiredTLogCount; + if(desired == -1) return autoDesiredTLogCount; return desired; + } bool operator == ( DatabaseConfiguration const& rhs ) const { const_cast(this)->makeConfigurationImmutable(); diff --git a/fdbserver/SimulatedCluster.actor.cpp b/fdbserver/SimulatedCluster.actor.cpp index d96fb7988e..95e93ea88d 100644 --- a/fdbserver/SimulatedCluster.actor.cpp +++ b/fdbserver/SimulatedCluster.actor.cpp @@ -660,9 +660,6 @@ struct SimulationConfig { int machine_count; // Total, not per DC. 
int processes_per_machine; int coordinators; - - std::string toString(); - private: void generateNormalConfig(int minimumReplication); }; @@ -685,7 +682,7 @@ StringRef StringRefOf(const char* s) { void SimulationConfig::generateNormalConfig(int minimumReplication) { set_config("new"); - bool generateFearless = false; //FIXME: g_random->random01() < 0.5; + bool generateFearless = false; //FIXME g_random->random01() < 0.5; datacenters = generateFearless ? 4 : g_random->randomInt( 1, 4 ); if (g_random->random01() < 0.25) db.desiredTLogCount = g_random->randomInt(1,7); if (g_random->random01() < 0.25) db.masterProxyCount = g_random->randomInt(1,7); @@ -744,24 +741,104 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { } if(generateFearless || (datacenters == 2 && g_random->random01() < 0.5)) { - db.primaryDcId = LiteralStringRef("0"); - db.remoteDcIds.resize(1); - db.remoteDcIds[0] = LiteralStringRef("1"); - } + StatusObject primaryObj; + primaryObj["id"] = "0"; + primaryObj["priority"] = 1; - if(generateFearless) { - db.primarySatelliteDcIds.resize(1); - db.primarySatelliteDcIds[0] = LiteralStringRef("2"); - db.remoteSatelliteDcIds.resize(1); - db.remoteSatelliteDcIds[0] = LiteralStringRef("3"); + StatusObject remoteObj; + remoteObj["id"] = "1"; + remoteObj["priority"] = 0; + if(generateFearless) { + StatusObject primarySatelliteObj; + primarySatelliteObj["id"] = "2"; + primarySatelliteObj["priority"] = 1; + StatusArray primarySatellitesArr; + primarySatellitesArr.push_back(primarySatelliteObj); + primaryObj["satellites"] = primarySatellitesArr; - //FIXME: random setups - set_config("remote_single"); - set_config("one_satellite_single"); + StatusObject remoteSatelliteObj; + remoteSatelliteObj["id"] = "3"; + remoteSatelliteObj["priority"] = 1; + StatusArray remoteSatellitesArr; + remoteSatellitesArr.push_back(remoteSatelliteObj); + remoteObj["satellites"] = remoteSatellitesArr; - db.remoteDesiredTLogCount = 1; - db.desiredLogRouterCount = 1; - db.satelliteDesiredTLogCount = 1; + int satellite_replication_type = 2;//FIXME: g_random->randomInt(0,5); + switch (satellite_replication_type) { + case 0: { + TEST( true ); // Simulated cluster using custom satellite redundancy mode + break; + } + case 1: { + TEST( true ); // Simulated cluster using no satellite redundancy mode + break; + } + case 2: { + TEST( true ); // Simulated cluster using single satellite redundancy mode + primaryObj["satellite_redundancy_mode"] = "one_satellite_single"; + remoteObj["satellite_redundancy_mode"] = "one_satellite_single"; + break; + } + case 3: { + TEST( true ); // Simulated cluster using double satellite redundancy mode + primaryObj["satellite_redundancy_mode"] = "one_satellite_double"; + remoteObj["satellite_redundancy_mode"] = "one_satellite_double"; + break; + } + case 4: { + TEST( true ); // Simulated cluster using triple satellite redundancy mode + primaryObj["satellite_redundancy_mode"] = "one_satellite_triple"; + remoteObj["satellite_redundancy_mode"] = "one_satellite_triple"; + break; + } + default: + ASSERT(false); // Programmer forgot to adjust cases. 
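// Illustrative sketch (an assumption drawn from the code above, not part of the patch itself):
// with satellite_replication_type hard-coded to 2, both region objects carry
// "one_satellite_single", and the "regions=" string assembled below from primaryObj and
// remoteObj serializes to roughly the following JSON. Key order and optional fields such as
// satellite_logs depend on the random knobs, so treat this as an example rather than the
// exact output:
//
//   regions=[{"id":"0","priority":1,"satellite_redundancy_mode":"one_satellite_single",
//             "satellites":[{"id":"2","priority":1}]},
//            {"id":"1","priority":0,"satellite_redundancy_mode":"one_satellite_single",
//             "satellites":[{"id":"3","priority":1}]}]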
+ } + + if (g_random->random01() < 0.25) { + int logs = g_random->randomInt(1,7); + primaryObj["satellite_logs"] = logs; + remoteObj["satellite_logs"] = logs; + } + + int remote_replication_type = 2;//FIXME: g_random->randomInt(0,5); + switch (remote_replication_type) { + case 0: { + TEST( true ); // Simulated cluster using custom remote redundancy mode + break; + } + case 1: { + TEST( true ); // Simulated cluster using no remote redundancy mode + break; + } + case 2: { + TEST( true ); // Simulated cluster using single remote redundancy mode + set_config("remote_single"); + break; + } + case 3: { + TEST( true ); // Simulated cluster using double remote redundancy mode + set_config("remote_double"); + break; + } + case 4: { + TEST( true ); // Simulated cluster using triple remote redundancy mode + set_config("remote_triple"); + break; + } + default: + ASSERT(false); // Programmer forgot to adjust cases. + } + + if (g_random->random01() < 0.25) db.remoteDesiredTLogCount = g_random->randomInt(1,7); + if (g_random->random01() < 0.25) db.desiredLogRouterCount = g_random->randomInt(1,7); + } + + StatusArray regionArr; + regionArr.push_back(primaryObj); + regionArr.push_back(remoteObj); + + set_config("regions=" + json_spirit::write_string(json_spirit::mValue(regionArr), json_spirit::Output_options::none)); } if(generateFearless) { @@ -790,96 +867,55 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { } } -std::string SimulationConfig::toString() { - std::stringstream config; - std::map&& dbconfig = db.toMap(); - config << "new"; - - if (dbconfig["redundancy_mode"] != "custom") { - config << " " << dbconfig["redundancy_mode"]; - } else { - config << " " << "log_replicas:=" << db.tLogReplicationFactor; - config << " " << "log_anti_quorum:=" << db.tLogWriteAntiQuorum; - config << " " << "storage_replicas:=" << db.storageTeamSize; - config << " " << "storage_quorum:=" << db.durableStorageQuorum; - } - - if(dbconfig["remote_redundancy_mode"] != "none") { - if (dbconfig["remote_redundancy_mode"] != "custom") { - config << " " << dbconfig["remote_redundancy_mode"]; - } else { - config << " " << "remote_log_replicas:=" << db.remoteTLogReplicationFactor; - } - } - - if(dbconfig["satellite_redundancy_mode"] != "none") { - if (dbconfig["satellite_redundancy_mode"] != "custom") { - config << " " << dbconfig["satellite_redundancy_mode"]; - } else { - config << " " << "satellite_log_replicas:=" << db.satelliteTLogReplicationFactor; - config << " " << "satellite_anti_quorum:=" << db.satelliteTLogWriteAntiQuorum; - config << " " << "satellite_usable_dcs:=" << db.satelliteTLogUsableDcs; - } - } - - config << " logs=" << db.getDesiredLogs(); - config << " proxies=" << db.getDesiredProxies(); - config << " resolvers=" << db.getDesiredResolvers(); - - if(db.remoteTLogReplicationFactor > 0) { - config << " remote_logs=" << db.getDesiredRemoteLogs(); - config << " log_routers=" << db.getDesiredLogRouters(); - } - - if(db.satelliteTLogReplicationFactor > 0) { - config << " satellite_logs=" << db.getDesiredSatelliteLogs(); - } - - if(db.primaryDcId.present()) { - config << " primary_dc=" << db.primaryDcId.get().printable(); - config << " remote_dcs=" << db.remoteDcIds[0].get().printable(); - } - - if(db.primarySatelliteDcIds.size()) { - config << " primary_satellite_dcs=" << db.primarySatelliteDcIds[0].get().printable(); - for(int i = 1; i < db.primarySatelliteDcIds.size(); i++) { - config << "," << db.primarySatelliteDcIds[i].get().printable(); - } - config << " remote_satellite_dcs=" << 
db.remoteSatelliteDcIds[0].get().printable(); - for(int i = 1; i < db.remoteSatelliteDcIds.size(); i++) { - config << "," << db.remoteSatelliteDcIds[i].get().printable(); - } - } - - config << " " << dbconfig["storage_engine"]; - return config.str(); -} - void setupSimulatedSystem( vector> *systemActors, std::string baseFolder, int* pTesterCount, Optional *pConnString, Standalone *pStartingConfiguration, int extraDB, int minimumReplication) { // SOMEDAY: this does not test multi-interface configurations SimulationConfig simconfig(extraDB, minimumReplication); - std::string startingConfigString = simconfig.toString(); + StatusObject startingConfigJSON = simconfig.db.toJSON(true); + std::string startingConfigString = "new"; + for( auto kv : startingConfigJSON) { + startingConfigString += " "; + if( kv.second.type() == json_spirit::int_type ) { + startingConfigString += kv.first + ":=" + format("%d", kv.second.get_int()); + } else if( kv.second.type() == json_spirit::str_type ) { + startingConfigString += kv.second.get_str(); + } else if( kv.second.type() == json_spirit::array_type ) { + startingConfigString += kv.first + "=" + json_spirit::write_string(json_spirit::mValue(kv.second.get_array()), json_spirit::Output_options::none); + } else { + ASSERT(false); + } + } g_simulator.storagePolicy = simconfig.db.storagePolicy; g_simulator.tLogPolicy = simconfig.db.tLogPolicy; g_simulator.tLogWriteAntiQuorum = simconfig.db.tLogWriteAntiQuorum; - g_simulator.primaryDcId = simconfig.db.primaryDcId; g_simulator.hasRemoteReplication = simconfig.db.remoteTLogReplicationFactor > 0; g_simulator.remoteTLogPolicy = simconfig.db.remoteTLogPolicy; - if(simconfig.db.remoteDcIds.size()) g_simulator.remoteDcId = simconfig.db.remoteDcIds[0]; - g_simulator.hasSatelliteReplication = simconfig.db.satelliteTLogReplicationFactor > 0; - g_simulator.satelliteTLogPolicy = simconfig.db.satelliteTLogPolicy; - g_simulator.satelliteTLogWriteAntiQuorum = simconfig.db.satelliteTLogWriteAntiQuorum; - g_simulator.primarySatelliteDcIds = simconfig.db.primarySatelliteDcIds; - g_simulator.remoteSatelliteDcIds = simconfig.db.remoteSatelliteDcIds; + if(simconfig.db.regions.size() == 2) { + g_simulator.primaryDcId = simconfig.db.regions[0].dcId; + g_simulator.remoteDcId = simconfig.db.regions[1].dcId; + g_simulator.hasSatelliteReplication = simconfig.db.regions[0].satelliteTLogReplicationFactor > 0 && simconfig.db.regions[0].satelliteTLogPolicy == simconfig.db.regions[1].satelliteTLogPolicy; + g_simulator.satelliteTLogPolicy = simconfig.db.regions[0].satelliteTLogPolicy; + g_simulator.satelliteTLogWriteAntiQuorum = simconfig.db.regions[0].satelliteTLogWriteAntiQuorum; + + for(auto s : simconfig.db.regions[0].satellites) { + g_simulator.primarySatelliteDcIds.push_back(s.dcId); + } + for(auto s : simconfig.db.regions[1].satellites) { + g_simulator.remoteSatelliteDcIds.push_back(s.dcId); + } + } else { + g_simulator.hasSatelliteReplication = false; + g_simulator.satelliteTLogWriteAntiQuorum = 0; + } + ASSERT(g_simulator.storagePolicy && g_simulator.tLogPolicy); ASSERT(!g_simulator.hasRemoteReplication || g_simulator.remoteTLogPolicy); ASSERT(!g_simulator.hasSatelliteReplication || g_simulator.satelliteTLogPolicy); - TraceEvent("simulatorConfig").detail("ConfigString", startingConfigString); + TraceEvent("simulatorConfig").detail("ConfigString", printable(StringRef(startingConfigString))); const int dataCenters = simconfig.datacenters; const int machineCount = simconfig.machine_count; @@ -919,7 +955,7 @@ void setupSimulatedSystem( vector> 
*systemActors, std::string baseF *pConnString = conn; - TraceEvent("SimulatedConnectionString").detail("String", conn.toString()).detail("ConfigString", startingConfigString); + TraceEvent("SimulatedConnectionString").detail("String", conn.toString()).detail("ConfigString", printable(StringRef(startingConfigString))); int assignedMachines = 0, nonVersatileMachines = 0; for( int dc = 0; dc < dataCenters; dc++ ) { diff --git a/fdbserver/Status.actor.cpp b/fdbserver/Status.actor.cpp index 5be54d6ea3..3de4a4f212 100644 --- a/fdbserver/Status.actor.cpp +++ b/fdbserver/Status.actor.cpp @@ -1103,25 +1103,9 @@ ACTOR static Future> loadConfiguration(Database static StatusObject configurationFetcher(Optional conf, ServerCoordinators coordinators, std::set *incomplete_reasons) { StatusObject statusObj; try { - StatusArray coordinatorLeaderServersArr; - vector< ClientLeaderRegInterface > coordinatorLeaderServers = coordinators.clientLeaderServers; - int count = coordinatorLeaderServers.size(); - statusObj["coordinators_count"] = count; - if(conf.present()) { DatabaseConfiguration configuration = conf.get(); - std::map configMap = configuration.toMap(); - for (auto it = configMap.begin(); it != configMap.end(); it++) { - if (it->first == "redundancy_mode") - { - StatusObject redundancyStatusObj; - redundancyStatusObj["factor"] = it->second; - statusObj["redundancy"] = redundancyStatusObj; - } - else { - statusObj[it->first] = it->second; - } - } + statusObj = configuration.toJSON(); StatusArray excludedServersArr; std::set excludedServers = configuration.getExcludedServers(); @@ -1131,31 +1115,11 @@ static StatusObject configurationFetcher(Optional conf, S excludedServersArr.push_back(statusObj); } statusObj["excluded_servers"] = excludedServersArr; - - if (configuration.masterProxyCount != -1) - statusObj["proxies"] = configuration.getDesiredProxies(); - else if (configuration.autoMasterProxyCount != CLIENT_KNOBS->DEFAULT_AUTO_PROXIES) - statusObj["auto_proxies"] = configuration.autoMasterProxyCount; - - if (configuration.resolverCount != -1) - statusObj["resolvers"] = configuration.getDesiredResolvers(); - else if (configuration.autoResolverCount != CLIENT_KNOBS->DEFAULT_AUTO_RESOLVERS) - statusObj["auto_resolvers"] = configuration.autoResolverCount; - - if (configuration.desiredTLogCount != -1) - statusObj["logs"] = configuration.getDesiredLogs(); - else if (configuration.autoDesiredTLogCount != CLIENT_KNOBS->DEFAULT_AUTO_LOGS) - statusObj["auto_logs"] = configuration.autoDesiredTLogCount; - - statusObj["remote_logs"] = configuration.remoteDesiredTLogCount; - - if(configuration.storagePolicy) { - statusObj["storage_policy"] = configuration.storagePolicy->info(); - } - if(configuration.tLogPolicy) { - statusObj["tlog_policy"] = configuration.tLogPolicy->info(); - } } + StatusArray coordinatorLeaderServersArr; + vector< ClientLeaderRegInterface > coordinatorLeaderServers = coordinators.clientLeaderServers; + int count = coordinatorLeaderServers.size(); + statusObj["coordinators_count"] = count; } catch (Error &e){ incomplete_reasons->insert("Could not retrieve all configuration status information."); diff --git a/fdbserver/TagPartitionedLogSystem.actor.cpp b/fdbserver/TagPartitionedLogSystem.actor.cpp index 0f43d06996..c79b9843d8 100644 --- a/fdbserver/TagPartitionedLogSystem.actor.cpp +++ b/fdbserver/TagPartitionedLogSystem.actor.cpp @@ -1100,11 +1100,13 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCountedtLogs[0]->hasBestPolicy = HasBestPolicyId; logSystem->tLogs[0]->locality = 
primaryLocality; - if(configuration.satelliteTLogReplicationFactor > 0) { + RegionInfo region = configuration.getRegion(recr.dcId); + + if(region.satelliteTLogReplicationFactor > 0) { logSystem->tLogs.push_back( Reference( new LogSet() ) ); - logSystem->tLogs[1]->tLogWriteAntiQuorum = configuration.satelliteTLogWriteAntiQuorum; - logSystem->tLogs[1]->tLogReplicationFactor = configuration.satelliteTLogReplicationFactor; - logSystem->tLogs[1]->tLogPolicy = configuration.satelliteTLogPolicy; + logSystem->tLogs[1]->tLogWriteAntiQuorum = region.satelliteTLogWriteAntiQuorum; + logSystem->tLogs[1]->tLogReplicationFactor = region.satelliteTLogReplicationFactor; + logSystem->tLogs[1]->tLogPolicy = region.satelliteTLogPolicy; logSystem->tLogs[1]->isLocal = true; logSystem->tLogs[1]->hasBestPolicy = HasBestPolicyNone; logSystem->tLogs[1]->locality = -99; @@ -1164,7 +1166,7 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCounted> recoveryComplete; - if(configuration.satelliteTLogReplicationFactor > 0) { + if(region.satelliteTLogReplicationFactor > 0) { state vector> satelliteInitializationReplies; vector< InitializeTLogRequest > sreqs( recr.satelliteTLogs.size() ); diff --git a/fdbserver/masterserver.actor.cpp b/fdbserver/masterserver.actor.cpp index 3e7fa69bc3..91d1d859d8 100644 --- a/fdbserver/masterserver.actor.cpp +++ b/fdbserver/masterserver.actor.cpp @@ -290,28 +290,28 @@ ACTOR Future newResolvers( Reference self, RecruitFromConfigur ACTOR Future newTLogServers( Reference self, RecruitFromConfigurationReply recr, Reference oldLogSystem, vector>* initialConfChanges ) { if(self->configuration.remoteTLogReplicationFactor > 0) { - state Optional primaryDcId = recr.remoteDcId == self->configuration.remoteDcIds[0] ? self->configuration.primaryDcId : self->configuration.remoteDcIds[0]; - if( !self->dcId_locality.count(primaryDcId) ) { - TraceEvent(SevWarnAlways, "UnknownPrimaryDCID", self->dbgid).detail("found", self->dcId_locality.count(primaryDcId)).detail("primaryId", printable(primaryDcId)); + state Optional remoteDcId = self->remoteDcIds.size() ? 
self->remoteDcIds[0] : Optional(); + if( !self->dcId_locality.count(recr.dcId) ) { + TraceEvent(SevWarnAlways, "UnknownPrimaryDCID", self->dbgid).detail("found", self->dcId_locality.count(recr.dcId)).detail("primaryId", printable(recr.dcId)); int8_t loc = self->getNextLocality(); Standalone tr; - tr.set(tr.arena(), tagLocalityListKeyFor(primaryDcId), tagLocalityListValue(loc)); + tr.set(tr.arena(), tagLocalityListKeyFor(recr.dcId), tagLocalityListValue(loc)); initialConfChanges->push_back(tr); - self->dcId_locality[primaryDcId] = loc; + self->dcId_locality[recr.dcId] = loc; } - if( !self->dcId_locality.count(recr.remoteDcId) ) { - TraceEvent(SevWarnAlways, "UnknownRemoteDCID", self->dbgid).detail("remoteFound", self->dcId_locality.count(recr.remoteDcId)).detail("remoteId", printable(recr.remoteDcId)); + if( !self->dcId_locality.count(remoteDcId) ) { + TraceEvent(SevWarnAlways, "UnknownRemoteDCID", self->dbgid).detail("remoteFound", self->dcId_locality.count(remoteDcId)).detail("remoteId", printable(remoteDcId)); int8_t loc = self->getNextLocality(); Standalone tr; - tr.set(tr.arena(), tagLocalityListKeyFor(recr.remoteDcId), tagLocalityListValue(loc)); + tr.set(tr.arena(), tagLocalityListKeyFor(remoteDcId), tagLocalityListValue(loc)); initialConfChanges->push_back(tr); - self->dcId_locality[recr.remoteDcId] = loc; + self->dcId_locality[remoteDcId] = loc; } - Future fRemoteWorkers = brokenPromiseToNever( self->clusterController.recruitRemoteFromConfiguration.getReply( RecruitRemoteFromConfigurationRequest( self->configuration, recr.remoteDcId ) ) ); + Future fRemoteWorkers = brokenPromiseToNever( self->clusterController.recruitRemoteFromConfiguration.getReply( RecruitRemoteFromConfigurationRequest( self->configuration, remoteDcId ) ) ); - Reference newLogSystem = wait( oldLogSystem->newEpoch( recr, fRemoteWorkers, self->configuration, self->cstate.myDBState.recoveryCount + 1, self->dcId_locality[primaryDcId], self->dcId_locality[recr.remoteDcId] ) ); + Reference newLogSystem = wait( oldLogSystem->newEpoch( recr, fRemoteWorkers, self->configuration, self->cstate.myDBState.recoveryCount + 1, self->dcId_locality[recr.dcId], self->dcId_locality[remoteDcId] ) ); self->logSystem = newLogSystem; } else { Reference newLogSystem = wait( oldLogSystem->newEpoch( recr, Never(), self->configuration, self->cstate.myDBState.recoveryCount + 1, tagLocalitySpecial, tagLocalitySpecial ) ); @@ -552,9 +552,9 @@ ACTOR Future recruitEverything( Reference self, vectorprimaryDcId.clear(); self->remoteDcIds.clear(); - if(recruits.remoteDcId.present()) { - self->primaryDcId.push_back(recruits.remoteDcId == self->configuration.remoteDcIds[0] ? self->configuration.primaryDcId : self->configuration.remoteDcIds[0]); - self->remoteDcIds.push_back(recruits.remoteDcId); + if(recruits.dcId.present()) { + self->primaryDcId.push_back(recruits.dcId); + self->remoteDcIds.push_back(recruits.dcId.get() == self->configuration.regions[0].dcId ? 
self->configuration.regions[1].dcId : self->configuration.regions[0].dcId); } TraceEvent("MasterRecoveryState", self->dbgid) @@ -1014,7 +1014,7 @@ ACTOR Future trackTlogRecovery( Reference self, Reference changed = self->logSystem->onCoreStateChanged(); ASSERT( newState.tLogs[0].tLogWriteAntiQuorum == self->configuration.tLogWriteAntiQuorum && newState.tLogs[0].tLogReplicationFactor == self->configuration.tLogReplicationFactor ); - state bool finalUpdate = !newState.oldTLogData.size() && newState.tLogs.size() == self->configuration.expectedLogSets(); + state bool finalUpdate = !newState.oldTLogData.size() && newState.tLogs.size() == self->configuration.expectedLogSets(self->primaryDcId.size() ? self->primaryDcId[0] : Optional()); Void _ = wait( self->cstate.write(newState, finalUpdate) ); if( finalUpdate ) { diff --git a/fdbserver/tester.actor.cpp b/fdbserver/tester.actor.cpp index 3261faaae4..4e2424d033 100644 --- a/fdbserver/tester.actor.cpp +++ b/fdbserver/tester.actor.cpp @@ -1000,12 +1000,12 @@ ACTOR Future reconfigureAfter(Database cx, double time) { TraceEvent(SevWarnAlways, "DisablingFearlessConfiguration"); g_simulator.hasRemoteReplication = false; g_simulator.hasSatelliteReplication = false; - ConfigurationResult::Type _ = wait( changeConfig( cx, "remote_none satellite_none" ) ); + ConfigurationResult::Type _ = wait( changeConfig( cx, "remote_none" ) ); if (g_network->isSimulated() && g_simulator.extraDB) { Reference extraFile(new ClusterConnectionFile(*g_simulator.extraDB)); Reference cluster = Cluster::createCluster(extraFile, -1); Database extraDB = cluster->createDatabase(LiteralStringRef("DB")).get(); - ConfigurationResult::Type _ = wait(changeConfig(extraDB, "remote_none satellite_none")); + ConfigurationResult::Type _ = wait(changeConfig(extraDB, "remote_none")); } } diff --git a/fdbserver/workloads/ConsistencyCheck.actor.cpp b/fdbserver/workloads/ConsistencyCheck.actor.cpp index 2535c02eac..4b405bb37d 100644 --- a/fdbserver/workloads/ConsistencyCheck.actor.cpp +++ b/fdbserver/workloads/ConsistencyCheck.actor.cpp @@ -1068,9 +1068,9 @@ struct ConsistencyCheckWorkload : TestWorkload } } - if((!configuration.primaryDcId.present() && missingStorage.size()) || - (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.primaryDcId) && missingStorage.count(configuration.remoteDcIds[0])) || - (configuration.primaryDcId.present() && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.primaryDcId) || missingStorage.count(configuration.remoteDcIds[0])))) { + if((!configuration.regions.size() && missingStorage.size()) || + (configuration.regions.size() && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.regions[0].dcId) && missingStorage.count(configuration.regions[1].dcId)) || + (configuration.regions.size() && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.regions[0].dcId) || missingStorage.count(configuration.regions[1].dcId)))) { self->testFailure("No storage server on worker"); return false; } From 8c880416080067967272ced5580acfc7e74b1a31 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Tue, 6 Mar 2018 16:31:21 -0800 Subject: [PATCH 006/127] fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations --- fdbserver/ClusterController.actor.cpp | 6 +++++- 
fdbserver/ClusterRecruitmentInterface.h | 9 +++++---- fdbserver/TagPartitionedLogSystem.actor.cpp | 8 ++++++-- fdbserver/masterserver.actor.cpp | 2 +- 4 files changed, 17 insertions(+), 8 deletions(-) diff --git a/fdbserver/ClusterController.actor.cpp b/fdbserver/ClusterController.actor.cpp index f50129a153..e908d42832 100644 --- a/fdbserver/ClusterController.actor.cpp +++ b/fdbserver/ClusterController.actor.cpp @@ -511,7 +511,7 @@ public: result.remoteTLogs.push_back(remoteLogs[i].first); } - auto logRouters = getWorkersForRoleInDatacenter( req.dcId, ProcessClass::LogRouter, req.configuration.getDesiredLogRouters(), req.configuration, id_used ); + auto logRouters = getWorkersForRoleInDatacenter( req.dcId, ProcessClass::LogRouter, req.logRouterCount, req.configuration, id_used ); for(int i = 0; i < logRouters.size(); i++) { result.logRouters.push_back(logRouters[i].first); } @@ -594,6 +594,9 @@ public: for(int i = 0; i < proxies.size(); i++) result.proxies.push_back(proxies[i].first); + auto logRouters = getWorkersForRoleInDatacenter( remoteDcId, ProcessClass::LogRouter, req.configuration.getDesiredLogRouters(), req.configuration, id_used ); + result.logRouterCount = logRouters.size() ? logRouters.size() : 1; + if( now() - startTime < SERVER_KNOBS->WAIT_FOR_GOOD_RECRUITMENT_DELAY && ( RoleFitness(tlogs, ProcessClass::TLog) > RoleFitness(SERVER_KNOBS->EXPECTED_TLOG_FITNESS, req.configuration.getDesiredLogs()) || ( region.satelliteTLogReplicationFactor > 0 && RoleFitness(satelliteLogs, ProcessClass::TLog) > RoleFitness(SERVER_KNOBS->EXPECTED_TLOG_FITNESS, req.configuration.getDesiredSatelliteLogs(dcId)) ) || @@ -642,6 +645,7 @@ public: } } else { RecruitFromConfigurationReply result; + result.logRouterCount = 0; std::map< Optional>, int> id_used; id_used[masterProcessId]++; id_used[clusterControllerProcessId]++; diff --git a/fdbserver/ClusterRecruitmentInterface.h b/fdbserver/ClusterRecruitmentInterface.h index b3f23527ac..b52f68b618 100644 --- a/fdbserver/ClusterRecruitmentInterface.h +++ b/fdbserver/ClusterRecruitmentInterface.h @@ -85,26 +85,27 @@ struct RecruitFromConfigurationReply { vector proxies; vector resolvers; vector storageServers; + int logRouterCount; Optional dcId; template void serialize( Ar& ar ) { - ar & tLogs & satelliteTLogs & proxies & resolvers & storageServers & dcId; + ar & tLogs & satelliteTLogs & proxies & resolvers & storageServers & dcId & logRouterCount; } }; struct RecruitRemoteFromConfigurationRequest { DatabaseConfiguration configuration; Optional dcId; + int logRouterCount; ReplyPromise< struct RecruitRemoteFromConfigurationReply > reply; RecruitRemoteFromConfigurationRequest() {} - explicit RecruitRemoteFromConfigurationRequest(DatabaseConfiguration const& configuration, Optional const& dcId) - : configuration(configuration), dcId(dcId) {} + RecruitRemoteFromConfigurationRequest(DatabaseConfiguration const& configuration, Optional const& dcId, int logRouterCount) : configuration(configuration), dcId(dcId), logRouterCount(logRouterCount) {} template void serialize( Ar& ar ) { - ar & configuration & dcId & reply; + ar & configuration & dcId & logRouterCount & reply; } }; diff --git a/fdbserver/TagPartitionedLogSystem.actor.cpp b/fdbserver/TagPartitionedLogSystem.actor.cpp index c79b9843d8..e1d8e8be2b 100644 --- a/fdbserver/TagPartitionedLogSystem.actor.cpp +++ b/fdbserver/TagPartitionedLogSystem.actor.cpp @@ -1007,6 +1007,11 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCountedminRouters) { + 
TraceEvent("RemoteLogRecruitment_MismatchedLogRouters").detail("minRouters", self->minRouters).detail("workers", remoteWorkers.logRouters.size()); + throw master_recovery_failed(); + } state Reference logSet = Reference( new LogSet() ); logSet->tLogReplicationFactor = configuration.remoteTLogReplicationFactor; @@ -1078,7 +1083,6 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCountedlogServers[i]->get().interf().recoveryFinished.getReplyUnlessFailedFor( TLogRecoveryFinishedRequest(), SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY ) ), master_recovery_failed() ) ); self->remoteRecoveryComplete = waitForAll(recoveryComplete); - logSet->logRouters.resize(remoteWorkers.remoteTLogs.size()); self->tLogs.push_back( logSet ); TraceEvent("RemoteLogRecruitment_CompletingRecovery"); return Void(); @@ -1114,7 +1118,7 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCounted 0) { - logSystem->minRouters = configuration.getDesiredLogRouters(); + logSystem->minRouters = recr.logRouterCount; logSystem->expectedLogSets++; } else { logSystem->minRouters = 0; diff --git a/fdbserver/masterserver.actor.cpp b/fdbserver/masterserver.actor.cpp index 91d1d859d8..ac6a371277 100644 --- a/fdbserver/masterserver.actor.cpp +++ b/fdbserver/masterserver.actor.cpp @@ -309,7 +309,7 @@ ACTOR Future newTLogServers( Reference self, RecruitFromConfig self->dcId_locality[remoteDcId] = loc; } - Future fRemoteWorkers = brokenPromiseToNever( self->clusterController.recruitRemoteFromConfiguration.getReply( RecruitRemoteFromConfigurationRequest( self->configuration, remoteDcId ) ) ); + Future fRemoteWorkers = brokenPromiseToNever( self->clusterController.recruitRemoteFromConfiguration.getReply( RecruitRemoteFromConfigurationRequest( self->configuration, remoteDcId, recr.logRouterCount ) ) ); Reference newLogSystem = wait( oldLogSystem->newEpoch( recr, fRemoteWorkers, self->configuration, self->cstate.myDBState.recoveryCount + 1, self->dcId_locality[recr.dcId], self->dcId_locality[remoteDcId] ) ); self->logSystem = newLogSystem; From 68606c79843b3963107aa7b97449df10bdb73fc0 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Tue, 6 Mar 2018 18:38:05 -0800 Subject: [PATCH 007/127] fix: sim2 logic for when a kill is safe was incorrect --- fdbrpc/sim2.actor.cpp | 16 ++++++++++++---- fdbserver/tester.actor.cpp | 1 - 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/fdbrpc/sim2.actor.cpp b/fdbrpc/sim2.actor.cpp index 080dd928b7..fe552f34e5 100644 --- a/fdbrpc/sim2.actor.cpp +++ b/fdbrpc/sim2.actor.cpp @@ -1045,7 +1045,7 @@ public: std::vector badCombo; std::set>> uniqueMachines; - if(!hasRemoteReplication) { + if(!primaryDcId.present()) { for (auto processInfo : availableProcesses) { primaryProcessesLeft.add(processInfo->locality); primaryLocalitiesLeft.push_back(processInfo->locality); @@ -1093,21 +1093,29 @@ public: bool notEnoughLeft = false; bool primaryTLogsDead = tLogWriteAntiQuorum ? !validateAllCombinations(badCombo, primaryProcessesDead, tLogPolicy, primaryLocalitiesLeft, tLogWriteAntiQuorum, false) : primaryProcessesDead.validate(tLogPolicy); - if(!hasRemoteReplication) { + if(!primaryDcId.present()) { tooManyDead = primaryTLogsDead || primaryProcessesDead.validate(storagePolicy); notEnoughLeft = !primaryProcessesLeft.validate(tLogPolicy) || !primaryProcessesLeft.validate(storagePolicy); } else { bool remoteTLogsDead = tLogWriteAntiQuorum ? 
!validateAllCombinations(badCombo, remoteProcessesDead, tLogPolicy, remoteLocalitiesLeft, tLogWriteAntiQuorum, false) : remoteProcessesDead.validate(tLogPolicy); if(!hasSatelliteReplication) { - tooManyDead = primaryTLogsDead || remoteTLogsDead || ( primaryProcessesDead.validate(storagePolicy) && remoteProcessesDead.validate(storagePolicy) ); notEnoughLeft = ( !primaryProcessesLeft.validate(tLogPolicy) || !primaryProcessesLeft.validate(storagePolicy) ) && ( !remoteProcessesLeft.validate(tLogPolicy) || !remoteProcessesLeft.validate(storagePolicy) ); + if(hasRemoteReplication) { + tooManyDead = primaryTLogsDead || remoteTLogsDead || ( primaryProcessesDead.validate(storagePolicy) && remoteProcessesDead.validate(storagePolicy) ); + } else { + tooManyDead = primaryTLogsDead || remoteTLogsDead || primaryProcessesDead.validate(storagePolicy) || remoteProcessesDead.validate(storagePolicy); + } } else { bool primarySatelliteTLogsDead = satelliteTLogWriteAntiQuorum ? !validateAllCombinations(badCombo, primarySatelliteProcessesDead, satelliteTLogPolicy, primarySatelliteLocalitiesLeft, satelliteTLogWriteAntiQuorum, false) : primarySatelliteProcessesDead.validate(satelliteTLogPolicy); bool remoteSatelliteTLogsDead = satelliteTLogWriteAntiQuorum ? !validateAllCombinations(badCombo, remoteSatelliteProcessesDead, satelliteTLogPolicy, remoteSatelliteLocalitiesLeft, satelliteTLogWriteAntiQuorum, false) : remoteSatelliteProcessesDead.validate(satelliteTLogPolicy); - tooManyDead = ( primaryTLogsDead && primarySatelliteTLogsDead ) || ( remoteTLogsDead && remoteSatelliteTLogsDead ) || ( primaryProcessesDead.validate(storagePolicy) && remoteProcessesDead.validate(storagePolicy) ); notEnoughLeft = ( !primaryProcessesLeft.validate(tLogPolicy) || !primaryProcessesLeft.validate(storagePolicy) || !primarySatelliteProcessesLeft.validate(satelliteTLogPolicy) ) && ( !remoteProcessesLeft.validate(tLogPolicy) || !remoteProcessesLeft.validate(storagePolicy) || !remoteSatelliteProcessesLeft.validate(satelliteTLogPolicy) ); + if(hasRemoteReplication) { + tooManyDead = ( primaryTLogsDead && primarySatelliteTLogsDead ) || ( remoteTLogsDead && remoteSatelliteTLogsDead ) || ( primaryProcessesDead.validate(storagePolicy) && remoteProcessesDead.validate(storagePolicy) ); + } else { + tooManyDead = ( primaryTLogsDead && primarySatelliteTLogsDead ) || ( remoteTLogsDead && remoteSatelliteTLogsDead ) || primaryProcessesDead.validate(storagePolicy) || remoteProcessesDead.validate(storagePolicy); + } } } diff --git a/fdbserver/tester.actor.cpp b/fdbserver/tester.actor.cpp index 4e2424d033..ce619d172e 100644 --- a/fdbserver/tester.actor.cpp +++ b/fdbserver/tester.actor.cpp @@ -999,7 +999,6 @@ ACTOR Future reconfigureAfter(Database cx, double time) { if(g_network->isSimulated()) { TraceEvent(SevWarnAlways, "DisablingFearlessConfiguration"); g_simulator.hasRemoteReplication = false; - g_simulator.hasSatelliteReplication = false; ConfigurationResult::Type _ = wait( changeConfig( cx, "remote_none" ) ); if (g_network->isSimulated() && g_simulator.extraDB) { Reference extraFile(new ClusterConnectionFile(*g_simulator.extraDB)); From 9d4cdc828bca17570ba0f0b62b1d07ee0b19dd43 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Wed, 7 Mar 2018 12:54:53 -0800 Subject: [PATCH 008/127] fix: inactive cursors are still useful if their version is larger than the current version --- fdbserver/LogSystemPeekCursor.actor.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fdbserver/LogSystemPeekCursor.actor.cpp 
b/fdbserver/LogSystemPeekCursor.actor.cpp index cce070d137..224d35eb05 100644 --- a/fdbserver/LogSystemPeekCursor.actor.cpp +++ b/fdbserver/LogSystemPeekCursor.actor.cpp @@ -617,7 +617,7 @@ ACTOR Future setPeekGetMore(ILogSystem::SetPeekCursor* self, LogMessageVer if(bestSetValid) { self->localityGroup.clear(); for( int i = 0; i < self->serverCursors[self->bestSet].size(); i++) { - if(!self->serverCursors[self->bestSet][i]->isActive()) { + if(!self->serverCursors[self->bestSet][i]->isActive() && self->serverCursors[self->bestSet][i]->version() <= self->messageVersion) { self->localityGroup.add(self->logSets[self->bestSet]->tLogLocalities[i]); } } From fa7eaea7cf02bc710140bc27c364650877d9cb04 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Thu, 8 Mar 2018 10:50:05 -0800 Subject: [PATCH 009/127] fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers --- fdbserver/DataDistribution.actor.cpp | 130 +++++++++++++++----- fdbserver/DataDistribution.h | 40 ++++-- fdbserver/DataDistributionQueue.actor.cpp | 15 ++- fdbserver/DataDistributionTracker.actor.cpp | 82 +++++------- 4 files changed, 168 insertions(+), 99 deletions(-) diff --git a/fdbserver/DataDistribution.actor.cpp b/fdbserver/DataDistribution.actor.cpp index 007bf23467..5ceb15e167 100644 --- a/fdbserver/DataDistribution.actor.cpp +++ b/fdbserver/DataDistribution.actor.cpp @@ -327,7 +327,7 @@ ACTOR Future storageServerFailureTracker( } // Read keyservers, return unique set of teams -ACTOR Future> getInitialDataDistribution( Database cx, UID masterId, MoveKeysLock moveKeysLock ) { +ACTOR Future> getInitialDataDistribution( Database cx, UID masterId, MoveKeysLock moveKeysLock, std::vector> remoteDcIds ) { state Reference result = Reference(new InitialDataDistribution); state Key beginKey = allKeys.begin; @@ -335,8 +335,12 @@ ACTOR Future> getInitialDataDistribution( Dat state Transaction tr( cx ); + state std::map> server_dc; + state std::map, std::pair, vector>> team_cache; + //Get the server list in its own try/catch block since it modifies result. 
We don't want a subsequent failure causing entries to be duplicated loop { + server_dc.clear(); succeeded = false; try { result->mode = 1; @@ -364,6 +368,7 @@ ACTOR Future> getInitialDataDistribution( Dat for( int i = 0; i < serverList.get().size(); i++ ) { auto ssi = decodeServerListValue( serverList.get()[i].value ); result->allServers.push_back( std::make_pair(ssi, id_data[ssi.locality.processId()].processClass) ); + server_dc[ssi.id()] = ssi.locality.dcId(); } break; @@ -392,17 +397,56 @@ ACTOR Future> getInitialDataDistribution( Dat // for each range for(int i = 0; i < keyServers.size() - 1; i++) { - KeyRangeRef keys( keyServers[i].key, keyServers[i+1].key ); + ShardInfo info( keyServers[i].key ); decodeKeyServersValue( keyServers[i].value, src, dest ); - std::pair,vector> teams; - for(int j=0; jshards.push_back( keyRangeWith(keys, teams) ); - result->teams.insert( teams.first ); - if (dest.size()) - result->teams.insert( teams.second ); + if(remoteDcIds.size()) { + auto srcIter = team_cache.find(src); + if(srcIter == team_cache.end()) { + for(auto& id : src) { + auto& dc = server_dc[id]; + if(std::find(remoteDcIds.begin(), remoteDcIds.end(), dc) != remoteDcIds.end()) { + info.remoteSrc.push_back(id); + } else { + info.primarySrc.push_back(id); + } + } + result->primaryTeams.insert( info.primarySrc ); + result->remoteTeams.insert( info.remoteSrc ); + team_cache[src] = std::make_pair(info.primarySrc, info.remoteSrc); + } else { + info.primarySrc = srcIter->second.first; + info.remoteSrc = srcIter->second.second; + } + if(dest.size()) { + info.hasDest = true; + auto destIter = team_cache.find(dest); + if(destIter == team_cache.end()) { + for(auto& id : dest) { + auto& dc = server_dc[id]; + if(std::find(remoteDcIds.begin(), remoteDcIds.end(), dc) != remoteDcIds.end()) { + info.remoteDest.push_back(id); + } else { + info.primaryDest.push_back(id); + } + } + result->primaryTeams.insert( info.primaryDest ); + result->remoteTeams.insert( info.remoteDest ); + team_cache[dest] = std::make_pair(info.primaryDest, info.remoteDest); + } else { + info.primaryDest = destIter->second.first; + info.remoteDest = destIter->second.second; + } + } + } else { + info.primarySrc = src; + result->primaryTeams.insert( src ); + if (dest.size()) { + info.hasDest = true; + info.primaryDest = dest; + result->primaryTeams.insert( dest ); + } + } + result->shards.push_back( info ); } ASSERT(keyServers.size() > 0); @@ -420,7 +464,7 @@ ACTOR Future> getInitialDataDistribution( Dat } // a dummy shard at the end with no keys or servers makes life easier for trackInitialShards() - result->shards.push_back( keyRangeWith(KeyRangeRef(allKeys.end,allKeys.end), std::pair, vector>()) ); + result->shards.push_back( ShardInfo(allKeys.end) ); return result; } @@ -480,6 +524,7 @@ struct DDTeamCollection { std::vector> includedDCs; Optional>> otherTrackedDCs; + bool primary; DDTeamCollection( Database const& cx, UID masterId, @@ -490,12 +535,12 @@ struct DDTeamCollection { std::vector> includedDCs, Optional>> otherTrackedDCs, PromiseStream< std::pair> > const& serverChanges, - Future readyToStart, Reference> zeroHealthyTeams ) + Future readyToStart, Reference> zeroHealthyTeams, bool primary ) :cx(cx), masterId(masterId), lock(lock), output(output), shardsAffectedByTeamFailure(shardsAffectedByTeamFailure), doBuildTeams( true ), teamBuilder( Void() ), configuration(configuration), serverChanges(serverChanges), initialFailureReactionDelay( delay( BUGGIFY ? 
0 : SERVER_KNOBS->INITIAL_FAILURE_REACTION_DELAY, TaskDataDistribution ) ), healthyTeamCount( 0 ), initializationDoneActor(logOnCompletion(readyToStart && initialFailureReactionDelay, this)), optimalTeamCount( 0 ), recruitingStream(0), restartRecruiting( SERVER_KNOBS->DEBOUNCE_RECRUITING_DELAY ), - unhealthyServers(0), includedDCs(includedDCs), otherTrackedDCs(otherTrackedDCs), zeroHealthyTeams(zeroHealthyTeams), zeroOptimalTeams(true) + unhealthyServers(0), includedDCs(includedDCs), otherTrackedDCs(otherTrackedDCs), zeroHealthyTeams(zeroHealthyTeams), zeroOptimalTeams(true), primary(primary) { TraceEvent("DDTrackerStarting", masterId) .detail( "State", "Inactive" ) @@ -759,8 +804,14 @@ struct DDTeamCollection { } } - for(auto t = initTeams.teams.begin(); t != initTeams.teams.end(); ++t) { - addTeam(t->begin(), t->end() ); + if(primary) { + for(auto t = initTeams.primaryTeams.begin(); t != initTeams.primaryTeams.end(); ++t) { + addTeam(t->begin(), t->end() ); + } + } else { + for(auto t = initTeams.remoteTeams.begin(); t != initTeams.remoteTeams.end(); ++t) { + addTeam(t->begin(), t->end() ); + } } addSubsetOfEmergencyTeams(); @@ -830,9 +881,6 @@ struct DDTeamCollection { } } - if(newTeamServers.empty()) { - return; - } Reference teamInfo( new TCTeamInfo( newTeamServers ) ); TraceEvent("TeamCreation", masterId).detail("Team", teamInfo->getDesc()); teamInfo->tracker = teamTracker( this, teamInfo ); @@ -1290,20 +1338,20 @@ ACTOR Future teamTracker( DDTeamCollection *self, ReferencezeroHealthyTeams->get(); //set this again in case it changed from this teams health changing if( self->initialFailureReactionDelay.isReady() && !self->zeroHealthyTeams->get() ) { - vector shards = self->shardsAffectedByTeamFailure->getShardsFor( team->getServerIDs() ); + vector shards = self->shardsAffectedByTeamFailure->getShardsFor( ShardsAffectedByTeamFailure::Team(team->getServerIDs(), self->primary) ); for(int i=0; igetPriority(); auto teams = self->shardsAffectedByTeamFailure->getTeamsFor( shards[i] ); for( int t=0; tserver_info.count( teams[t][0] ) ) { - auto& info = self->server_info[teams[t][0]]; + if( self->server_info.count( teams[t].servers[0] ) ) { + auto& info = self->server_info[teams[t].servers[0]]; bool found = false; for( int i = 0; i < info->teams.size(); i++ ) { - if( info->teams[i]->serverIDs == teams[t] ) { + if( info->teams[i]->serverIDs == teams[t].servers ) { maxPriority = std::max( maxPriority, info->teams[i]->getPriority() ); found = true; break; @@ -1826,9 +1874,10 @@ ACTOR Future dataDistributionTeamCollection( Optional>> otherTrackedDCs, PromiseStream< std::pair> > serverChanges, Future readyToStart, - Reference> zeroHealthyTeams ) + Reference> zeroHealthyTeams, + bool primary) { - state DDTeamCollection self( cx, masterId, lock, output, shardsAffectedByTeamFailure, configuration, includedDCs, otherTrackedDCs, serverChanges, readyToStart, zeroHealthyTeams ); + state DDTeamCollection self( cx, masterId, lock, output, shardsAffectedByTeamFailure, configuration, includedDCs, otherTrackedDCs, serverChanges, readyToStart, zeroHealthyTeams, primary ); state Future loggingTrigger = Void(); state PromiseStream serverRemoved; state Future interfaceChanges; @@ -2113,9 +2162,9 @@ ACTOR Future dataDistribution( TraceEvent("DDInitTakingMoveKeysLock", mi.id()); state MoveKeysLock lock = wait( takeMoveKeysLock( cx, mi.id() ) ); TraceEvent("DDInitTookMoveKeysLock", mi.id()); - state Reference initData = wait( getInitialDataDistribution(cx, mi.id(), lock) ); + state Reference initData = wait( 
getInitialDataDistribution(cx, mi.id(), lock, configuration.remoteTLogReplicationFactor > 0 ? remoteDcIds : std::vector>() ) ); if(initData->shards.size() > 1) { - TraceEvent("DDInitGotInitialDD", mi.id()).detail("b", printable(initData->shards.end()[-2].begin)).detail("e", printable(initData->shards.end()[-2].end)).detail("src", describe(initData->shards.end()[-2].value.first)).detail("dest", describe(initData->shards.end()[-2].value.second)).trackLatest("InitialDD"); + TraceEvent("DDInitGotInitialDD", mi.id()).detail("b", printable(initData->shards.end()[-2].key)).detail("e", printable(initData->shards.end()[-1].key)).detail("src", describe(initData->shards.end()[-2].primarySrc)).detail("dest", describe(initData->shards.end()[-2].primaryDest)).trackLatest("InitialDD"); } else { TraceEvent("DDInitGotInitialDD", mi.id()).detail("b","").detail("e", "").detail("src", "[no items]").detail("dest", "[no items]").trackLatest("InitialDD"); } @@ -2168,13 +2217,29 @@ ACTOR Future dataDistribution( Reference shardsAffectedByTeamFailure( new ShardsAffectedByTeamFailure ); + for(int s=0; sshards.size() - 1; s++) { + KeyRangeRef keys = KeyRangeRef(initData->shards[s].key, initData->shards[s+1].key); + shardsAffectedByTeamFailure->defineShard(keys); + std::vector teams; + teams.push_back(ShardsAffectedByTeamFailure::Team(initData->shards[s].primarySrc, true)); + if(configuration.remoteTLogReplicationFactor > 0) { + teams.push_back(ShardsAffectedByTeamFailure::Team(initData->shards[s].remoteSrc, false)); + } + shardsAffectedByTeamFailure->moveShard(keys, teams); + if(initData->shards[s].hasDest) { + // This shard is already in flight. Ideally we should use dest in sABTF and generate a dataDistributionRelocator directly in + // DataDistributionQueue to track it, but it's easier to just (with low priority) schedule it for movement. + output.send( RelocateShard( keys, PRIORITY_RECOVER_MOVE ) ); + } + } + actors.push_back( pollMoveKeysLock(cx, lock) ); actors.push_back( popOldTags( cx, logSystem, recoveryCommitVersion) ); - actors.push_back( reportErrorsExcept( dataDistributionTracker( initData, cx, shardsAffectedByTeamFailure, output, getShardMetrics, getAverageShardBytes.getFuture(), readyToStart, anyZeroHealthyTeams, mi.id() ), "DDTracker", mi.id(), &normalDDQueueErrors() ) ); + actors.push_back( reportErrorsExcept( dataDistributionTracker( initData, cx, output, getShardMetrics, getAverageShardBytes.getFuture(), readyToStart, anyZeroHealthyTeams, mi.id() ), "DDTracker", mi.id(), &normalDDQueueErrors() ) ); actors.push_back( reportErrorsExcept( dataDistributionQueue( cx, output, getShardMetrics, tcis, shardsAffectedByTeamFailure, lock, getAverageShardBytes, mi, storageTeamSize, configuration.durableStorageQuorum, lastLimited ), "DDQueue", mi.id(), &normalDDQueueErrors() ) ); - actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[0], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, primaryDcId, configuration.remoteTLogReplicationFactor > 0 ? remoteDcIds : std::vector>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[0] ), "DDTeamCollectionPrimary", mi.id(), &normalDDQueueErrors() ) ); + actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[0], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, primaryDcId, configuration.remoteTLogReplicationFactor > 0 ? 
remoteDcIds : std::vector>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[0], true ), "DDTeamCollectionPrimary", mi.id(), &normalDDQueueErrors() ) ); if (configuration.remoteTLogReplicationFactor > 0) { - actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[1], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, remoteDcIds, Optional>>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[1] ), "DDTeamCollectionSecondary", mi.id(), &normalDDQueueErrors() ) ); + actors.push_back( reportErrorsExcept( dataDistributionTeamCollection( initData, tcis[1], cx, db, shardsAffectedByTeamFailure, lock, output, mi.id(), configuration, remoteDcIds, Optional>>(), serverChanges, readyToStart.getFuture(), zeroHealthyTeams[1], false ), "DDTeamCollectionSecondary", mi.id(), &normalDDQueueErrors() ) ); } Void _ = wait( waitForAll( actors ) ); @@ -2215,7 +2280,8 @@ DDTeamCollection* testTeamCollection(int teamSize, IRepPolicyRef policy, int pro {}, PromiseStream>>(), Future(Void()), - Reference>( new AsyncVar(true) ) + Reference>( new AsyncVar(true) ), + true ); for(int id = 1; id <= processCount; id++) { diff --git a/fdbserver/DataDistribution.h b/fdbserver/DataDistribution.h index ed2a35e7c1..93de1b9a3d 100644 --- a/fdbserver/DataDistribution.h +++ b/fdbserver/DataDistribution.h @@ -120,7 +120,23 @@ struct TeamCollectionInterface { class ShardsAffectedByTeamFailure : public ReferenceCounted { public: ShardsAffectedByTeamFailure() {} - typedef vector Team; // sorted + + struct Team { + vector servers; // sorted + bool primary; + + Team() : primary(true) {} + Team(vector const& servers, bool primary) : servers(servers), primary(primary) {} + + bool operator < ( const Team& r ) const { + if( servers == r.servers ) return primary < r.primary; + return servers < r.servers; + } + bool operator == ( const Team& r ) const { + return servers == r.servers && primary == r.primary; + } + }; + // This tracks the data distribution on the data distribution server so that teamTrackers can // relocate the right shards when a team is degraded. 
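// Illustrative usage of the Team struct above, assuming a two-region configuration
// (primaryServers, remoteServers, and keys are placeholder names for this sketch, not
// identifiers introduced by the patch): each shard is covered by a primary team and, when
// remote replication is configured, a remote team, so a degraded team in either region can
// be mapped back to the key ranges it serves.
//
//   std::vector<ShardsAffectedByTeamFailure::Team> teams;
//   teams.push_back( ShardsAffectedByTeamFailure::Team( primaryServers, true ) );
//   teams.push_back( ShardsAffectedByTeamFailure::Team( remoteServers, false ) );
//   shardsAffectedByTeamFailure->defineShard( keys );
//   shardsAffectedByTeamFailure->moveShard( keys, teams );
//
//   // A team tracker later locates the shards that depend on it with:
//   vector<KeyRange> shards = shardsAffectedByTeamFailure->getShardsFor(
//       ShardsAffectedByTeamFailure::Team( team->getServerIDs(), /*primary=*/ true ) );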
@@ -138,9 +154,9 @@ public: int getNumberOfShards( UID ssID ); vector getShardsFor( Team team ); - vector> getTeamsFor( KeyRangeRef keys ); + vector getTeamsFor( KeyRangeRef keys ); void defineShard( KeyRangeRef keys ); - void moveShard( KeyRangeRef keys, Team destinationTeam ); + void moveShard( KeyRangeRef keys, std::vector destinationTeam ); void check(); private: struct OrderByTeamKey { @@ -159,12 +175,23 @@ private: void insert(Team team, KeyRange const& range); }; +struct ShardInfo { + Key key; + vector primarySrc; + vector remoteSrc; + vector primaryDest; + vector remoteDest; + bool hasDest; + + explicit ShardInfo(Key key) : key(key), hasDest(false) {} +}; + struct InitialDataDistribution : ReferenceCounted { - typedef vector Team; // sorted int mode; vector> allServers; - std::set< Team > teams; - vector>> shards; + std::set> primaryTeams; + std::set> remoteTeams; + vector shards; }; Future dataDistribution( @@ -180,7 +207,6 @@ Future dataDistribution( Future dataDistributionTracker( Reference const& initData, Database const& cx, - Reference const& shardsAffectedByTeamFailure, PromiseStream const& output, PromiseStream const& getShardMetrics, FutureStream> const& getAverageShardBytes, diff --git a/fdbserver/DataDistributionQueue.actor.cpp b/fdbserver/DataDistributionQueue.actor.cpp index ee1c4960be..8bac1b521c 100644 --- a/fdbserver/DataDistributionQueue.actor.cpp +++ b/fdbserver/DataDistributionQueue.actor.cpp @@ -837,6 +837,7 @@ ACTOR Future dataDistributionRelocator( DDQueueData *self, RelocateData rd state bool signalledTransferComplete = false; state UID masterId = self->mi.id(); state ParallelTCInfo destination; + state std::vector destinationTeams; state ParallelTCInfo healthyDestinations; state bool anyHealthy = false; state int durableStorageQuorum = 0; @@ -865,6 +866,7 @@ ACTOR Future dataDistributionRelocator( DDQueueData *self, RelocateData rd state int tciIndex = 0; state bool foundTeams = true; destination.clear(); + destinationTeams.clear(); healthyDestinations.clear(); anyHealthy = false; durableStorageQuorum = 0; @@ -883,6 +885,7 @@ ACTOR Future dataDistributionRelocator( DDQueueData *self, RelocateData rd Optional> bestTeam = wait(brokenPromiseToNever(self->teamCollections[tciIndex].getTeam.getReply(req))); if (bestTeam.present()) { destination.addTeam(bestTeam.get()); + destinationTeams.push_back(ShardsAffectedByTeamFailure::Team(bestTeam.get()->getServerIDs(), tciIndex == 0)); if(bestTeam.get()->isHealthy()) { healthyDestinations.addTeam(bestTeam.get()); anyHealthy = true; @@ -912,7 +915,7 @@ ACTOR Future dataDistributionRelocator( DDQueueData *self, RelocateData rd Void _ = wait( delay( SERVER_KNOBS->BEST_TEAM_STUCK_DELAY, TaskDataDistributionLaunch ) ); } - self->shardsAffectedByTeamFailure->moveShard(rd.keys, destination.getServerIDs()); + self->shardsAffectedByTeamFailure->moveShard(rd.keys, destinationTeams); //FIXME: do not add data in flight to servers that were already in the src. 
destination.addDataInFlightToTeam(+metrics.bytes); @@ -1009,12 +1012,12 @@ ACTOR Future dataDistributionRelocator( DDQueueData *self, RelocateData rd } } -ACTOR Future rebalanceTeams( DDQueueData* self, int priority, Reference sourceTeam, Reference destTeam ) { +ACTOR Future rebalanceTeams( DDQueueData* self, int priority, Reference sourceTeam, Reference destTeam, bool primary ) { if(g_network->isSimulated() && g_simulator.speedUpSimulation) { return false; } - std::vector shards = self->shardsAffectedByTeamFailure->getShardsFor( sourceTeam->getServerIDs() ); + std::vector shards = self->shardsAffectedByTeamFailure->getShardsFor( ShardsAffectedByTeamFailure::Team( sourceTeam->getServerIDs(), primary ) ); if( !shards.size() ) return false; @@ -1028,7 +1031,7 @@ ACTOR Future rebalanceTeams( DDQueueData* self, int priority, Reference shards = self->shardsAffectedByTeamFailure->getShardsFor( sourceTeam->getServerIDs() ); + std::vector shards = self->shardsAffectedByTeamFailure->getShardsFor( ShardsAffectedByTeamFailure::Team( sourceTeam->getServerIDs(), primary ) ); for( int i = 0; i < shards.size(); i++ ) { if( moveShard == shards[i] ) { TraceEvent(priority == PRIORITY_REBALANCE_OVERUTILIZED_TEAM ? "BgDDMountainChopper" : "BgDDValleyFiller", self->mi.id()) @@ -1057,7 +1060,7 @@ ACTOR Future BgDDMountainChopper( DDQueueData* self, int teamCollectionInd if( randomTeam.get()->getMinFreeSpaceRatio() > SERVER_KNOBS->FREE_SPACE_RATIO_DD_CUTOFF ) { state Optional> loadedTeam = wait( brokenPromiseToNever( self->teamCollections[teamCollectionIndex].getTeam.getReply( GetTeamRequest( true, true, false ) ) ) ); if( loadedTeam.present() ) { - bool moved = wait( rebalanceTeams( self, PRIORITY_REBALANCE_OVERUTILIZED_TEAM, loadedTeam.get(), randomTeam.get() ) ); + bool moved = wait( rebalanceTeams( self, PRIORITY_REBALANCE_OVERUTILIZED_TEAM, loadedTeam.get(), randomTeam.get(), teamCollectionIndex == 0 ) ); if(moved) { resetCount = 0; } else { @@ -1092,7 +1095,7 @@ ACTOR Future BgDDValleyFiller( DDQueueData* self, int teamCollectionIndex) state Optional> unloadedTeam = wait( brokenPromiseToNever( self->teamCollections[teamCollectionIndex].getTeam.getReply( GetTeamRequest( true, true, true ) ) ) ); if( unloadedTeam.present() ) { if( unloadedTeam.get()->getMinFreeSpaceRatio() > SERVER_KNOBS->FREE_SPACE_RATIO_DD_CUTOFF ) { - bool moved = wait( rebalanceTeams( self, PRIORITY_REBALANCE_UNDERUTILIZED_TEAM, randomTeam.get(), unloadedTeam.get() ) ); + bool moved = wait( rebalanceTeams( self, PRIORITY_REBALANCE_UNDERUTILIZED_TEAM, randomTeam.get(), unloadedTeam.get(), teamCollectionIndex == 0 ) ); if(moved) { resetCount = 0; } else { diff --git a/fdbserver/DataDistributionTracker.actor.cpp b/fdbserver/DataDistributionTracker.actor.cpp index 61db4b6945..3cc754daa2 100644 --- a/fdbserver/DataDistributionTracker.actor.cpp +++ b/fdbserver/DataDistributionTracker.actor.cpp @@ -598,9 +598,7 @@ void restartShardTrackers( DataDistributionTracker* self, KeyRangeRef keys, Opti } } -ACTOR Future trackInitialShards(DataDistributionTracker *self, - Reference initData, - Reference shardsAffectedByTeamFailure) +ACTOR Future trackInitialShards(DataDistributionTracker *self, Reference initData) { TraceEvent("TrackInitialShards", self->masterId).detail("InitialShardCount", initData->shards.size()); @@ -608,35 +606,9 @@ ACTOR Future trackInitialShards(DataDistributionTracker *self, //SOMEDAY: Figure out what this priority should actually be Void _ = wait( delay( 0.0, TaskDataDistribution ) ); - state int lastBegin = -1; - state vector 
last; - state int s; - for(s=0; sshards.size(); s++) { - state InitialDataDistribution::Team src = initData->shards[s].value.first; - auto& dest = initData->shards[s].value.second; - if (dest.size()) { - // This shard is already in flight. Ideally we should use dest in sABTF and generate a dataDistributionRelocator directly in - // DataDistributionQueue to track it, but it's easier to just (with low priority) schedule it for movement. - self->output.send( RelocateShard( initData->shards[s], PRIORITY_RECOVER_MOVE ) ); - } - - // The following clause was here for no remembered reason. It was removed, however, because on resumption of stopped - // clusters (of size 3) it was grouping all the the shards in the system into one, and then splitting them all back out, - // causing unecessary data distribution. - //if (s==0 || s+1==initData.shards.size() || lastBegin<0 || src != last || initData.shards[s].begin == keyServersPrefix) { - // end current run, start a new shardTracker - // relies on the dummy shard at allkeysend - - if (lastBegin >= 0) { - state KeyRangeRef keys( initData->shards[lastBegin].begin, initData->shards[s].begin ); - restartShardTrackers( self, keys ); - shardsAffectedByTeamFailure->defineShard( keys ); - shardsAffectedByTeamFailure->moveShard( keys, last ); - } - lastBegin = s; - last = src; - //} + for(s=0; sshards.size()-1; s++) { + restartShardTrackers( self, KeyRangeRef( initData->shards[s].key, initData->shards[s+1].key ) ); Void _ = wait( yield( TaskDataDistribution ) ); } @@ -692,7 +664,6 @@ ACTOR Future fetchShardMetrics( DataDistributionTracker* self, GetMetricsR ACTOR Future dataDistributionTracker( Reference initData, Database cx, - Reference shardsAffectedByTeamFailure, PromiseStream output, PromiseStream getShardMetrics, FutureStream> getAverageShardBytes, @@ -703,7 +674,7 @@ ACTOR Future dataDistributionTracker( state DataDistributionTracker self(cx, masterId, readyToStart, output, anyZeroHealthyTeams); state Future loggingTrigger = Void(); try { - Void _ = wait( trackInitialShards( &self, initData, shardsAffectedByTeamFailure ) ); + Void _ = wait( trackInitialShards( &self, initData ) ); initData = Reference(); loop choose { @@ -731,9 +702,7 @@ ACTOR Future dataDistributionTracker( vector ShardsAffectedByTeamFailure::getShardsFor( Team team ) { vector r; - for(auto it = team_shards.lower_bound( std::pair( team, KeyRangeRef() ) ); - it != team_shards.end() && it->first == team; - ++it) + for(auto it = team_shards.lower_bound( std::pair( team, KeyRangeRef() ) ); it != team_shards.end() && it->first == team; ++it) r.push_back( it->second ); return r; } @@ -742,20 +711,20 @@ int ShardsAffectedByTeamFailure::getNumberOfShards( UID ssID ) { return storageServerShards[ssID]; } -vector> ShardsAffectedByTeamFailure::getTeamsFor( KeyRangeRef keys ) { +vector ShardsAffectedByTeamFailure::getTeamsFor( KeyRangeRef keys ) { return shard_teams[keys.begin]; } void ShardsAffectedByTeamFailure::erase(Team team, KeyRange const& range) { if(team_shards.erase( std::pair(team, range) ) > 0) { - for(auto uid = team.begin(); uid != team.end(); ++uid) + for(auto uid = team.servers.begin(); uid != team.servers.end(); ++uid) storageServerShards[*uid]--; } } void ShardsAffectedByTeamFailure::insert(Team team, KeyRange const& range) { if(team_shards.insert( std::pair( team, range ) ).second) { - for(auto uid = team.begin(); uid != team.end(); ++uid) + for(auto uid = team.servers.begin(); uid != team.servers.end(); ++uid) storageServerShards[*uid]++; } } @@ -787,7 +756,7 @@ void 
ShardsAffectedByTeamFailure::defineShard( KeyRangeRef keys ) { check(); } -void ShardsAffectedByTeamFailure::moveShard( KeyRangeRef keys, Team destinationTeam ) { +void ShardsAffectedByTeamFailure::moveShard( KeyRangeRef keys, std::vector destinationTeams ) { /*TraceEvent("ShardsAffectedByTeamFailureMove") .detail("KeyBegin", printable(keys.begin)) .detail("KeyEnd", printable(keys.end)) @@ -795,31 +764,36 @@ void ShardsAffectedByTeamFailure::moveShard( KeyRangeRef keys, Team destinationT .detail("NewTeam", describe(destinationTeam));*/ auto ranges = shard_teams.intersectingRanges( keys ); - std::vector< std::pair > modifiedShards; + std::vector< std::pair,KeyRange> > modifiedShards; for(auto it = ranges.begin(); it != ranges.end(); ++it) { if( keys.contains( it->range() ) ) { - // erase the many teams that were assiciated with this one shard + // erase the many teams that were associated with this one shard for(auto t = it->value().begin(); t != it->value().end(); ++t) { erase(*t, it->range()); } // save this modification for later insertion - modifiedShards.push_back( std::pair( destinationTeam, it->range() ) ); + modifiedShards.push_back( std::pair,KeyRange>( destinationTeams, it->range() ) ); } else { // for each range that touches this move, add our team as affecting this range - insert(destinationTeam, it->range()); + for(auto& team : destinationTeams) { + insert(team, it->range()); - // if we are not in the list of teams associated with this shard, add us in - auto& teams = it->value(); - if( std::find( teams.begin(), teams.end(), destinationTeam ) == teams.end() ) - teams.push_back( destinationTeam ); + // if we are not in the list of teams associated with this shard, add us in + auto& teams = it->value(); + if( std::find( teams.begin(), teams.end(), team ) == teams.end() ) { + teams.push_back( team ); + } + } } } // we cannot modify the KeyRangeMap while iterating through it, so add saved modifications now for( int i = 0; i < modifiedShards.size(); i++ ) { - insert(modifiedShards[i].first, modifiedShards[i].second); - shard_teams.insert( modifiedShards[i].second, vector( 1, modifiedShards[i].first ) ); + for( auto& t : modifiedShards[i].first) { + insert(t, modifiedShards[i].second); + } + shard_teams.insert( modifiedShards[i].second, modifiedShards[i].first ); } check(); @@ -838,11 +812,11 @@ void ShardsAffectedByTeamFailure::check() { auto rs = shard_teams.ranges(); for(auto i = rs.begin(); i != rs.end(); ++i) for(vector::iterator t = i->value().begin(); t != i->value().end(); ++t) - if (!team_shards.count( make_pair( *t, i->range() ) )) { + if (!team_shards.count( std::make_pair( *t, i->range() ) )) { std::string teamDesc, shards; - for(int k=0; ksize(); k++) - teamDesc += format("%llx ", (*t)[k].first()); - for(auto x = team_shards.lower_bound( make_pair( *t, KeyRangeRef() ) ); x != team_shards.end() && x->first == *t; ++x) + for(int k=0; kservers.size(); k++) + teamDesc += format("%llx ", t->servers[k].first()); + for(auto x = team_shards.lower_bound( std::make_pair( *t, KeyRangeRef() ) ); x != team_shards.end() && x->first == *t; ++x) shards += printable(x->second.begin) + "-" + printable(x->second.end) + ","; TraceEvent(SevError,"SATFInvariantError2") .detail("KB", printable(i->begin())) From c74211bd920772f464a2c23abc8cedf64f3b4292 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Fri, 9 Mar 2018 16:52:37 -0800 Subject: [PATCH 010/127] fix: merge problem --- fdbserver/LeaderElection.actor.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/fdbserver/LeaderElection.actor.cpp b/fdbserver/LeaderElection.actor.cpp index 8942e8e17e..c68d64b6e1 100644 --- a/fdbserver/LeaderElection.actor.cpp +++ b/fdbserver/LeaderElection.actor.cpp @@ -82,7 +82,7 @@ ACTOR Future tryBecomeLeaderInternal( ServerCoordinators coordinators, Val state bool iAmLeader = false; state UID prevChangeID; - if( asyncProcessClass->get().machineClassFitness(ProcessClass::ClusterController) > ProcessClass::UnsetFit || asyncIsExcluded->get() ) { + if( asyncPriorityInfo->get().processClassFitness > ProcessClass::UnsetFit || asyncPriorityInfo->get().dcFitness == ClusterControllerPriorityInfo::FitnessBad || asyncPriorityInfo->get().isExcluded ) { Void _ = wait( delay(SERVER_KNOBS->WAIT_FOR_GOOD_RECRUITMENT_DELAY) ); } From 72d56a700c409a3ab8bed25c6043bf378e7ab3ba Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Sat, 10 Mar 2018 09:52:09 -0800 Subject: [PATCH 011/127] fix: do not serialize a tlog interface without a unique id --- fdbserver/OldTLogServer.actor.cpp | 4 +--- fdbserver/TLogInterface.h | 4 +++- fdbserver/worker.actor.cpp | 3 +-- 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/fdbserver/OldTLogServer.actor.cpp b/fdbserver/OldTLogServer.actor.cpp index cbad3eae60..1249aaf161 100644 --- a/fdbserver/OldTLogServer.actor.cpp +++ b/fdbserver/OldTLogServer.actor.cpp @@ -1273,9 +1273,7 @@ namespace oldTLog { UID id2 = BinaryReader::fromStringRef( fRecoverCounts.get()[idx].key.removePrefix(persistRecoveryCountKeys.begin), Unversioned() ); ASSERT(id1 == id2); - TLogInterface recruited; - recruited.uniqueID = id1; - recruited.locality = locality; + TLogInterface recruited(id1, self->dbgid, locality); recruited.initEndpoints(); DUMPTOKEN( recruited.peekMessages ); diff --git a/fdbserver/TLogInterface.h b/fdbserver/TLogInterface.h index 8836d51040..b3e4c397b5 100644 --- a/fdbserver/TLogInterface.h +++ b/fdbserver/TLogInterface.h @@ -43,7 +43,8 @@ struct TLogInterface { RequestStream> waitFailure; RequestStream< struct TLogRecoveryFinishedRequest > recoveryFinished; - TLogInterface() { } + TLogInterface() {} + explicit TLogInterface(LocalityData locality) : uniqueID( g_random->randomUniqueID() ), locality(locality) { sharedTLogID = uniqueID; } TLogInterface(UID sharedTLogID, LocalityData locality) : uniqueID( g_random->randomUniqueID() ), sharedTLogID(sharedTLogID), locality(locality) {} TLogInterface(UID uniqueID, UID sharedTLogID, LocalityData locality) : uniqueID(uniqueID), sharedTLogID(sharedTLogID), locality(locality) {} UID id() const { return uniqueID; } @@ -61,6 +62,7 @@ struct TLogInterface { template void serialize( Ar& ar ) { + ASSERT(ar.isDeserializing || uniqueID != UID()); ar & uniqueID & sharedTLogID & locality & peekMessages & popMessages & commit & lock & getQueuingMetrics & confirmRunning & waitFailure & recoveryFinished; } diff --git a/fdbserver/worker.actor.cpp b/fdbserver/worker.actor.cpp index c077c164a7..adfa1adbcc 100644 --- a/fdbserver/worker.actor.cpp +++ b/fdbserver/worker.actor.cpp @@ -809,8 +809,7 @@ ACTOR Future workerServer( Reference connFile, Refe req.reply.send(recruited); } when( InitializeLogRouterRequest req = waitNext(interf.logRouter.getFuture()) ) { - TLogInterface recruited; - recruited.locality = locality; + TLogInterface recruited(locality); recruited.initEndpoints(); std::map details; From f6a22c103523215f360e6384fb8e255df283bd5c Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Mon, 12 Mar 2018 16:56:34 -0700 Subject: [PATCH 012/127] fix: the recovery actor was holding a copy of the tlogInterface after
the tlog was removed --- fdbserver/TLogServer.actor.cpp | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index c203e28dc3..3673120bd3 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -367,7 +367,6 @@ struct LogData : NonCopyable, public ReferenceCounted { Future removed; PromiseStream> addActor; TLogData* tLogData; - Future recovery; Promise recoveryComplete; Version unrecoveredBefore; @@ -377,7 +376,7 @@ struct LogData : NonCopyable, public ReferenceCounted { explicit LogData(TLogData* tLogData, TLogInterface interf, Optional remoteTag) : tLogData(tLogData), knownCommittedVersion(0), logId(interf.id()), cc("TLog", interf.id().toString()), bytesInput("bytesInput", cc), bytesDurable("bytesDurable", cc), remoteTag(remoteTag), logSystem(new AsyncVar>()), // These are initialized differently on init() or recovery - recoveryCount(), stopped(false), initialized(false), queueCommittingVersion(0), newPersistentDataVersion(invalidVersion), recovery(Void()), unrecoveredBefore(0) + recoveryCount(), stopped(false), initialized(false), queueCommittingVersion(0), newPersistentDataVersion(invalidVersion), unrecoveredBefore(0) { startRole(interf.id(), UID(), "TLog"); @@ -1450,7 +1449,6 @@ ACTOR Future tLogCore( TLogData* self, Reference logData, TLogInt state Future warningCollector = timeoutWarningCollector( warningCollectorInput.getFuture(), 1.0, "TLogQueueCommitSlow", self->dbgid ); state Future error = actorCollection( logData->addActor.getFuture() ); - logData->addActor.send( logData->recovery ); logData->addActor.send( waitFailureServer( tli.waitFailure.getFuture()) ); logData->addActor.send( logData->removed ); //FIXME: update tlogMetrics to include new information, or possibly only have one copy for the shared instance @@ -1940,7 +1938,7 @@ ACTOR Future tLogStart( TLogData* self, InitializeTLogRequest req, Localit throw worker_removed(); } - logData->recovery = respondToRecovered( recruited, logData->recoveryComplete, recoverFromLogSystem( self, logData, req.recoverFrom, req.recoverAt, req.knownCommittedVersion, req.recoverTags, copyComplete ) ); + logData->addActor.send( respondToRecovered( recruited, logData->recoveryComplete, recoverFromLogSystem( self, logData, req.recoverFrom, req.recoverAt, req.knownCommittedVersion, req.recoverTags, copyComplete ) ) ); Void _ = wait(copyComplete.getFuture() || logData->removed ); } else { // Brand new tlog, initialization has already been done by caller From 2e741057d4a17cfb81916057f42862c3d09b6171 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Tue, 13 Mar 2018 12:59:07 -0700 Subject: [PATCH 013/127] use references instead of copying regionInfo --- fdbserver/DatabaseConfiguration.cpp | 10 +++++----- fdbserver/DatabaseConfiguration.h | 6 +++--- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/fdbserver/DatabaseConfiguration.cpp b/fdbserver/DatabaseConfiguration.cpp index c40f49a19d..b595ff7b40 100644 --- a/fdbserver/DatabaseConfiguration.cpp +++ b/fdbserver/DatabaseConfiguration.cpp @@ -128,7 +128,7 @@ void DatabaseConfiguration::setDefaultReplicationPolicy() { if(remoteTLogReplicationFactor > 0 && !remoteTLogPolicy) { remoteTLogPolicy = IRepPolicyRef(new PolicyAcross(remoteTLogReplicationFactor, "zoneid", IRepPolicyRef(new PolicyOne()))); } - for(auto r : regions) { + for(auto& r : regions) { if(r.satelliteTLogReplicationFactor > 0 && !r.satelliteTLogPolicy) { r.satelliteTLogPolicy = IRepPolicyRef(new 
PolicyAcross(r.satelliteTLogReplicationFactor, "zoneid", IRepPolicyRef(new PolicyOne()))); } @@ -163,7 +163,7 @@ bool DatabaseConfiguration::isValid() const { std::set dcIds; std::set priorities; dcIds.insert(Key()); - for(auto r : regions) { + for(auto& r : regions) { if( !(!dcIds.count(r.dcId) && !priorities.count(r.priority) && r.satelliteTLogReplicationFactor >= 0 && @@ -174,7 +174,7 @@ bool DatabaseConfiguration::isValid() const { } dcIds.insert(r.dcId); priorities.insert(r.priority); - for(auto s : r.satellites) { + for(auto& s : r.satellites) { if(dcIds.count(s.dcId)) { return false; } @@ -244,7 +244,7 @@ StatusObject DatabaseConfiguration::toJSON(bool noPolicies) const { if(regions.size()) { StatusArray regionArr; - for( auto r : regions) { + for(auto& r : regions) { StatusObject dcObj; dcObj["id"] = r.dcId.toString(); dcObj["priority"] = r.priority; @@ -272,7 +272,7 @@ StatusObject DatabaseConfiguration::toJSON(bool noPolicies) const { if(r.satellites.size()) { StatusArray satellitesArr; - for(auto s : r.satellites) { + for(auto& s : r.satellites) { StatusObject satObj; satObj["id"] = s.dcId.toString(); satObj["priority"] = s.priority; diff --git a/fdbserver/DatabaseConfiguration.h b/fdbserver/DatabaseConfiguration.h index 00f84c378d..fc412ed763 100644 --- a/fdbserver/DatabaseConfiguration.h +++ b/fdbserver/DatabaseConfiguration.h @@ -111,14 +111,14 @@ struct DatabaseConfiguration { // SOMEDAY: think about changing storageTeamSize to durableStorageQuorum int32_t minDatacentersRequired() const { int minRequired = 0; - for(auto r : regions) { + for(auto& r : regions) { minRequired += 1 + r.satellites.size(); } return minRequired; } int32_t minMachinesRequiredPerDatacenter() const { int minRequired = std::max( remoteTLogReplicationFactor, std::max(tLogReplicationFactor, storageTeamSize) ); - for(auto r : regions) { + for(auto& r : regions) { minRequired = std::max( minRequired, r.satelliteTLogReplicationFactor/std::max(1, r.satelliteTLogUsableDcs) ); } return minRequired; @@ -127,7 +127,7 @@ struct DatabaseConfiguration { //Killing an entire datacenter counts as killing one machine in modes that support it int32_t maxMachineFailuresTolerated() const { int worstSatellite = regions.size() ? 
std::numeric_limits::max() : 0; - for(auto r : regions) { + for(auto& r : regions) { worstSatellite = std::min(worstSatellite, r.satelliteTLogReplicationFactor - r.satelliteTLogWriteAntiQuorum); } if(remoteTLogReplicationFactor > 0 && worstSatellite > 0) { From 59723f51f8fbb69b4b35018f7aceb5f2ed47fb8c Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Wed, 14 Mar 2018 12:39:55 -0700 Subject: [PATCH 014/127] fix: continue to attempt to lock logs until remote logs are recovered, this is so that remote logs get locked and readers know they will not have any more data do not throttle trace events in simulation --- fdbserver/TagPartitionedLogSystem.actor.cpp | 2 +- flow/Trace.cpp | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fdbserver/TagPartitionedLogSystem.actor.cpp b/fdbserver/TagPartitionedLogSystem.actor.cpp index e1d8e8be2b..bc99bc2c90 100644 --- a/fdbserver/TagPartitionedLogSystem.actor.cpp +++ b/fdbserver/TagPartitionedLogSystem.actor.cpp @@ -969,7 +969,7 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCountedtLogs = logServers; logSystem->oldLogData = oldLogData; logSystem->logSystemType = prevState.logSystemType; - logSystem->rejoins = rejoins; + logSystem->rejoins = holdWhile( tLogReply, rejoins ); logSystem->epochEndVersion = end.get(); logSystem->knownCommittedVersion = knownCommittedVersion; diff --git a/flow/Trace.cpp b/flow/Trace.cpp index 183fb41db1..dc0bd63dd6 100644 --- a/flow/Trace.cpp +++ b/flow/Trace.cpp @@ -878,7 +878,7 @@ TraceEvent::~TraceEvent() { try { if (enabled) { // TRACE_EVENT_THROTTLER - if (severity > SevDebug && isNetworkThread()) { + if (!g_network->isSimulated() && severity > SevDebug && isNetworkThread()) { if (traceEventThrottlerCache->isAboveThreshold(StringRef((uint8_t *)type, strlen(type)))) { TraceEvent(SevWarnAlways, std::string(TRACE_EVENT_THROTTLE_STARTING_TYPE).append(type).c_str()).suppressFor(5); // Throttle Msg From 65b532658fbc9009b2343153cec1cabf5c524386 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Thu, 15 Mar 2018 10:59:30 -0700 Subject: [PATCH 015/127] added support for single region configurations --- fdbserver/ClusterController.actor.cpp | 15 +++++++++++++-- fdbserver/DatabaseConfiguration.cpp | 2 +- fdbserver/SimulatedCluster.actor.cpp | 17 ++++++++++++++--- fdbserver/masterserver.actor.cpp | 4 +++- fdbserver/workloads/ConsistencyCheck.actor.cpp | 7 ++++--- 5 files changed, 35 insertions(+), 10 deletions(-) diff --git a/fdbserver/ClusterController.actor.cpp b/fdbserver/ClusterController.actor.cpp index b509fb76a0..109c4653ea 100644 --- a/fdbserver/ClusterController.actor.cpp +++ b/fdbserver/ClusterController.actor.cpp @@ -643,6 +643,17 @@ public: } throw; } + } else if(req.configuration.regions.size() == 1) { + vector> dcPriority; + dcPriority.push_back(req.configuration.regions[0].dcId); + desiredDcIds.set(dcPriority); + auto reply = findWorkersForConfiguration(req, req.configuration.regions[0].dcId); + if(reply.isError()) { + throw reply.getError(); + } else if(clusterControllerDcId.present() && req.configuration.regions[0].dcId == clusterControllerDcId.get()) { + return reply.get(); + } + throw no_more_servers(); } else { RecruitFromConfigurationReply result; result.logRouterCount = 0; @@ -730,7 +741,7 @@ public: } void checkPrimaryDC() { - if(db.config.regions.size() && clusterControllerDcId.present() && db.config.regions[0].dcId != clusterControllerDcId.get()) { + if(db.config.regions.size() > 1 && clusterControllerDcId.present() && db.config.regions[0].dcId != clusterControllerDcId.get()) { 
try { std::map< Optional>, int> id_used; getWorkerForRoleInDatacenter(db.config.regions[0].dcId, ProcessClass::ClusterController, ProcessClass::ExcludeFit, db.config, id_used, true); @@ -854,7 +865,7 @@ public: std::set> remoteDC; RegionInfo region; - if(db.config.regions.size() > 1 && clusterControllerDcId.present()) { + if(db.config.regions.size() && clusterControllerDcId.present()) { primaryDC.insert(clusterControllerDcId); for(auto& r : db.config.regions) { if(r.dcId != clusterControllerDcId.get()) { diff --git a/fdbserver/DatabaseConfiguration.cpp b/fdbserver/DatabaseConfiguration.cpp index b595ff7b40..a2f3fae346 100644 --- a/fdbserver/DatabaseConfiguration.cpp +++ b/fdbserver/DatabaseConfiguration.cpp @@ -155,7 +155,7 @@ bool DatabaseConfiguration::isValid() const { getDesiredRemoteLogs() >= 1 && getDesiredLogRouters() >= 1 && remoteTLogReplicationFactor >= 0 && - (regions.size() == 0 || regions.size() == 2) && + regions.size() <= 2 && ( remoteTLogReplicationFactor == 0 || ( remoteTLogPolicy && regions.size() == 2 && durableStorageQuorum == storageTeamSize ) ) ) ) { return false; } diff --git a/fdbserver/SimulatedCluster.actor.cpp b/fdbserver/SimulatedCluster.actor.cpp index 95e93ea88d..d5574da9c1 100644 --- a/fdbserver/SimulatedCluster.actor.cpp +++ b/fdbserver/SimulatedCluster.actor.cpp @@ -682,7 +682,7 @@ StringRef StringRefOf(const char* s) { void SimulationConfig::generateNormalConfig(int minimumReplication) { set_config("new"); - bool generateFearless = false; //FIXME g_random->random01() < 0.5; + bool generateFearless = true; //FIXME g_random->random01() < 0.5; datacenters = generateFearless ? 4 : g_random->randomInt( 1, 4 ); if (g_random->random01() < 0.25) db.desiredTLogCount = g_random->randomInt(1,7); if (g_random->random01() < 0.25) db.masterProxyCount = g_random->randomInt(1,7); @@ -763,9 +763,10 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { remoteSatellitesArr.push_back(remoteSatelliteObj); remoteObj["satellites"] = remoteSatellitesArr; - int satellite_replication_type = 2;//FIXME: g_random->randomInt(0,5); + int satellite_replication_type = g_random->randomInt(0,5); switch (satellite_replication_type) { case 0: { + //FIXME: implement TEST( true ); // Simulated cluster using custom satellite redundancy mode break; } @@ -801,9 +802,10 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { remoteObj["satellite_logs"] = logs; } - int remote_replication_type = 2;//FIXME: g_random->randomInt(0,5); + int remote_replication_type = g_random->randomInt(0,5); switch (remote_replication_type) { case 0: { + //FIXME: implement TEST( true ); // Simulated cluster using custom remote redundancy mode break; } @@ -907,6 +909,15 @@ void setupSimulatedSystem( vector> *systemActors, std::string baseF for(auto s : simconfig.db.regions[1].satellites) { g_simulator.remoteSatelliteDcIds.push_back(s.dcId); } + } else if(simconfig.db.regions.size() == 1) { + g_simulator.primaryDcId = simconfig.db.regions[0].dcId; + g_simulator.hasSatelliteReplication = simconfig.db.regions[0].satelliteTLogReplicationFactor > 0; + g_simulator.satelliteTLogPolicy = simconfig.db.regions[0].satelliteTLogPolicy; + g_simulator.satelliteTLogWriteAntiQuorum = simconfig.db.regions[0].satelliteTLogWriteAntiQuorum; + + for(auto s : simconfig.db.regions[0].satellites) { + g_simulator.primarySatelliteDcIds.push_back(s.dcId); + } } else { g_simulator.hasSatelliteReplication = false; g_simulator.satelliteTLogWriteAntiQuorum = 0; diff --git a/fdbserver/masterserver.actor.cpp 
b/fdbserver/masterserver.actor.cpp index ac6a371277..a7797d9d93 100644 --- a/fdbserver/masterserver.actor.cpp +++ b/fdbserver/masterserver.actor.cpp @@ -554,7 +554,9 @@ ACTOR Future recruitEverything( Reference self, vectorremoteDcIds.clear(); if(recruits.dcId.present()) { self->primaryDcId.push_back(recruits.dcId); - self->remoteDcIds.push_back(recruits.dcId.get() == self->configuration.regions[0].dcId ? self->configuration.regions[1].dcId : self->configuration.regions[0].dcId); + if(self->configuration.regions.size() > 1) { + self->remoteDcIds.push_back(recruits.dcId.get() == self->configuration.regions[0].dcId ? self->configuration.regions[1].dcId : self->configuration.regions[0].dcId); + } } TraceEvent("MasterRecoveryState", self->dbgid) diff --git a/fdbserver/workloads/ConsistencyCheck.actor.cpp b/fdbserver/workloads/ConsistencyCheck.actor.cpp index 1402acfd82..e098b42f4d 100644 --- a/fdbserver/workloads/ConsistencyCheck.actor.cpp +++ b/fdbserver/workloads/ConsistencyCheck.actor.cpp @@ -1070,9 +1070,10 @@ struct ConsistencyCheckWorkload : TestWorkload } } - if((!configuration.regions.size() && missingStorage.size()) || - (configuration.regions.size() && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.regions[0].dcId) && missingStorage.count(configuration.regions[1].dcId)) || - (configuration.regions.size() && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.regions[0].dcId) || missingStorage.count(configuration.regions[1].dcId)))) { + if(( configuration.regions.size() == 0 && missingStorage.size()) || + (configuration.regions.size() == 1 && missingStorage.count(configuration.regions[0].dcId)) || + (configuration.regions.size() == 2 && configuration.remoteTLogReplicationFactor == 0 && missingStorage.count(configuration.regions[0].dcId) && missingStorage.count(configuration.regions[1].dcId)) || + (configuration.regions.size() == 2 && configuration.remoteTLogReplicationFactor > 0 && (missingStorage.count(configuration.regions[0].dcId) || missingStorage.count(configuration.regions[1].dcId)))) { self->testFailure("No storage server on worker"); return false; } From 82fb6424ec975f3e57595f68136689e1448d2ff4 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Thu, 15 Mar 2018 11:00:44 -0700 Subject: [PATCH 016/127] fix: storage recruitment could get stuck in a spin loop --- fdbserver/DataDistribution.actor.cpp | 1 + 1 file changed, 1 insertion(+) diff --git a/fdbserver/DataDistribution.actor.cpp b/fdbserver/DataDistribution.actor.cpp index 4dc6be88bb..1b5ec8e72d 100644 --- a/fdbserver/DataDistribution.actor.cpp +++ b/fdbserver/DataDistribution.actor.cpp @@ -1849,6 +1849,7 @@ ACTOR Future storageRecruiter( DDTeamCollection *self, ReferencerestartRecruiting.onTrigger() ) ) {} } + Void _ = wait( delay(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY) ); } catch( Error &e ) { if(e.code() != error_code_timed_out) { throw; From a42205eb8e48d330626833eb75381e5326650d19 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Thu, 15 Mar 2018 15:40:58 -0700 Subject: [PATCH 017/127] test running with only one region --- fdbserver/SimulatedCluster.actor.cpp | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fdbserver/SimulatedCluster.actor.cpp b/fdbserver/SimulatedCluster.actor.cpp index d5574da9c1..3ec54e071a 100644 --- a/fdbserver/SimulatedCluster.actor.cpp +++ b/fdbserver/SimulatedCluster.actor.cpp @@ -748,6 +748,8 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { StatusObject remoteObj; 
remoteObj["id"] = "1"; remoteObj["priority"] = 0; + + bool needsRemote = generateFearless; if(generateFearless) { StatusObject primarySatelliteObj; primarySatelliteObj["id"] = "2"; @@ -810,6 +812,7 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { break; } case 1: { + needsRemote = false; TEST( true ); // Simulated cluster using no remote redundancy mode break; } @@ -838,7 +841,9 @@ void SimulationConfig::generateNormalConfig(int minimumReplication) { StatusArray regionArr; regionArr.push_back(primaryObj); - regionArr.push_back(remoteObj); + if(needsRemote || g_random->random01() < 0.5) { + regionArr.push_back(remoteObj); + } set_config("regions=" + json_spirit::write_string(json_spirit::mValue(regionArr), json_spirit::Output_options::none)); } From 820382ea6880081ee02f50deeeea99950d7c59ea Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Fri, 16 Mar 2018 11:40:21 -0700 Subject: [PATCH 018/127] optimized the log router commit path to avoid re-serializing the data --- fdbserver/LogRouter.actor.cpp | 109 ++++++++---------------- fdbserver/LogSystem.h | 37 +++++--- fdbserver/LogSystemPeekCursor.actor.cpp | 32 +++++-- fdbserver/TLogServer.actor.cpp | 78 +++++++++++++++++ flow/serialize.h | 14 ++- 5 files changed, 174 insertions(+), 96 deletions(-) diff --git a/fdbserver/LogRouter.actor.cpp b/fdbserver/LogRouter.actor.cpp index 6d8abec71b..77d7146e58 100644 --- a/fdbserver/LogRouter.actor.cpp +++ b/fdbserver/LogRouter.actor.cpp @@ -83,19 +83,27 @@ struct LogRouterData { LogRouterData(UID dbgid, Tag routerTag, int logSet) : dbgid(dbgid), routerTag(routerTag), logSet(logSet), logSystem(new AsyncVar>()) {} }; -void commitMessages( LogRouterData* self, Version version, Arena arena, StringRef messages, VectorRef< TagMessagesRef > tags) { - if(!messages.size()) { +struct TagsAndMessage { + StringRef message; + std::vector tags; +}; + +void commitMessages( LogRouterData* self, Version version, const std::vector& taggedMessages ) { + if(!taggedMessages.size()) { return; } - StringRef messages1; // the first block of messages, if they aren't all stored contiguously. 
otherwise empty + int msgSize = 0; + for(auto& i : taggedMessages) { + msgSize += i.message.size(); + } // Grab the last block in the blocks list so we can share its arena // We pop all of the elements of it to create a "fresh" vector that starts at the end of the previous vector Standalone> block; if(self->messageBlocks.empty()) { block = Standalone>(); - block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, messages.size())); + block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, msgSize)); } else { block = self->messageBlocks.back().second; @@ -103,56 +111,31 @@ void commitMessages( LogRouterData* self, Version version, Arena arena, StringRe block.pop_front(block.size()); - // If the current batch of messages doesn't fit entirely in the remainder of the last block in the list - if(messages.size() + block.size() > block.capacity()) { - // Find how many messages will fit - LengthPrefixedStringRef r((uint32_t*)messages.begin()); - uint8_t const* end = messages.begin() + block.capacity() - block.size(); - while(r.toStringRef().end() <= end) { - r = LengthPrefixedStringRef( (uint32_t*)r.toStringRef().end() ); - } - - // Fill up the rest of this block - int bytes = (uint8_t*)r.getLengthPtr()-messages.begin(); - if (bytes) { - TEST(true); // Splitting commit messages across multiple blocks - messages1 = StringRef(block.end(), bytes); - block.append(block.arena(), messages.begin(), bytes); + for(auto& msg : taggedMessages) { + if(msg.message.size() > block.capacity() - block.size()) { self->messageBlocks.push_back( std::make_pair(version, block) ); - messages = messages.substr(bytes); + block = Standalone>(); + block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, msgSize)); } - // Make a new block - block = Standalone>(); - block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, messages.size())); - } + block.append(block.arena(), msg.message.begin(), msg.message.size()); + for(auto& tag : msg.tags) { + auto tsm = self->tag_data.find(tag); + if (tsm == self->tag_data.end()) { + tsm = self->tag_data.insert( mapPair(std::move(Tag(tag)), LogRouterData::TagData(Version(0), tag) ), false ); + } - // Copy messages into block - ASSERT(messages.size() <= block.capacity() - block.size()); - block.append(block.arena(), messages.begin(), messages.size()); - self->messageBlocks.push_back( std::make_pair(version, block) ); - messages = StringRef(block.end()-messages.size(), messages.size()); - - for(auto tag = tags.begin(); tag != tags.end(); ++tag) { - int64_t tagMessages = 0; - - auto tsm = self->tag_data.find(tag->tag); - if (tsm == self->tag_data.end()) { - tsm = self->tag_data.insert( mapPair(std::move(Tag(tag->tag)), LogRouterData::TagData(Version(0), tag->tag) ), false ); - } - - if (version >= tsm->value.popped) { - for(int m = 0; m < tag->messageOffsets.size(); ++m) { - int offs = tag->messageOffsets[m]; - uint8_t const* p = offs < messages1.size() ? 
messages1.begin() + offs : messages.begin() + offs - messages1.size(); - tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)p))); + if (version >= tsm->value.popped) { + tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); if(tsm->value.version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tsm->value.version_messages.back().second.expectedSize()); } - ++tagMessages; } } + + msgSize -= msg.message.size(); } + self->messageBlocks.push_back( std::make_pair(version, block) ); } ACTOR Future pullAsyncData( LogRouterData *self, Tag tag ) { @@ -181,59 +164,37 @@ ACTOR Future pullAsyncData( LogRouterData *self, Tag tag ) { } Version ver = 0; - Arena arena; - BinaryWriter wr(Unversioned()); - Map tag_offsets; + std::vector messages; while (true) { bool foundMessage = r->hasMessage(); if (!foundMessage || r->version().version != ver) { ASSERT(r->version().version > lastVer); if (ver) { - VectorRef r; - for(auto& t : tag_offsets) - r.push_back( arena, t.value ); - commitMessages(self, ver, arena, wr.toStringRef(), r); + commitMessages(self, ver, messages); self->version.set( ver ); //TraceEvent("LogRouterVersion").detail("ver",ver); } lastVer = ver; ver = r->version().version; - tag_offsets.clear(); - wr = BinaryWriter(Unversioned()); - arena = Arena(); + messages.clear(); if (!foundMessage) { ver--; //ver is the next possible version we will get data for if(ver > self->version.get()) { - commitMessages(self, ver, arena, StringRef(), VectorRef()); self->version.set( ver ); } break; } } - StringRef msg = r->getMessage(); - auto originalTags = r->getTags(); + TagsAndMessage tagAndMsg; + tagAndMsg.message = r->getMessageWithTags(); tags.clear(); - //FIXME: do we add txsTags? - self->logSystem->get()->addRemoteTags(self->logSet, originalTags, tags); - + self->logSystem->get()->addRemoteTags(self->logSet, r->getTags(), tags); for(auto t : tags) { - Tag fullTag(tagLocalityRemoteLog, t); - auto it = tag_offsets.find(fullTag); - if (it == tag_offsets.end()) { - it = tag_offsets.insert(mapPair( fullTag, TagMessagesRef() )); - it->value.tag = it->key; - } - it->value.messageOffsets.push_back( arena, wr.getLength() ); + tagAndMsg.tags.push_back(Tag(tagLocalityRemoteLog, t)); } - - //FIXME: do not reserialize tags - wr << uint32_t( msg.size() + sizeof(uint32_t) +sizeof(uint16_t) + originalTags.size()*sizeof(Tag) ) << r->version().sub << uint16_t(originalTags.size()); - for(auto t : originalTags) { - wr << t; - } - wr.serializeBytes( msg ); + messages.push_back(std::move(tagAndMsg)); r->nextMessage(); } diff --git a/fdbserver/LogSystem.h b/fdbserver/LogSystem.h index d65aa5dece..a2d7fd9e60 100644 --- a/fdbserver/LogSystem.h +++ b/fdbserver/LogSystem.h @@ -152,30 +152,35 @@ struct ILogSystem { virtual void setProtocolVersion( uint64_t version ) = 0; - //if hasMessage() returns true, getMessage() or reader() can be called. + //if hasMessage() returns true, getMessage(), getMessageWithTags(), or reader() can be called. 
//does not modify the cursor virtual bool hasMessage() = 0; //pre: only callable if hasMessage() returns true //return the tags associated with the message for teh current sequence - virtual std::vector getTags() = 0; + virtual const std::vector& getTags() = 0; //pre: only callable if hasMessage() returns true - //returns the arena containing the contents of getMessage() and reader() + //returns the arena containing the contents of getMessage(), getMessageWithTags(), and reader() virtual Arena& arena() = 0; //pre: only callable if hasMessage() returns true //returns an arena reader for the next message - //caller cannot call both getMessage() and reader() + //caller cannot call getMessage(), getMessageWithTags(), and reader() //the caller must advance the reader before calling nextMessage() virtual ArenaReader* reader() = 0; //pre: only callable if hasMessage() returns true - //caller cannot call both getMessage() and reader() + //caller cannot call getMessage(), getMessageWithTags(), and reader() //return the contents of the message for the current sequence virtual StringRef getMessage() = 0; - //pre: only callable after getMessage() or reader() + //pre: only callable if hasMessage() returns true + //caller cannot call getMessage(), getMessageWithTags(), and reader() + //return the contents of the message for the current sequence + virtual StringRef getMessageWithTags() = 0; + + //pre: only callable after getMessage(), getMessageWithTags(), or reader() //post: hasMessage() and version() have been updated //hasMessage() will never return false "in the middle" of a version (that is, if it does return false, version().subsequence will be zero) < FIXME: Can we lose this property? virtual void nextMessage() = 0; @@ -223,7 +228,7 @@ struct ILogSystem { ArenaReader rd; LogMessageVersion messageVersion, end; Version poppedVersion; - int32_t messageLength; + int32_t messageLength, rawLength; std::vector tags; bool hasMsg; Future more; @@ -237,7 +242,7 @@ struct ILogSystem { ServerPeekCursor( Reference>> const& interf, Tag tag, Version begin, Version end, bool returnIfBlocked, bool parallelGetMore ); - ServerPeekCursor( TLogPeekReply const& results, LogMessageVersion const& messageVersion, LogMessageVersion const& end, int32_t messageLength, bool hasMsg, Version poppedVersion, Tag tag ); + ServerPeekCursor( TLogPeekReply const& results, LogMessageVersion const& messageVersion, LogMessageVersion const& end, int32_t messageLength, int32_t rawLength, bool hasMsg, Version poppedVersion, Tag tag ); virtual Reference cloneNoMore(); @@ -253,7 +258,9 @@ struct ILogSystem { virtual StringRef getMessage(); - virtual std::vector getTags(); + virtual StringRef getMessageWithTags(); + + virtual const std::vector& getTags(); virtual void advanceTo(LogMessageVersion n); @@ -316,7 +323,9 @@ struct ILogSystem { virtual StringRef getMessage(); - virtual std::vector getTags(); + virtual StringRef getMessageWithTags(); + + virtual const std::vector& getTags(); virtual void advanceTo(LogMessageVersion n); @@ -374,7 +383,9 @@ struct ILogSystem { virtual StringRef getMessage(); - virtual std::vector getTags(); + virtual StringRef getMessageWithTags(); + + virtual const std::vector& getTags(); virtual void advanceTo(LogMessageVersion n); @@ -420,7 +431,9 @@ struct ILogSystem { virtual StringRef getMessage(); - virtual std::vector getTags(); + virtual StringRef getMessageWithTags(); + + virtual const std::vector& getTags(); virtual void advanceTo(LogMessageVersion n); diff --git a/fdbserver/LogSystemPeekCursor.actor.cpp 
b/fdbserver/LogSystemPeekCursor.actor.cpp index 224d35eb05..b7ade80f8f 100644 --- a/fdbserver/LogSystemPeekCursor.actor.cpp +++ b/fdbserver/LogSystemPeekCursor.actor.cpp @@ -29,8 +29,8 @@ ILogSystem::ServerPeekCursor::ServerPeekCursor( ReferencerandomUniqueID()), poppedVersion(poppedVersion), returnIfBlocked(false), sequence(0), parallelGetMore(false) +ILogSystem::ServerPeekCursor::ServerPeekCursor( TLogPeekReply const& results, LogMessageVersion const& messageVersion, LogMessageVersion const& end, int32_t messageLength, int32_t rawLength, bool hasMsg, Version poppedVersion, Tag tag ) + : results(results), tag(tag), rd(results.arena, results.messages, Unversioned()), messageVersion(messageVersion), end(end), messageLength(messageLength), rawLength(rawLength), hasMsg(hasMsg), randomID(g_random->randomUniqueID()), poppedVersion(poppedVersion), returnIfBlocked(false), sequence(0), parallelGetMore(false) { //TraceEvent("SPC_clone", randomID); this->results.maxKnownVersion = 0; @@ -41,7 +41,7 @@ ILogSystem::ServerPeekCursor::ServerPeekCursor( TLogPeekReply const& results, Lo } Reference ILogSystem::ServerPeekCursor::cloneNoMore() { - return Reference( new ILogSystem::ServerPeekCursor( results, messageVersion, end, messageLength, hasMsg, poppedVersion, tag ) ); + return Reference( new ILogSystem::ServerPeekCursor( results, messageVersion, end, messageLength, rawLength, hasMsg, poppedVersion, tag ) ); } void ILogSystem::ServerPeekCursor::setProtocolVersion( uint64_t version ) { @@ -87,12 +87,14 @@ void ILogSystem::ServerPeekCursor::nextMessage() { } uint16_t tagCount; + rd.checkpoint(); rd >> messageLength >> messageVersion.sub >> tagCount; tags.resize(tagCount); for(int i = 0; i < tagCount; i++) { rd >> tags[i]; } - messageLength -= (sizeof(messageVersion.sub) + sizeof(tagCount) +tagCount*sizeof(Tag)); + rawLength = messageLength + sizeof(messageLength); + messageLength -= (sizeof(messageVersion.sub) + sizeof(tagCount) + tagCount*sizeof(Tag)); hasMsg = true; //TraceEvent("SPC_nextMessageB", randomID).detail("messageVersion", messageVersion.toString()); } @@ -102,7 +104,12 @@ StringRef ILogSystem::ServerPeekCursor::getMessage() { return StringRef( (uint8_t const*)rd.readBytes(messageLength), messageLength); } -std::vector ILogSystem::ServerPeekCursor::getTags() { +StringRef ILogSystem::ServerPeekCursor::getMessageWithTags() { + rd.rewind(); + return StringRef( (uint8_t const*)rd.readBytes(rawLength), rawLength); +} + +const std::vector& ILogSystem::ServerPeekCursor::getTags() { return tags; } @@ -358,7 +365,10 @@ void ILogSystem::MergedPeekCursor::nextMessage() { StringRef ILogSystem::MergedPeekCursor::getMessage() { return serverCursors[currentCursor]->getMessage(); } -std::vector ILogSystem::MergedPeekCursor::getTags() { +StringRef ILogSystem::MergedPeekCursor::getMessageWithTags() { return serverCursors[currentCursor]->getMessageWithTags(); } + + +const std::vector& ILogSystem::MergedPeekCursor::getTags() { return serverCursors[currentCursor]->getTags(); } @@ -590,7 +600,9 @@ void ILogSystem::SetPeekCursor::nextMessage() { StringRef ILogSystem::SetPeekCursor::getMessage() { return serverCursors[currentSet][currentCursor]->getMessage(); } -std::vector ILogSystem::SetPeekCursor::getTags() { +StringRef ILogSystem::SetPeekCursor::getMessageWithTags() { return serverCursors[currentSet][currentCursor]->getMessageWithTags(); } + +const std::vector& ILogSystem::SetPeekCursor::getTags() { return serverCursors[currentSet][currentCursor]->getTags(); } @@ -735,7 +747,11 @@ StringRef 
ILogSystem::MultiCursor::getMessage() { return cursors.back()->getMessage(); } -std::vector ILogSystem::MultiCursor::getTags() { +StringRef ILogSystem::MultiCursor::getMessageWithTags() { + return cursors.back()->getMessageWithTags(); +} + +const std::vector& ILogSystem::MultiCursor::getTags() { return cursors.back()->getTags(); } diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index 3673120bd3..86b38b5a8b 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -671,6 +671,84 @@ ACTOR Future updateStorageLoop( TLogData* self ) { } } +struct TagsAndMessage { + StringRef message; + std::vector tags; +}; + +void commitMessages( Reference self, Version version, const std::vector& taggedMessages, int64_t& bytesInput ) { + // SOMEDAY: This method of copying messages is reasonably memory efficient, but it's still a lot of bytes copied. Find a + // way to do the memory allocation right as we receive the messages in the network layer. + + int64_t addedBytes = 0; + int64_t expectedBytes = 0; + + if(!taggedMessages.size()) { + return; + } + + int msgSize = 0; + for(auto& i : taggedMessages) { + msgSize += i.message.size(); + } + + // Grab the last block in the blocks list so we can share its arena + // We pop all of the elements of it to create a "fresh" vector that starts at the end of the previous vector + Standalone> block; + if(self->messageBlocks.empty()) { + block = Standalone>(); + block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, msgSize)); + } + else { + block = self->messageBlocks.back().second; + } + + block.pop_front(block.size()); + + for(auto& msg : taggedMessages) { + if(msg.message.size() > block.capacity() - block.size()) { + self->messageBlocks.push_back( std::make_pair(version, block) ); + addedBytes += int64_t(block.size()) * SERVER_KNOBS->TLOG_MESSAGE_BLOCK_OVERHEAD_FACTOR; + block = Standalone>(); + block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, msgSize)); + } + + block.append(block.arena(), msg.message.begin(), msg.message.size()); + for(auto& tag : msg.tags) { + auto tsm = self->tag_data.find(tag); + if (tsm == self->tag_data.end()) { + tsm = self->tag_data.insert( mapPair(std::move(Tag(tag)), LogData::TagData(Version(0), true, true, tag) ), false ); + } + + if (version >= tsm->value.popped) { + tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); + if(tsm->value.version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { + TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tsm->value.version_messages.back().second.expectedSize()); + } + if (tag != txsTag) { + expectedBytes += tsm->value.version_messages.back().second.expectedSize(); + } + } + + // The factor of VERSION_MESSAGES_OVERHEAD is intended to be an overestimate of the actual memory used to store this data in a std::deque. + // In practice, this number is probably something like 528/512 ~= 1.03, but this could vary based on the implementation. + // There will also be a fixed overhead per std::deque, but its size should be trivial relative to the size of the TLog + // queue and can be thought of as increasing the capacity of the queue slightly. 
+ addedBytes += (sizeof(std::pair) * SERVER_KNOBS->VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS) >> 10; + } + + msgSize -= msg.message.size(); + } + self->messageBlocks.push_back( std::make_pair(version, block) ); + addedBytes += int64_t(block.size()) * SERVER_KNOBS->TLOG_MESSAGE_BLOCK_OVERHEAD_FACTOR; + + self->version_sizes[version] = make_pair(expectedBytes, expectedBytes); + self->bytesInput += addedBytes; + bytesInput += addedBytes; + + //TraceEvent("TLogPushed", self->dbgid).detail("Bytes", addedBytes).detail("MessageBytes", messages.size()).detail("Tags", tags.size()).detail("expectedBytes", expectedBytes).detail("mCount", mCount).detail("tCount", tCount); +} + void commitMessages( Reference self, Version version, Arena arena, StringRef messages, VectorRef< TagMessagesRef > tags, int64_t& bytesInput) { // SOMEDAY: This method of copying messages is reasonably memory efficient, but it's still a lot of bytes copied. Find a // way to do the memory allocation right as we receive the messages in the network layer. diff --git a/flow/serialize.h b/flow/serialize.h index fa4e5e81ef..c0ef5c0a99 100644 --- a/flow/serialize.h +++ b/flow/serialize.h @@ -464,7 +464,7 @@ public: } template - ArenaReader( Arena const& arena, const StringRef& input, VersionOptions vo ) : m_pool(arena) { + ArenaReader( Arena const& arena, const StringRef& input, VersionOptions vo ) : m_pool(arena), check(NULL) { begin = (const char*)input.begin(); end = begin + input.size(); vo.read(*this); @@ -477,8 +477,18 @@ public: bool empty() const { return begin == end; } + void checkpoint() { + check = begin; + } + + void rewind() { + ASSERT(check != NULL); + begin = check; + check = NULL; + } + private: - const char *begin, *end; + const char *begin, *end, *check; Arena m_pool; uint64_t m_protocolVersion; }; From ccd70fd005dd876c3ac122af8c715c958c5bca6f Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Fri, 16 Mar 2018 16:47:05 -0700 Subject: [PATCH 019/127] The tlog uses the tags embedded in the message instead of a separate vector of locations optimized remote tlog committing to avoid re-serializing the message --- fdbclient/FDBTypes.h | 8 + fdbserver/Knobs.cpp | 1 + fdbserver/Knobs.h | 1 + fdbserver/LogRouter.actor.cpp | 5 - fdbserver/LogSystem.h | 31 +-- fdbserver/TLogInterface.h | 9 +- fdbserver/TLogServer.actor.cpp | 215 ++++++-------------- fdbserver/TagPartitionedLogSystem.actor.cpp | 2 +- 8 files changed, 83 insertions(+), 189 deletions(-) diff --git a/fdbclient/FDBTypes.h b/fdbclient/FDBTypes.h index 542c1baf03..62154caf8d 100644 --- a/fdbclient/FDBTypes.h +++ b/fdbclient/FDBTypes.h @@ -66,6 +66,14 @@ static const Tag txsTag {tagLocalitySpecial, 1}; enum { txsTagOld = -1, invalidTagOld = -100 }; +struct TagsAndMessage { + StringRef message; + std::vector tags; + + TagsAndMessage() {} + TagsAndMessage(StringRef message, const std::vector& tags) : message(message), tags(tags) {} +}; + struct KeyRangeRef; struct KeyValueRef; diff --git a/fdbserver/Knobs.cpp b/fdbserver/Knobs.cpp index f0e445f62e..689ac7462c 100644 --- a/fdbserver/Knobs.cpp +++ b/fdbserver/Knobs.cpp @@ -45,6 +45,7 @@ ServerKnobs::ServerKnobs(bool randomize, ClientKnobs* clientKnobs) { init( TLOG_PEEK_DELAY, 0.00005 ); init( LEGACY_TLOG_UPGRADE_ENTRIES_PER_VERSION, 100 ); init( VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS, 1072 ); // Based on a naive interpretation of the gcc version of std::deque, we would expect this to be 16 bytes overhead per 512 bytes data. In practice, it seems to be 24 bytes overhead per 512. 
+ init( VERSION_MESSAGES_ENTRY_BYTES_WITH_OVERHEAD, std::ceil(16.0 * VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS / 1024) ); init( LOG_SYSTEM_PUSHED_DATA_BLOCK_SIZE, 1e5 ); init( MAX_MESSAGE_SIZE, std::max(LOG_SYSTEM_PUSHED_DATA_BLOCK_SIZE, 1e5 + 2e4 + 1) + 8 ); // VALUE_SIZE_LIMIT + SYSTEM_KEY_SIZE_LIMIT + 9 bytes (4 bytes for length, 4 bytes for sequence number, and 1 byte for mutation type) init( TLOG_MESSAGE_BLOCK_BYTES, 10e6 ); diff --git a/fdbserver/Knobs.h b/fdbserver/Knobs.h index 5016a762aa..3e9301f3ca 100644 --- a/fdbserver/Knobs.h +++ b/fdbserver/Knobs.h @@ -50,6 +50,7 @@ public: double TLOG_PEEK_DELAY; int LEGACY_TLOG_UPGRADE_ENTRIES_PER_VERSION; int VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS; // Multiplicative factor to bound total space used to store a version message (measured in 1/1024ths, e.g. a value of 2048 yields a factor of 2). + int64_t VERSION_MESSAGES_ENTRY_BYTES_WITH_OVERHEAD; double TLOG_MESSAGE_BLOCK_OVERHEAD_FACTOR; int64_t TLOG_MESSAGE_BLOCK_BYTES; int64_t MAX_MESSAGE_SIZE; diff --git a/fdbserver/LogRouter.actor.cpp b/fdbserver/LogRouter.actor.cpp index 77d7146e58..995927bfb4 100644 --- a/fdbserver/LogRouter.actor.cpp +++ b/fdbserver/LogRouter.actor.cpp @@ -83,11 +83,6 @@ struct LogRouterData { LogRouterData(UID dbgid, Tag routerTag, int logSet) : dbgid(dbgid), routerTag(routerTag), logSet(logSet), logSystem(new AsyncVar>()) {} }; -struct TagsAndMessage { - StringRef message; - std::vector tags; -}; - void commitMessages( LogRouterData* self, Version version, const std::vector& taggedMessages ) { if(!taggedMessages.size()) { return; diff --git a/fdbserver/LogSystem.h b/fdbserver/LogSystem.h index a2d7fd9e60..d8c7112199 100644 --- a/fdbserver/LogSystem.h +++ b/fdbserver/LogSystem.h @@ -569,16 +569,13 @@ struct LogPushData : NonCopyable { // Log subsequences have to start at 1 (the MergedPeekCursor relies on this to make sure we never have !hasMessage() in the middle of data for a version explicit LogPushData(Reference logSystem) : logSystem(logSystem), subsequence(1) { - int totalSize = 0; for(auto& log : logSystem->getLogSystemConfig().tLogs) { if(log.isLocal) { - totalSize += log.tLogs.size(); + for(int i = 0; i < log.tLogs.size(); i++) { + messagesWriter.push_back( BinaryWriter( AssumeVersion(currentProtocolVersion) ) ); + } } } - tags.resize( totalSize ); - for(int i = 0; i < tags.size(); i++) { - messagesWriter.push_back( BinaryWriter( AssumeVersion(currentProtocolVersion) ) ); - } } // addTag() adds a tag for the *next* message to be added @@ -601,9 +598,6 @@ struct LogPushData : NonCopyable { } uint32_t subseq = this->subsequence++; for(int loc : msg_locations) { - for(auto& tag : prev_tags) - addTagToLoc( tag, loc ); - messagesWriter[loc] << uint32_t(rawMessageWithoutLength.size() + sizeof(subseq) + sizeof(uint16_t) + sizeof(Tag)*prev_tags.size()) << subseq << uint16_t(prev_tags.size()); for(auto& tag : prev_tags) messagesWriter[loc] << tag; @@ -625,9 +619,6 @@ struct LogPushData : NonCopyable { uint32_t subseq = this->subsequence++; for(int loc : msg_locations) { - for(auto& tag : prev_tags) - addTagToLoc( tag, loc ); - // FIXME: memcpy after the first time BinaryWriter& wr = messagesWriter[loc]; int offset = wr.getLength(); @@ -644,28 +635,12 @@ struct LogPushData : NonCopyable { StringRef getMessages(int loc) { return StringRef( arena, messagesWriter[loc].toStringRef() ); // FIXME: Unnecessary copy! 
} - VectorRef getTags(int loc) { - VectorRef r; - for(auto& t : tags[loc]) - r.push_back( arena, t.value ); - return r; - } private: - void addTagToLoc( Tag tag, int loc ) { - auto it = tags[loc].find(tag); - if (it == tags[loc].end()) { - it = tags[loc].insert(mapPair( tag, TagMessagesRef() )); - it->value.tag = it->key; - } - it->value.messageOffsets.push_back( arena, messagesWriter[loc].getLength() ); - } - Reference logSystem; Arena arena; vector next_message_tags; vector prev_tags; - vector> tags; vector messagesWriter; vector msg_locations; uint32_t subsequence; diff --git a/fdbserver/TLogInterface.h b/fdbserver/TLogInterface.h index b3e4c397b5..a8fe3ee5aa 100644 --- a/fdbserver/TLogInterface.h +++ b/fdbserver/TLogInterface.h @@ -200,18 +200,17 @@ struct TLogCommitRequest { Arena arena; Version prevVersion, version, knownCommittedVersion; - StringRef messages; // Each message prefixed by a 4-byte length - VectorRef< TagMessagesRef > tags; + StringRef messages;// Each message prefixed by a 4-byte length ReplyPromise reply; Optional debugID; TLogCommitRequest() {} - TLogCommitRequest( const Arena& a, Version prevVersion, Version version, Version knownCommittedVersion, StringRef messages, VectorRef< TagMessagesRef > tags, Optional debugID ) - : arena(a), prevVersion(prevVersion), version(version), knownCommittedVersion(knownCommittedVersion), messages(messages), tags(tags), debugID(debugID) {} + TLogCommitRequest( const Arena& a, Version prevVersion, Version version, Version knownCommittedVersion, StringRef messages, Optional debugID ) + : arena(a), prevVersion(prevVersion), version(version), knownCommittedVersion(knownCommittedVersion), messages(messages), debugID(debugID) {} template void serialize( Ar& ar ) { - ar & prevVersion & version & knownCommittedVersion & messages & tags & reply & arena & debugID; + ar & prevVersion & version & knownCommittedVersion & messages & reply & arena & debugID; } }; diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index 86b38b5a8b..a0bd0f146e 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -49,25 +49,46 @@ struct TLogQueueEntryRef { Version version; Version knownCommittedVersion; StringRef messages; - VectorRef< TagMessagesRef > tags; TLogQueueEntryRef() : version(0), knownCommittedVersion(0) {} TLogQueueEntryRef(Arena &a, TLogQueueEntryRef const &from) - : version(from.version), knownCommittedVersion(from.knownCommittedVersion), id(from.id), messages(a, from.messages), tags(a, from.tags) { + : version(from.version), knownCommittedVersion(from.knownCommittedVersion), id(from.id), messages(a, from.messages) { } template void serialize(Ar& ar) { - if( ar.protocolVersion() >= 0x0FDB00A460010001) { - ar & version & messages & tags & knownCommittedVersion & id; - } else if(ar.isDeserializing) { - ar & version & messages & tags; - knownCommittedVersion = 0; - id = UID(); - } + ar & version & messages & knownCommittedVersion & id; } size_t expectedSize() const { - return messages.expectedSize() + tags.expectedSize(); + return messages.expectedSize(); + } +}; + +struct AlternativeTLogQueueEntryRef { + UID id; + Version version; + Version knownCommittedVersion; + std::vector* alternativeMessages; + + AlternativeTLogQueueEntryRef() : version(0), knownCommittedVersion(0), alternativeMessages(NULL) {} + + template + void serialize(Ar& ar) { + ASSERT(!ar.isDeserializing && alternativeMessages); + uint32_t msgSize = expectedSize(); + ar & version & msgSize; + for(auto& msg : *alternativeMessages) { + 
ar.serializeBytes( msg.message ); + } + ar & knownCommittedVersion & id; + } + + uint32_t expectedSize() const { + uint32_t msgSize = 0; + for(auto& msg : *alternativeMessages) { + msgSize += msg.message.size(); + } + return msgSize; } }; @@ -94,7 +115,8 @@ public: return readNext( this ); } - void push( TLogQueueEntryRef const& qe ) { + template + void push( T const& qe ) { BinaryWriter wr( Unversioned() ); // outer framing is not versioned wr << uint32_t(0); IncludeVersion().write(wr); // payload is versioned @@ -316,7 +338,7 @@ struct LogData : NonCopyable, public ReferenceCounted { self->version_messages.pop_front(); } - int64_t bytesErased = (messagesErased * sizeof(std::pair) * SERVER_KNOBS->VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS) >> 10; + int64_t bytesErased = messagesErased * SERVER_KNOBS->VERSION_MESSAGES_ENTRY_BYTES_WITH_OVERHEAD; tlogData->bytesDurable += bytesErased; *gBytesErased += bytesErased; Void _ = wait(yield(taskID)); @@ -671,11 +693,6 @@ ACTOR Future updateStorageLoop( TLogData* self ) { } } -struct TagsAndMessage { - StringRef message; - std::vector tags; -}; - void commitMessages( Reference self, Version version, const std::vector& taggedMessages, int64_t& bytesInput ) { // SOMEDAY: This method of copying messages is reasonably memory efficient, but it's still a lot of bytes copied. Find a // way to do the memory allocation right as we receive the messages in the network layer. @@ -728,13 +745,13 @@ void commitMessages( Reference self, Version version, const std::vector if (tag != txsTag) { expectedBytes += tsm->value.version_messages.back().second.expectedSize(); } - } - // The factor of VERSION_MESSAGES_OVERHEAD is intended to be an overestimate of the actual memory used to store this data in a std::deque. - // In practice, this number is probably something like 528/512 ~= 1.03, but this could vary based on the implementation. - // There will also be a fixed overhead per std::deque, but its size should be trivial relative to the size of the TLog - // queue and can be thought of as increasing the capacity of the queue slightly. - addedBytes += (sizeof(std::pair) * SERVER_KNOBS->VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS) >> 10; + // The factor of VERSION_MESSAGES_OVERHEAD is intended to be an overestimate of the actual memory used to store this data in a std::deque. + // In practice, this number is probably something like 528/512 ~= 1.03, but this could vary based on the implementation. + // There will also be a fixed overhead per std::deque, but its size should be trivial relative to the size of the TLog + // queue and can be thought of as increasing the capacity of the queue slightly. + addedBytes += SERVER_KNOBS->VERSION_MESSAGES_ENTRY_BYTES_WITH_OVERHEAD; + } } msgSize -= msg.message.size(); @@ -749,99 +766,26 @@ void commitMessages( Reference self, Version version, const std::vector //TraceEvent("TLogPushed", self->dbgid).detail("Bytes", addedBytes).detail("MessageBytes", messages.size()).detail("Tags", tags.size()).detail("expectedBytes", expectedBytes).detail("mCount", mCount).detail("tCount", tCount); } -void commitMessages( Reference self, Version version, Arena arena, StringRef messages, VectorRef< TagMessagesRef > tags, int64_t& bytesInput) { - // SOMEDAY: This method of copying messages is reasonably memory efficient, but it's still a lot of bytes copied. Find a - // way to do the memory allocation right as we receive the messages in the network layer. 
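The templated push() together with AlternativeTLogQueueEntryRef lets the remote-tLog path enqueue an entry whose messages are still held as separate fragments, without first concatenating them into one StringRef. A hedged, simplified sketch of the idea (types and names here are illustrative, not the real queue classes):

```
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Two entry shapes with the same duck-typed "serialize into a buffer" interface,
// standing in for TLogQueueEntryRef and AlternativeTLogQueueEntryRef.
struct ContiguousEntry {
    int64_t version;
    std::string messages;                       // already one contiguous blob
    void serialize(std::string& out) const { out += messages; }
};

struct FragmentedEntry {
    int64_t version;
    const std::vector<std::string>* fragments;  // not yet concatenated
    void serialize(std::string& out) const {
        for (const auto& f : *fragments) out += f;   // stream fragments directly
    }
};

struct Queue {
    std::vector<std::string> frames;
    // Templated push: either entry type works without a shared base class, and the
    // fragmented form never has to be copied into one buffer before being queued.
    template <class Entry>
    void push(const Entry& e) {
        std::string frame;
        e.serialize(frame);
        frames.push_back(std::move(frame));
    }
};

int main() {
    Queue q;
    q.push(ContiguousEntry{1, "abc"});
    std::vector<std::string> parts = {"de", "f"};
    q.push(FragmentedEntry{2, &parts});
    std::cout << q.frames[0] << q.frames[1] << "\n";   // prints "abcdef"
    return 0;
}
```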
- - int64_t addedBytes = 0; - int64_t expectedBytes = 0; - - if(!messages.size()) { - return; - } - - StringRef messages1; // the first block of messages, if they aren't all stored contiguously. otherwise empty - - // Grab the last block in the blocks list so we can share its arena - // We pop all of the elements of it to create a "fresh" vector that starts at the end of the previous vector - Standalone> block; - if(self->messageBlocks.empty()) { - block = Standalone>(); - block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, messages.size())); - } - else { - block = self->messageBlocks.back().second; - } - - block.pop_front(block.size()); - - // If the current batch of messages doesn't fit entirely in the remainder of the last block in the list - if(messages.size() + block.size() > block.capacity()) { - // Find how many messages will fit - LengthPrefixedStringRef r((uint32_t*)messages.begin()); - uint8_t const* end = messages.begin() + block.capacity() - block.size(); - while(r.toStringRef().end() <= end) { - r = LengthPrefixedStringRef( (uint32_t*)r.toStringRef().end() ); +void commitMessages( Reference self, Version version, Arena arena, StringRef messages, int64_t& bytesInput ) { + ArenaReader rd( arena, messages, Unversioned() ); + int32_t messageLength, rawLength; + uint16_t tagCount; + uint32_t sub; + std::vector msgs; + while(!rd.empty()) { + TagsAndMessage tagsAndMsg; + rd.checkpoint(); + rd >> messageLength >> sub >> tagCount; + tagsAndMsg.tags.resize(tagCount); + for(int i = 0; i < tagCount; i++) { + rd >> tagsAndMsg.tags[i]; } - - // Fill up the rest of this block - int bytes = (uint8_t*)r.getLengthPtr()-messages.begin(); - if (bytes) { - TEST(true); // Splitting commit messages across multiple blocks - messages1 = StringRef(block.end(), bytes); - block.append(block.arena(), messages.begin(), bytes); - self->messageBlocks.push_back( std::make_pair(version, block) ); - addedBytes += int64_t(block.size()) * SERVER_KNOBS->TLOG_MESSAGE_BLOCK_OVERHEAD_FACTOR; - messages = messages.substr(bytes); - } - - // Make a new block - block = Standalone>(); - block.reserve(block.arena(), std::max(SERVER_KNOBS->TLOG_MESSAGE_BLOCK_BYTES, messages.size())); + rawLength = messageLength + sizeof(messageLength); + rd.rewind(); + tagsAndMsg.message = StringRef((uint8_t const*)rd.readBytes(rawLength), rawLength); + msgs.push_back(std::move(tagsAndMsg)); } - - // Copy messages into block - ASSERT(messages.size() <= block.capacity() - block.size()); - block.append(block.arena(), messages.begin(), messages.size()); - self->messageBlocks.push_back( std::make_pair(version, block) ); - addedBytes += int64_t(block.size()) * SERVER_KNOBS->TLOG_MESSAGE_BLOCK_OVERHEAD_FACTOR; - messages = StringRef(block.end()-messages.size(), messages.size()); - - for(auto tag = tags.begin(); tag != tags.end(); ++tag) { - int64_t tagMessages = 0; - - auto tsm = self->tag_data.find(tag->tag); - if (tsm == self->tag_data.end()) { - tsm = self->tag_data.insert( mapPair(std::move(Tag(tag->tag)), LogData::TagData(Version(0), true, true, tag->tag) ), false ); - } - - if (version >= tsm->value.popped) { - for(int m = 0; m < tag->messageOffsets.size(); ++m) { - int offs = tag->messageOffsets[m]; - uint8_t const* p = offs < messages1.size() ? 
messages1.begin() + offs : messages.begin() + offs - messages1.size(); - tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)p))); - if(tsm->value.version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { - TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tsm->value.version_messages.back().second.expectedSize()); - } - if (tag->tag != txsTag) - expectedBytes += tsm->value.version_messages.back().second.expectedSize(); - - ++tagMessages; - } - } - - // The factor of VERSION_MESSAGES_OVERHEAD is intended to be an overestimate of the actual memory used to store this data in a std::deque. - // In practice, this number is probably something like 528/512 ~= 1.03, but this could vary based on the implementation. - // There will also be a fixed overhead per std::deque, but its size should be trivial relative to the size of the TLog - // queue and can be thought of as increasing the capacity of the queue slightly. - addedBytes += (tagMessages * sizeof(std::pair) * SERVER_KNOBS->VERSION_MESSAGES_OVERHEAD_FACTOR_1024THS) >> 10; - } - - self->version_sizes[version] = make_pair(expectedBytes, expectedBytes); - self->bytesInput += addedBytes; - bytesInput += addedBytes; - - //TraceEvent("TLogPushed", self->dbgid).detail("Bytes", addedBytes).detail("MessageBytes", messages.size()).detail("Tags", tags.size()).detail("expectedBytes", expectedBytes).detail("mCount", mCount).detail("tCount", tCount); + commitMessages(self, version, msgs, bytesInput); } Version poppedVersion( Reference self, Tag tag) { @@ -1147,14 +1091,13 @@ ACTOR Future tLogCommit( g_traceBatch.addEvent("CommitDebug", tlogDebugID.get().first(), "TLog.tLogCommit.Before"); TraceEvent("TLogCommit", logData->logId).detail("Version", req.version); - commitMessages(logData, req.version, req.arena, req.messages, req.tags, self->bytesInput); + commitMessages(logData, req.version, req.arena, req.messages, self->bytesInput); // Log the changes to the persistent queue, to be committed by commitQueue() TLogQueueEntryRef qe; qe.version = req.version; qe.knownCommittedVersion = req.knownCommittedVersion; qe.messages = req.messages; - qe.tags = req.tags; qe.id = logData->logId; self->persistentQueue->push( qe ); @@ -1416,25 +1359,19 @@ ACTOR Future pullAsyncData( TLogData* self, Reference logData, Ta } Version ver = 0; - Arena arena; - BinaryWriter wr(Unversioned()); - Map tag_offsets; + std::vector messages; while (true) { bool foundMessage = r->hasMessage(); if (!foundMessage || r->version().version != ver) { ASSERT(r->version().version > lastVer); if (ver) { - VectorRef r; - for(auto& t : tag_offsets) - r.push_back( arena, t.value ); - commitMessages(logData, ver, arena, wr.toStringRef(), r, self->bytesInput); + commitMessages(logData, ver, messages, self->bytesInput); // Log the changes to the persistent queue, to be committed by commitQueue() - TLogQueueEntryRef qe; + AlternativeTLogQueueEntryRef qe; qe.version = ver; qe.knownCommittedVersion = 0; - qe.messages = wr.toStringRef(); - qe.tags = r; + qe.alternativeMessages = &messages; qe.id = logData->logId; self->persistentQueue->push( qe ); @@ -1450,21 +1387,15 @@ ACTOR Future pullAsyncData( TLogData* self, Reference logData, Ta } lastVer = ver; ver = r->version().version; - tag_offsets = Map(); - wr = BinaryWriter(Unversioned()); - arena = Arena(); if (!foundMessage) { ver--; if(ver > logData->version.get()) { - commitMessages(logData, ver, arena, StringRef(), VectorRef(), self->bytesInput); - // Log the changes to the 
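The new commitMessages above splits a block of frames back into TagsAndMessage entries using the ArenaReader checkpoint()/rewind() methods added earlier: it peeks at the header to learn the tags and total length, then rewinds so the whole frame (length prefix included) can be kept verbatim and never re-serialized. A standalone sketch of the same parse, with checkpoint/rewind emulated by a saved offset and the Tag struct being the same stand-in as in the framing sketch above:

```
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Tag { int8_t locality; uint16_t id; };

// A tagged message that still points at the original framed bytes (header included).
struct TagsAndMessage {
    std::vector<Tag> tags;
    const char* message;   // start of the frame, length prefix included
    size_t messageSize;
};

// Split a block of frames written as [uint32 length][uint32 subseq][uint16 count][tags][payload].
std::vector<TagsAndMessage> splitMessages(const std::string& block) {
    std::vector<TagsAndMessage> out;
    size_t pos = 0;
    while (pos < block.size()) {
        size_t checkpoint = pos;                       // rd.checkpoint()
        uint32_t length, subseq; uint16_t count;
        std::memcpy(&length, block.data() + pos, sizeof(length));  pos += sizeof(length);
        std::memcpy(&subseq, block.data() + pos, sizeof(subseq));  pos += sizeof(subseq);
        std::memcpy(&count,  block.data() + pos, sizeof(count));   pos += sizeof(count);

        TagsAndMessage m;
        m.tags.resize(count);
        for (uint16_t i = 0; i < count; i++) {
            std::memcpy(&m.tags[i], block.data() + pos, sizeof(Tag));
            pos += sizeof(Tag);
        }
        size_t rawLength = length + sizeof(length);    // the whole frame, prefix included
        pos = checkpoint;                              // rd.rewind()
        m.message = block.data() + pos;
        m.messageSize = rawLength;
        pos += rawLength;                              // consume the frame verbatim
        out.push_back(m);
    }
    return out;
}

int main() {
    // Build one frame by hand: payload "hi" with a single tag.
    std::string block;
    Tag t{0, 5};
    uint32_t subseq = 1; uint16_t count = 1;
    uint32_t length = uint32_t(2 + sizeof(subseq) + sizeof(count) + sizeof(Tag));
    auto put = [&block](const void* p, size_t n) { block.append((const char*)p, n); };
    put(&length, 4); put(&subseq, 4); put(&count, 2); put(&t, sizeof(t)); block += "hi";

    auto msgs = splitMessages(block);
    assert(msgs.size() == 1 && msgs[0].tags.size() == 1 && msgs[0].messageSize == block.size());
    return 0;
}
```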
persistent queue, to be committed by commitQueue() TLogQueueEntryRef qe; qe.version = ver; qe.knownCommittedVersion = 0; qe.messages = StringRef(); - qe.tags = VectorRef(); qe.id = logData->logId; self->persistentQueue->push( qe ); @@ -1482,23 +1413,7 @@ ACTOR Future pullAsyncData( TLogData* self, Reference logData, Ta } } - StringRef msg = r->getMessage(); - auto tags = r->getTags(); - for(auto tag : tags) { - auto it = tag_offsets.find(tag); - if (it == tag_offsets.end()) { - it = tag_offsets.insert(mapPair( tag, TagMessagesRef() )); - it->value.tag = it->key; - } - it->value.messageOffsets.push_back( arena, wr.getLength() ); - } - - //FIXME: do not reserialize tag data - wr << uint32_t( msg.size() + sizeof(uint32_t) + sizeof(uint16_t) + tags.size()*sizeof(Tag) ) << r->version().sub << uint16_t(tags.size()); - for(auto t : tags) { - wr << t; - } - wr.serializeBytes( msg ); + messages.push_back( TagsAndMessage(r->getMessageWithTags(), r->getTags()) ); r->nextMessage(); } @@ -1719,7 +1634,7 @@ ACTOR Future restorePersistentState( TLogData* self, LocalityData locality if(logData) { logData->knownCommittedVersion = std::max(logData->knownCommittedVersion, qe.knownCommittedVersion); if( qe.version > logData->version.get() ) { - commitMessages(logData, qe.version, qe.arena(), qe.messages, qe.tags, self->bytesInput); + commitMessages(logData, qe.version, qe.arena(), qe.messages, self->bytesInput); logData->version.set( qe.version ); logData->queueCommittedVersion.set( qe.version ); diff --git a/fdbserver/TagPartitionedLogSystem.actor.cpp b/fdbserver/TagPartitionedLogSystem.actor.cpp index bc99bc2c90..4ae34daf88 100644 --- a/fdbserver/TagPartitionedLogSystem.actor.cpp +++ b/fdbserver/TagPartitionedLogSystem.actor.cpp @@ -346,7 +346,7 @@ struct TagPartitionedLogSystem : ILogSystem, ReferenceCountedlogServers.size(); loc++) { Future commitMessage = reportTLogCommitErrors( it->logServers[loc]->get().interf().commit.getReply( - TLogCommitRequest( data.getArena(), prevVersion, version, knownCommittedVersion, data.getMessages(location), data.getTags(location), debugID ), TaskTLogCommitReply ), + TLogCommitRequest( data.getArena(), prevVersion, version, knownCommittedVersion, data.getMessages(location), debugID ), TaskTLogCommitReply ), getDebugID()); actors.add(commitMessage); tLogCommitResults.push_back(commitMessage); From 26b93ff9208d0dba7aa5e43170e5373464e01d96 Mon Sep 17 00:00:00 2001 From: Yichi Chiang Date: Tue, 20 Feb 2018 13:22:31 -0800 Subject: [PATCH 020/127] Share log mutations between backups and DRs which have the same backup range --- fdbclient/BackupAgent.h | 26 +- fdbclient/BackupAgentBase.actor.cpp | 75 +++- fdbclient/DatabaseBackupAgent.actor.cpp | 360 ++++++++++++++---- fdbclient/FileBackupAgent.actor.cpp | 204 +++++++++- fdbclient/Knobs.cpp | 1 + fdbclient/Knobs.h | 1 + fdbclient/SystemData.cpp | 11 +- fdbclient/SystemData.h | 5 +- fdbrpc/simulator.h | 3 +- fdbserver/SimulatedCluster.actor.cpp | 23 +- fdbserver/tester.actor.cpp | 16 +- .../workloads/AtomicSwitchover.actor.cpp | 4 +- .../workloads/BackupCorrectness.actor.cpp | 93 +++-- fdbserver/workloads/BackupToDBAbort.actor.cpp | 4 +- .../workloads/BackupToDBCorrectness.actor.cpp | 82 ++-- fdbserver/workloads/workloads.h | 4 +- tests/slow/SharedBackupCorrectness.txt | 25 ++ 17 files changed, 749 insertions(+), 188 deletions(-) create mode 100644 tests/slow/SharedBackupCorrectness.txt diff --git a/fdbclient/BackupAgent.h b/fdbclient/BackupAgent.h index cf0a343a0c..54e04ce7cb 100644 --- a/fdbclient/BackupAgent.h +++ 
b/fdbclient/BackupAgent.h @@ -46,6 +46,7 @@ public: static const Key keyFolderId; static const Key keyBeginVersion; static const Key keyEndVersion; + static const Key keyPrevBeginVersion; static const Key keyConfigBackupTag; static const Key keyConfigLogUid; static const Key keyConfigBackupRanges; @@ -55,6 +56,8 @@ public: static const Key keyLastUid; static const Key keyBeginKey; static const Key keyEndKey; + static const Key destUid; + static const Key backupDone; static const Key keyTagName; static const Key keyStates; @@ -353,6 +356,11 @@ public: return runRYWTransaction(cx, [=](Reference tr){ return getStateValue(tr, logUid); }); } + Future getDestUid(Reference tr, UID logUid); + Future getDestUid(Database cx, UID logUid) { + return runRYWTransaction(cx, [=](Reference tr){ return getDestUid(tr, logUid); }); + } + Future getLogUid(Reference tr, Key tagName); Future getLogUid(Database cx, Key tagName) { return runRYWTransaction(cx, [=](Reference tr){ return getLogUid(tr, tagName); }); @@ -410,8 +418,9 @@ struct RCGroup { bool copyParameter(Reference source, Reference dest, Key key); Version getVersionFromString(std::string const& value); -Standalone> getLogRanges(Version beginVersion, Version endVersion, Key backupUid, int blockSize = CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE); +Standalone> getLogRanges(Version beginVersion, Version endVersion, Key destUidValue, int blockSize = CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE); Standalone> getApplyRanges(Version beginVersion, Version endVersion, Key backupUid); +Future clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion); Key getApplyKey( Version version, Key backupUid ); std::pair decodeBKMutationLogKey(Key key); Standalone> decodeBackupLogValue(StringRef value); @@ -500,9 +509,14 @@ public: KeyBackedConfig(StringRef prefix, Reference task) : KeyBackedConfig(prefix, TaskParams.uid().get(task)) {} - Future toTask(Reference tr, Reference task) { + Future toTask(Reference tr, Reference task, bool setValidation = true) { // Set the uid task parameter TaskParams.uid().set(task, uid); + + if (!setValidation) { + return Void(); + } + // Set the validation condition for the task which is that the restore uid's tag's uid is the same as the restore uid. // Get this uid's tag, then get the KEY for the tag's uid but don't read it. That becomes the validation key // which TaskBucket will check, and its value must be this restore config's uid. 
@@ -701,6 +715,10 @@ public: return configSpace.pack(LiteralStringRef(__FUNCTION__)); } + KeyBackedProperty destUidValue() { + return configSpace.pack(LiteralStringRef(__FUNCTION__)); + } + Future> getLatestRestorableVersion(Reference tr) { tr->setOption(FDBTransactionOptions::READ_SYSTEM_KEYS); tr->setOption(FDBTransactionOptions::READ_LOCK_AWARE); @@ -720,8 +738,8 @@ public: return configSpace.pack(LiteralStringRef(__FUNCTION__)); } - void startMutationLogs(Reference tr, KeyRangeRef backupRange) { - Key mutationLogsDestKey = uidPrefixKey(backupLogKeys.begin, getUid()); + void startMutationLogs(Reference tr, KeyRangeRef backupRange, Key destUidValue) { + Key mutationLogsDestKey = destUidValue.withPrefix(backupLogKeys.begin); tr->set(logRangesEncodeKey(backupRange.begin, getUid()), logRangesEncodeValue(backupRange.end, mutationLogsDestKey)); } diff --git a/fdbclient/BackupAgentBase.actor.cpp b/fdbclient/BackupAgentBase.actor.cpp index 39d50c839c..c33d131101 100644 --- a/fdbclient/BackupAgentBase.actor.cpp +++ b/fdbclient/BackupAgentBase.actor.cpp @@ -25,6 +25,7 @@ const Key BackupAgentBase::keyFolderId = LiteralStringRef("config_folderid"); const Key BackupAgentBase::keyBeginVersion = LiteralStringRef("beginVersion"); const Key BackupAgentBase::keyEndVersion = LiteralStringRef("endVersion"); +const Key BackupAgentBase::keyPrevBeginVersion = LiteralStringRef("prevBeginVersion"); const Key BackupAgentBase::keyConfigBackupTag = LiteralStringRef("config_backup_tag"); const Key BackupAgentBase::keyConfigLogUid = LiteralStringRef("config_log_uid"); const Key BackupAgentBase::keyConfigBackupRanges = LiteralStringRef("config_backup_ranges"); @@ -34,6 +35,8 @@ const Key BackupAgentBase::keyStateStatus = LiteralStringRef("state_status"); const Key BackupAgentBase::keyLastUid = LiteralStringRef("last_uid"); const Key BackupAgentBase::keyBeginKey = LiteralStringRef("beginKey"); const Key BackupAgentBase::keyEndKey = LiteralStringRef("endKey"); +const Key BackupAgentBase::destUid = LiteralStringRef("destUid"); +const Key BackupAgentBase::backupDone = LiteralStringRef("backupDone"); const Key BackupAgentBase::keyTagName = LiteralStringRef("tagname"); const Key BackupAgentBase::keyStates = LiteralStringRef("state"); @@ -68,12 +71,12 @@ Version getVersionFromString(std::string const& value) { // \xff / bklog / keyspace in a funny order for performance reasons. // Return the ranges of keys that contain the data for the given range // of versions. 
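The startMutationLogs change above is where sharing begins: the mutation-log destination is now prefixed by a destUid shared by every backup/DR over the same range, instead of by each backup's own uid, while each consumer's read progress is tracked per logUid under backupLatestVersionsPrefix. A purely illustrative sketch of that key layout (the literal prefixes and separators below are readable stand-ins, not the real binary keys):

```
#include <iostream>
#include <string>

int main() {
    // Stand-ins for backupLogKeys.begin and backupLatestVersionsPrefix from the patch.
    std::string backupLogPrefix      = "blog/";
    std::string latestVersionsPrefix = "backupVersions/";

    std::string destUid = "D";   // one per distinct backup range, shared by all consumers
    std::string logUidA = "A";   // one per backup / DR instance
    std::string logUidB = "B";

    // Before the patch each logUid had its own copy of the mutation log; afterwards
    // both A and B read from the single destUid-prefixed log...
    std::string sharedMutationLog = backupLogPrefix + destUid;

    // ...while each consumer records how far it has read under its own logUid
    // (prefix, then destUid, then logUid, mirroring the concatenation order in the patch).
    std::string progressA = latestVersionsPrefix + destUid + "/" + logUidA;
    std::string progressB = latestVersionsPrefix + destUid + "/" + logUidB;

    std::cout << sharedMutationLog << "\n" << progressA << "\n" << progressB << "\n";
    return 0;
}
```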
-Standalone> getLogRanges(Version beginVersion, Version endVersion, Key backupUid, int blockSize) { +Standalone> getLogRanges(Version beginVersion, Version endVersion, Key destUidValue, int blockSize) { Standalone> ret; - Key baLogRangePrefix = backupUid.withPrefix(backupLogKeys.begin); + Key baLogRangePrefix = destUidValue.withPrefix(backupLogKeys.begin); - //TraceEvent("getLogRanges").detail("backupUid", backupUid).detail("prefix", printable(StringRef(baLogRangePrefix))); + //TraceEvent("getLogRanges").detail("destUidValue", destUidValue).detail("prefix", printable(StringRef(baLogRangePrefix))); for (int64_t vblock = beginVersion / blockSize; vblock < (endVersion + blockSize - 1) / blockSize; ++vblock) { int64_t tb = vblock * blockSize / CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE; @@ -619,3 +622,69 @@ ACTOR Future applyMutations(Database cx, Key uid, Key addPrefix, Key remov throw; } } + +ACTOR Future _clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion) { + if (!destUidValue.size()) { + return Void(); + } + + state Key backupLatestVersionsPath = destUidValue.withPrefix(backupLatestVersionsPrefix); + state Key backupLatestVersionsKey = logUidValue.withPrefix(backupLatestVersionsPath); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + + Optional v = wait(tr->get(backupLatestVersionsKey)); + if (!v.present()) { + return Void(); + } + + state Standalone backupVersions = wait(tr->getRange(KeyRangeRef(backupLatestVersionsPath, strinc(backupLatestVersionsPath)), CLIENT_KNOBS->TOO_MANY)); + Version nextSmallestVersion = endVersion; + bool clearLogRangesRequired = true; + + // More than one backup/DR with the same range + if (backupVersions.size() > 1) { + bool countSelf = false; + + for (auto backupVersion : backupVersions) { + Version currVersion = BinaryReader::fromStringRef(backupVersion.value, Unversioned()); + if (currVersion > beginVersion) { + if (currVersion < nextSmallestVersion) { + nextSmallestVersion = currVersion; + } + } else if (currVersion == beginVersion && !countSelf) { + countSelf = true; + } else { + clearLogRangesRequired = false; + break; + } + } + } + + if (clearVersionHistory && backupVersions.size() == 1) { + tr->clear(prefixRange(backupLatestVersionsPath)); + tr->clear(prefixRange(destUidValue.withPrefix(backupLogKeys.begin))); + } else { + if (clearVersionHistory) { + // Clear current backup version history + tr->clear(backupLatestVersionsKey); + } else { + // Update current backup latest version + tr->set(backupLatestVersionsKey, BinaryWriter::toValue(endVersion, Unversioned())); + } + + // Clear log ranges if needed + if (clearLogRangesRequired) { + Standalone> ranges = getLogRanges(beginVersion, nextSmallestVersion, destUidValue); + for (auto& range : ranges) { + tr->clear(range); + } + } + } + + return Void(); +} + +Future clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion) { + return _clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, beginVersion, endVersion); +} \ No newline at end of file diff --git a/fdbclient/DatabaseBackupAgent.actor.cpp b/fdbclient/DatabaseBackupAgent.actor.cpp index f8f9abba27..71120f76e9 100644 --- a/fdbclient/DatabaseBackupAgent.actor.cpp +++ b/fdbclient/DatabaseBackupAgent.actor.cpp @@ -94,6 +94,7 @@ namespace dbBackup { if (source) { copyParameter(source, dest, BackupAgentBase::keyFolderId); 
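The heart of _clearLogRanges above is the decision of how far the shared log may be trimmed: it scans every consumer's latest-version entry under the destUid, refuses to clear if any other consumer still needs data at or below this consumer's beginVersion, and otherwise clears only up to the next smallest version held by someone else. A standalone sketch of that decision (names and the version numbers in main are illustrative):

```
#include <cstdint>
#include <vector>

// Given the per-consumer "latest version" entries under one destUid, decide how far the
// shared mutation log may be cleared when the consumer currently at `beginVersion` advances
// to `endVersion`. Mirrors the decision in _clearLogRanges.
struct ClearDecision {
    bool clearAllowed;   // false if another consumer still needs data at/below beginVersion
    int64_t clearUpTo;   // clear [beginVersion, clearUpTo) of the shared log if allowed
};

ClearDecision decideClear(const std::vector<int64_t>& consumerVersions,
                          int64_t beginVersion, int64_t endVersion) {
    ClearDecision d{true, endVersion};
    if (consumerVersions.size() > 1) {
        bool countedSelf = false;
        for (int64_t v : consumerVersions) {
            if (v > beginVersion) {
                // Another consumer is ahead of us; we may clear at most up to its position.
                if (v < d.clearUpTo) d.clearUpTo = v;
            } else if (v == beginVersion && !countedSelf) {
                countedSelf = true;              // our own entry
            } else {
                d.clearAllowed = false;          // someone still needs versions <= beginVersion
                break;
            }
        }
    }
    return d;
}

int main() {
    // Two DRs share one log: we are at version 100, the other has read up to 250.
    ClearDecision d = decideClear({100, 250}, /*beginVersion=*/100, /*endVersion=*/400);
    // We may clear [100, 250); the other consumer still needs 250 onward.
    return (d.clearAllowed && d.clearUpTo == 250) ? 0 : 1;
}
```

When only one consumer remains and its history is being dropped (clearVersionHistory), the whole destUid-prefixed log and the version-history subspace are cleared outright, as the hunk shows.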
copyParameter(source, dest, BackupAgentBase::keyConfigLogUid); + copyParameter(source, dest, BackupAgentBase::destUid); copyParameter(source, dest, DatabaseBackupAgent::keyAddPrefix); copyParameter(source, dest, DatabaseBackupAgent::keyRemovePrefix); @@ -469,23 +470,38 @@ namespace dbBackup { Future execute(Database cx, Reference tb, Reference fb, Reference task) { return _execute(cx, tb, fb, task); }; Future finish(Reference tr, Reference tb, Reference fb, Reference task) { return _finish(tr, tb, fb, task); }; - ACTOR static Future eraseLogData(Database cx, Reference task, Version beginVersion, Version endVersion) { + ACTOR static Future eraseLogData(Database cx, Reference task, bool backupDone, Version beginVersion, Version endVersion) { if (endVersion <= beginVersion) return Void(); - state Transaction tr(cx); - loop{ - try { - tr.setOption(FDBTransactionOptions::LOCK_AWARE); - Standalone> ranges = getLogRanges(beginVersion, endVersion, task->params[BackupAgentBase::keyConfigLogUid]); - for (auto & rng : ranges) - tr.clear(rng); - Void _ = wait(tr.commit()); - return Void(); - } - catch (Error &e) { - Void _ = wait(tr.onError(e)); - } + + state Version currBeginVersion = beginVersion; + state Version currEndVersion; + state bool clearVersionHistory = false; + + while (currBeginVersion < endVersion) { + state Reference tr(new ReadYourWritesTransaction(cx)); + + loop{ + try { + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + if (backupDone && currEndVersion == endVersion) { + clearVersionHistory = true; + } + + Void _ = wait(clearLogRanges(tr, clearVersionHistory, task->params[BackupAgentBase::keyConfigLogUid], task->params[BackupAgentBase::destUid], currBeginVersion, currEndVersion)); + Void _ = wait(tr->commit()); + currBeginVersion = currEndVersion; + break; + } catch (Error &e) { + Void _ = wait(tr->onError(e)); + } + } } + + return Void(); } ACTOR static Future _execute(Database cx, Reference taskBucket, Reference futureBucket, Reference task) { @@ -495,13 +511,14 @@ namespace dbBackup { Version beginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyBeginVersion], Unversioned()); Version endVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyEndVersion], Unversioned()); + bool backupDone = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::backupDone], Unversioned()); - Void _ = wait(eraseLogData(taskBucket->src, task, beginVersion, endVersion)); + Void _ = wait(eraseLogData(taskBucket->src, task, backupDone, beginVersion, endVersion)); return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, Version endVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, Version endVersion, bool backupDone, TaskCompletionKey completionKey, Reference waitFor = Reference()) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(EraseLogRangeTaskFunc::name, EraseLogRangeTaskFunc::version, doneKey, 1)); @@ -509,6 +526,7 @@ namespace dbBackup { task->params[DatabaseBackupAgent::keyBeginVersion] = BinaryWriter::toValue(beginVersion, Unversioned()); task->params[DatabaseBackupAgent::keyEndVersion] = 
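The reworked eraseLogData above no longer clears the whole version range in one transaction; it walks [beginVersion, endVersion) in fixed-size chunks of CLEAR_LOG_RANGE_COUNT * LOG_RANGE_BLOCK_SIZE versions, committing one transaction per chunk, and only the final chunk of a finished backup also drops the version-history entry. A sketch of that loop structure with placeholder knob values (the real knob values are not shown in this hunk):

```
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
    // Placeholder values; the real knobs live in fdbclient/Knobs.cpp.
    const int64_t CLEAR_LOG_RANGE_COUNT = 1500;
    const int64_t LOG_RANGE_BLOCK_SIZE  = 1 << 20;
    const int64_t chunk = CLEAR_LOG_RANGE_COUNT * LOG_RANGE_BLOCK_SIZE;

    int64_t beginVersion = 0, endVersion = 5 * chunk + 12345;
    bool backupDone = true;

    int64_t currBegin = beginVersion;
    while (currBegin < endVersion) {
        int64_t currEnd = std::min(currBegin + chunk, endVersion);
        // Only the final chunk of a finished backup also drops the per-consumer
        // version-history entry (clearVersionHistory in the patch).
        bool clearVersionHistory = backupDone && currEnd == endVersion;

        // Here the real code calls clearLogRanges(tr, clearVersionHistory, ...) and commits,
        // retrying this chunk on transaction errors before moving on to the next one.
        std::cout << "clear [" << currBegin << ", " << currEnd << ") history="
                  << clearVersionHistory << "\n";
        currBegin = currEnd;
    }
    return 0;
}
```

Splitting the work this way keeps each transaction well under the size and duration limits even when a large backlog of log versions has to be erased.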
BinaryWriter::toValue(endVersion, Unversioned()); + task->params[DatabaseBackupAgent::backupDone] = BinaryWriter::toValue(backupDone, Unversioned()); if (!waitFor) { return taskBucket->addTask(tr, task, parentTask->params[Task::reservedTaskParamValidKey], task->params[BackupAgentBase::keyFolderId]); @@ -520,8 +538,6 @@ namespace dbBackup { ACTOR static Future _finish(Reference tr, Reference taskBucket, Reference futureBucket, Reference task) { - state Version beginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyBeginVersion], Unversioned()); - state Version endVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyEndVersion], Unversioned()); state Reference taskFuture = futureBucket->unpack(task->params[Task::reservedTaskParamKeyDone]); Void _ = wait(taskFuture->set(tr, taskBucket) && taskBucket->finish(tr, task)); @@ -609,7 +625,7 @@ namespace dbBackup { tr.addReadConflictRange(singleKeyRange(kv.key)); first = false; } - tr.set(kv.key.removePrefix(backupLogKeys.begin).withPrefix(applyLogKeys.begin), kv.value); + tr.set(kv.key.removePrefix(backupLogKeys.begin).removePrefix(task->params[BackupAgentBase::destUid]).withPrefix(task->params[BackupAgentBase::keyConfigLogUid]).withPrefix(applyLogKeys.begin), kv.value); bytesSet += kv.expectedSize() - backupLogKeys.begin.expectedSize() + applyLogKeys.begin.expectedSize(); } } @@ -644,7 +660,7 @@ namespace dbBackup { state Version endVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyEndVersion], Unversioned()); state Version newEndVersion = std::min(endVersion, (((beginVersion-1) / CLIENT_KNOBS->BACKUP_BLOCK_SIZE) + 2 + (g_network->isSimulated() ? CLIENT_KNOBS->BACKUP_SIM_COPY_LOG_RANGES : 0)) * CLIENT_KNOBS->BACKUP_BLOCK_SIZE); - state Standalone> ranges = getLogRanges(beginVersion, newEndVersion, task->params[BackupAgentBase::keyConfigLogUid], CLIENT_KNOBS->BACKUP_BLOCK_SIZE); + state Standalone> ranges = getLogRanges(beginVersion, newEndVersion, task->params[BackupAgentBase::destUid], CLIENT_KNOBS->BACKUP_BLOCK_SIZE); state std::vector> results; state std::vector> rc; state std::vector> dump; @@ -695,12 +711,10 @@ namespace dbBackup { if (task->params.find(CopyLogRangeTaskFunc::keyNextBeginVersion) != task->params.end()) { state Version nextVersion = BinaryReader::fromStringRef(task->params[CopyLogRangeTaskFunc::keyNextBeginVersion], Unversioned()); Void _ = wait(success(CopyLogRangeTaskFunc::addTask(tr, taskBucket, task, nextVersion, endVersion, TaskCompletionKey::signal(taskFuture->key))) && - success(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, beginVersion, nextVersion, TaskCompletionKey::noSignal())) && taskBucket->finish(tr, task)); } else { - Void _ = wait(success(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, beginVersion, endVersion, TaskCompletionKey::noSignal())) && - taskFuture->set(tr, taskBucket) && + Void _ = wait(taskFuture->set(tr, taskBucket) && taskBucket->finish(tr, task)); } @@ -722,6 +736,7 @@ namespace dbBackup { Void _ = wait(checkTaskVersion(tr, task, CopyLogsTaskFunc::name, CopyLogsTaskFunc::version)); state Version beginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyBeginVersion], Unversioned()); + state Version prevBeginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyPrevBeginVersion], Unversioned()); state Future> fStopValue = tr->get(states.pack(DatabaseBackupAgent::keyCopyStop)); state Future> fAppliedValue = 
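The CopyLogRangeTaskFunc change above rewrites each copied key: a mutation stored in the shared source log under backupLogKeys.begin + destUid is re-homed under the destination's per-DR apply space, applyLogKeys.begin + logUid. A small sketch of that prefix rewrite; the readable string prefixes below are stand-ins for the real binary prefixes:

```
#include <cassert>
#include <string>

// removePrefix(backupLogKeys.begin).removePrefix(destUid)
//   .withPrefix(logUid).withPrefix(applyLogKeys.begin), as in the hunk above.
// Note the withPrefix calls are applied inside-out: logUid first, then the apply prefix.
std::string rewriteForApply(const std::string& srcKey,
                            const std::string& backupLogPrefix, const std::string& destUid,
                            const std::string& applyLogPrefix, const std::string& logUid) {
    std::string rest = srcKey.substr(backupLogPrefix.size() + destUid.size());
    return applyLogPrefix + logUid + rest;
}

int main() {
    std::string srcKey = "blog/DESTUID/v0000123/chunk0";
    std::string dstKey = rewriteForApply(srcKey, "blog/", "DESTUID/", "applog/", "LOGUID/");
    assert(dstKey == "applog/LOGUID/v0000123/chunk0");
    return 0;
}
```

Stripping the destUid and re-adding the logUid is what lets several DRs consume one shared log while each still applies mutations out of its own private staging space on the destination.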
tr->get(task->params[BackupAgentBase::keyConfigLogUid].withPrefix(applyMutationsBeginRange.begin)); @@ -733,7 +748,7 @@ namespace dbBackup { if (endVersion <= beginVersion) { Void _ = wait(delay(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY)); - Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, beginVersion, TaskCompletionKey::signal(onDone))); + Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, TaskCompletionKey::signal(onDone))); Void _ = wait(taskBucket->finish(tr, task)); return Void(); } @@ -758,7 +773,7 @@ namespace dbBackup { if ((stopVersionData == -1) || (stopVersionData >= applyVersion)) { state Reference allPartsDone = futureBucket->future(tr); std::vector> addTaskVector; - addTaskVector.push_back(CopyLogsTaskFunc::addTask(tr, taskBucket, task, endVersion, TaskCompletionKey::signal(onDone), allPartsDone)); + addTaskVector.push_back(CopyLogsTaskFunc::addTask(tr, taskBucket, task, beginVersion, endVersion, TaskCompletionKey::signal(onDone), allPartsDone)); int blockSize = std::max(1, ((endVersion - beginVersion)/CLIENT_KNOBS->BACKUP_COPY_TASKS)/CLIENT_KNOBS->BACKUP_BLOCK_SIZE); for (int64_t vblock = beginVersion / CLIENT_KNOBS->BACKUP_BLOCK_SIZE; vblock < (endVersion + CLIENT_KNOBS->BACKUP_BLOCK_SIZE - 1) / CLIENT_KNOBS->BACKUP_BLOCK_SIZE; vblock += blockSize) { addTaskVector.push_back(CopyLogRangeTaskFunc::addTask(tr, taskBucket, task, @@ -766,11 +781,17 @@ namespace dbBackup { std::min(endVersion, (vblock + blockSize) * CLIENT_KNOBS->BACKUP_BLOCK_SIZE), TaskCompletionKey::joinWith(allPartsDone))); } + + // Do not erase at the first time + if (prevBeginVersion > 0) { + addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, false, TaskCompletionKey::joinWith(allPartsDone))); + } + Void _ = wait(waitForAll(addTaskVector) && taskBucket->finish(tr, task)); } else { if(appliedVersion <= stopVersionData) { Void _ = wait(delay(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY)); - Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, beginVersion, TaskCompletionKey::signal(onDone))); + Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, TaskCompletionKey::signal(onDone))); Void _ = wait(taskBucket->finish(tr, task)); return Void(); } @@ -782,12 +803,13 @@ namespace dbBackup { return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version prevBeginVersion, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(CopyLogsTaskFunc::name, CopyLogsTaskFunc::version, doneKey, 1)); copyDefaultParameters(parentTask, task); task->params[BackupAgentBase::keyBeginVersion] = BinaryWriter::toValue(beginVersion, Unversioned()); + task->params[DatabaseBackupAgent::keyPrevBeginVersion] = BinaryWriter::toValue(prevBeginVersion, Unversioned()); if (!waitFor) { return taskBucket->addTask(tr, task, parentTask->params[Task::reservedTaskParamValidKey], task->params[BackupAgentBase::keyFolderId]); @@ -839,29 +861,74 @@ namespace dbBackup { } } - state Transaction tr(taskBucket->src); + state Reference tr(new ReadYourWritesTransaction(taskBucket->src)); + state Key logUidValue = task->params[DatabaseBackupAgent::keyConfigLogUid]; + state Key 
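The CopyLogsTaskFunc changes above thread a prevBeginVersion alongside beginVersion so that each pass copies the current window and schedules an erase of the window the previous pass already copied; on the very first pass (prevBeginVersion == 0) there is nothing to erase yet. Deferring the erase by one pass matters because, with shared logs, deletion now has to consult other consumers' progress rather than being done unconditionally inside CopyLogRange. A sketch of how the (prevBegin, begin) pair slides from pass to pass (versions are illustrative):

```
#include <cstdint>
#include <iostream>

int main() {
    int64_t prevBegin = 0;        // 0 means "first pass, nothing to erase yet"
    int64_t begin = 1000;

    for (int pass = 0; pass < 4; pass++) {
        int64_t end = begin + 5000;   // stand-in for the source cluster's current version

        std::cout << "pass " << pass << ": copy [" << begin << ", " << end << ")";
        if (prevBegin > 0)
            std::cout << ", erase [" << prevBegin << ", " << begin << ")";
        std::cout << "\n";

        // The next pass starts where this one ended, remembering this pass's begin
        // so its window can be erased once this pass's copy tasks have finished.
        prevBegin = begin;
        begin = end;
    }
    return 0;
}
```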
destUidValue = task->params[BackupAgentBase::destUid]; + state Version beginVersion; + state Version endVersion; + loop { try { - tr.setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait( tr.get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + Optional v = wait( tr->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) return Void(); - UID logUid = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyConfigLogUid], Unversioned()); - Key configPath = uidPrefixKey(logRangesRange.begin, logUid); - Key logsPath = uidPrefixKey(backupLogKeys.begin, logUid); + Key configPath = logUidValue.withPrefix(logRangesRange.begin); + tr->clear(KeyRangeRef(configPath, strinc(configPath))); - tr.set(sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_COMPLETED))); - tr.clear(KeyRangeRef(configPath, strinc(configPath))); - tr.clear(KeyRangeRef(logsPath, strinc(logsPath))); + state Key latestVersionKey = logUidValue.withPrefix(task->params[BackupAgentBase::destUid].withPrefix(backupLatestVersionsPrefix)); + state Optional bVersion = wait(tr->get(latestVersionKey)); - Void _ = wait(tr.commit()); + if (!bVersion.present()) { + return Void(); + } + beginVersion = BinaryReader::fromStringRef(bVersion.get(), Unversioned()); + + Void _ = wait(tr->commit()); + endVersion = tr->getCommittedVersion(); break; } catch(Error &e) { - Void _ = wait(tr.onError(e)); + Void _ = wait(tr->onError(e)); } } + state bool clearVersionHistory = false; + state Version currBeginVersion = beginVersion; + state Version currEndVersion; + + while (currBeginVersion < endVersion) { + state Reference clearSrcTr(new ReadYourWritesTransaction(taskBucket->src)); + + loop{ + try { + clearSrcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + clearSrcTr->setOption(FDBTransactionOptions::LOCK_AWARE); + Optional v = wait( clearSrcTr->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); + if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) + return Void(); + + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); + + if (currEndVersion == endVersion) { + clearVersionHistory = true; + } + + + + + Void _ = wait(clearLogRanges(clearSrcTr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); + Void _ = wait(clearSrcTr->commit()); + currBeginVersion = currEndVersion; + break; + } catch (Error &e) { + Void _ = wait(clearSrcTr->onError(e)); + } + } + } + + return Void(); } @@ -920,6 +987,7 @@ namespace dbBackup { Void _ = wait(checkTaskVersion(tr, task, CopyDiffLogsTaskFunc::name, CopyDiffLogsTaskFunc::version)); state Version beginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyBeginVersion], Unversioned()); + state Version prevBeginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyPrevBeginVersion], Unversioned()); state Future> fStopWhenDone = tr->get(conf.pack(DatabaseBackupAgent::keyConfigStopWhenDoneKey)); Transaction srcTr(taskBucket->src); @@ -930,7 +998,7 @@ namespace dbBackup { if 
(endVersion <= beginVersion) { Void _ = wait(delay(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY)); - Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, beginVersion, TaskCompletionKey::signal(onDone))); + Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, TaskCompletionKey::signal(onDone))); Void _ = wait(taskBucket->finish(tr, task)); return Void(); } @@ -945,7 +1013,7 @@ namespace dbBackup { if (!stopWhenDone.present()) { state Reference allPartsDone = futureBucket->future(tr); std::vector> addTaskVector; - addTaskVector.push_back(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, endVersion, TaskCompletionKey::signal(onDone), allPartsDone)); + addTaskVector.push_back(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, beginVersion, endVersion, TaskCompletionKey::signal(onDone), allPartsDone)); int blockSize = std::max(1, ((endVersion - beginVersion)/ CLIENT_KNOBS->BACKUP_COPY_TASKS)/CLIENT_KNOBS->BACKUP_BLOCK_SIZE); for (int64_t vblock = beginVersion / CLIENT_KNOBS->BACKUP_BLOCK_SIZE; vblock < (endVersion + CLIENT_KNOBS->BACKUP_BLOCK_SIZE - 1) / CLIENT_KNOBS->BACKUP_BLOCK_SIZE; vblock += blockSize) { addTaskVector.push_back(CopyLogRangeTaskFunc::addTask(tr, taskBucket, task, @@ -953,6 +1021,11 @@ namespace dbBackup { std::min(endVersion, (vblock + blockSize) * CLIENT_KNOBS->BACKUP_BLOCK_SIZE), TaskCompletionKey::joinWith(allPartsDone))); } + + if (prevBeginVersion > 0) { + addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, false, TaskCompletionKey::joinWith(allPartsDone))); + } + Void _ = wait(waitForAll(addTaskVector) && taskBucket->finish(tr, task)); } else { Void _ = wait(onDone->set(tr, taskBucket) && taskBucket->finish(tr, task)); @@ -960,13 +1033,14 @@ namespace dbBackup { return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version prevBeginVersion, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(CopyDiffLogsTaskFunc::name, CopyDiffLogsTaskFunc::version, doneKey, 1)); copyDefaultParameters(parentTask, task); task->params[DatabaseBackupAgent::keyBeginVersion] = BinaryWriter::toValue(beginVersion, Unversioned()); + task->params[DatabaseBackupAgent::keyPrevBeginVersion] = BinaryWriter::toValue(prevBeginVersion, Unversioned()); if (!waitFor) { return taskBucket->addTask(tr, task, parentTask->params[Task::reservedTaskParamValidKey], task->params[BackupAgentBase::keyFolderId]); @@ -1033,7 +1107,7 @@ namespace dbBackup { tr->set(states.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_DIFFERENTIAL))); allPartsDone = futureBucket->future(tr); - Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, restoreVersion, TaskCompletionKey::joinWith(allPartsDone))); + Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, 0, restoreVersion, TaskCompletionKey::joinWith(allPartsDone))); // After the Backup completes, clear the backup subspace and update the status Key _ = wait(FinishedFullBackupTaskFunc::addTask(tr, taskBucket, task, TaskCompletionKey::noSignal(), allPartsDone)); @@ -1071,48 +1145,106 @@ namespace dbBackup { static const uint32_t version; ACTOR static 
Future _execute(Database cx, Reference taskBucket, Reference futureBucket, Reference task) { - state Subspace sourceStates = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceStates).get(task->params[BackupAgentBase::keyConfigLogUid]); + state Key logUidValue = task->params[DatabaseBackupAgent::keyConfigLogUid]; + state Subspace sourceStates = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceStates).get(logUidValue); Void _ = wait(checkTaskVersion(cx, task, StartFullBackupTaskFunc::name, StartFullBackupTaskFunc::version)); - state UID logUid = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyConfigLogUid], Unversioned()); - state Key logUidDest = uidPrefixKey(backupLogKeys.begin, logUid); + state Key destUidValue(logUidValue); + state UID logUid = BinaryReader::fromStringRef(logUidValue, Unversioned()); + state Standalone> backupRanges = BinaryReader::fromStringRef>>(task->params[DatabaseBackupAgent::keyConfigBackupRanges], IncludeVersion()); - state Transaction tr(taskBucket->src); + + state Reference srcTr(new ReadYourWritesTransaction(taskBucket->src)); + loop { + try { + srcTr->setOption(FDBTransactionOptions::LOCK_AWARE); + srcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + // Initialize destUid + if (backupRanges.size() == 1) { + state Key destUidLookupPath = BinaryWriter::toValue(backupRanges[0], IncludeVersion()).withPrefix(destUidLookupPrefix); + Optional existingDestUidValue = wait(srcTr->get(destUidLookupPath)); + if (existingDestUidValue.present()) { + destUidValue = existingDestUidValue.get(); + } else { + destUidValue = BinaryWriter::toValue(g_random->randomUniqueID(), Unversioned()); + srcTr->set(destUidLookupPath, destUidValue); + } + } + + task->params[BackupAgentBase::destUid] = destUidValue; + + Void _ = wait(srcTr->commit()); + break; + } catch(Error &e) { + Void _ = wait(srcTr->onError(e)); + } + } + loop { try { - tr.setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait( tr.get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); - task->params[DatabaseBackupAgent::keyBeginVersion] = BinaryWriter::toValue(tr.getReadVersion().get(), Unversioned()); + state Reference tr(new ReadYourWritesTransaction(cx)); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + state Future verified = taskBucket->keepRunning(tr, task); + Void _ = wait(verified); + + Subspace config = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyConfig).get(logUidValue); + tr->set(config.get(logUidValue).pack(BackupAgentBase::destUid), task->params[BackupAgentBase::destUid]); + Void _ = wait(tr->commit()); + break; + } catch (Error &e) { + Void _ = wait(srcTr->onError(e)); + } + } + + loop { + try { + state Reference srcTr2(new ReadYourWritesTransaction(taskBucket->src)); + srcTr2->setOption(FDBTransactionOptions::LOCK_AWARE); + srcTr2->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + state Optional v = wait( srcTr2->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); + + state Standalone beginVersion = BinaryWriter::toValue(srcTr2->getReadVersion().get(), Unversioned()); + task->params[BackupAgentBase::keyBeginVersion] = beginVersion; if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) >= BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) return Void(); - tr.set( 
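The destUid initialization in StartFullBackupTaskFunc above is what makes sharing possible: the (single) backup range is used as a lookup key under destUidLookupPrefix on the source cluster, so the first backup or DR over a range mints a fresh destUid and later ones over the exact same range find and reuse it, sharing one mutation log. A sketch of that get-or-create step, with an in-memory map standing in for the source cluster's system keyspace and the uid strings as placeholders:

```
#include <iostream>
#include <map>
#include <string>

// The map stands in for the destUidLookupPrefix subspace on the source cluster.
std::string getOrCreateDestUid(std::map<std::string, std::string>& destUidLookup,
                               const std::string& backupRange,
                               const std::string& freshUid) {
    auto it = destUidLookup.find(backupRange);
    if (it != destUidLookup.end())
        return it->second;                  // share the existing mutation log
    destUidLookup[backupRange] = freshUid;  // first consumer of this range
    return freshUid;
}

int main() {
    std::map<std::string, std::string> lookup;
    std::string a = getOrCreateDestUid(lookup, "['a','z')", "uid-1");
    std::string b = getOrCreateDestUid(lookup, "['a','z')", "uid-2");  // same range -> same destUid
    std::cout << (a == b ? "shared" : "separate") << "\n";             // prints "shared"
    return 0;
}
```

As the hunk shows, the lookup is only attempted when there is exactly one backup range; otherwise the destUid simply falls back to the logUid and the log is not shared.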
Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceTagName).pack(task->params[BackupAgentBase::keyTagName]), task->params[BackupAgentBase::keyConfigLogUid] ); - tr.set( sourceStates.pack(DatabaseBackupAgent::keyFolderId), task->params[DatabaseBackupAgent::keyFolderId] ); - tr.set( sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_BACKUP))); + Key versionKey = logUidValue.withPrefix(destUidValue).withPrefix(backupLatestVersionsPrefix); + srcTr2->set(versionKey, beginVersion); + srcTr2->set( Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceTagName).pack(task->params[BackupAgentBase::keyTagName]), logUidValue ); + srcTr2->set( sourceStates.pack(DatabaseBackupAgent::keyFolderId), task->params[DatabaseBackupAgent::keyFolderId] ); + srcTr2->set( sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_BACKUP))); + + state Key destPath = destUidValue.withPrefix(backupLogKeys.begin); // Start logging the mutations for the specified ranges of the tag for (auto &backupRange : backupRanges) { - tr.set(logRangesEncodeKey(backupRange.begin, logUid), logRangesEncodeValue(backupRange.end, logUidDest)); + srcTr2->set(logRangesEncodeKey(backupRange.begin, logUid), logRangesEncodeValue(backupRange.end, destPath)); } - Void _ = wait(tr.commit()); - return Void(); - } catch(Error &e) { - Void _ = wait(tr.onError(e)); + Void _ = wait(srcTr2->commit()); + break; + } catch (Error &e) { + Void _ = wait(srcTr2->onError(e)); } } + + return Void(); } ACTOR static Future _finish(Reference tr, Reference taskBucket, Reference futureBucket, Reference task) { - state Subspace states = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyStates).get(task->params[BackupAgentBase::keyConfigLogUid]); + state Key logUidValue = task->params[BackupAgentBase::keyConfigLogUid]; + state Subspace states = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyStates).get(logUidValue); + state Subspace config = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyConfig).get(logUidValue); state Version beginVersion = BinaryReader::fromStringRef(task->params[BackupAgentBase::keyBeginVersion], Unversioned()); state Standalone> backupRanges = BinaryReader::fromStringRef>>(task->params[DatabaseBackupAgent::keyConfigBackupRanges], IncludeVersion()); - TraceEvent("DBA_StartFullBackup").detail("beginVer", beginVersion); - tr->set(task->params[BackupAgentBase::keyConfigLogUid].withPrefix(applyMutationsBeginRange.begin), BinaryWriter::toValue(beginVersion, Unversioned())); - tr->set(task->params[BackupAgentBase::keyConfigLogUid].withPrefix(applyMutationsEndRange.begin), BinaryWriter::toValue(beginVersion, Unversioned())); + tr->set(logUidValue.withPrefix(applyMutationsBeginRange.begin), BinaryWriter::toValue(beginVersion, Unversioned())); + tr->set(logUidValue.withPrefix(applyMutationsEndRange.begin), BinaryWriter::toValue(beginVersion, Unversioned())); tr->set(states.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_BACKUP))); state Reference kvBackupRangeComplete = futureBucket->future(tr); @@ -1132,7 +1264,7 @@ namespace dbBackup { Key _ = wait(FinishFullBackupTaskFunc::addTask(tr, taskBucket, task, TaskCompletionKey::noSignal(), kvBackupRangeComplete)); // Backup the logs which will create BackupLogRange tasks - Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, 
beginVersion, TaskCompletionKey::joinWith(kvBackupComplete))); + Key _ = wait(CopyLogsTaskFunc::addTask(tr, taskBucket, task, 0, beginVersion, TaskCompletionKey::joinWith(kvBackupComplete))); // After the Backup completes, clear the backup subspace and update the status Key _ = wait(BackupRestorableTaskFunc::addTask(tr, taskBucket, task, TaskCompletionKey::noSignal(), kvBackupComplete)); @@ -1141,7 +1273,7 @@ namespace dbBackup { return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Key logUid, Key backupUid, Key keyAddPrefix, Key keyRemovePrefix, Key keyConfigBackupRanges, Key tagName, TaskCompletionKey completionKey, Reference waitFor = Reference(), bool databasesInSync=false) + ACTOR static Future addTask(Reference tr, Reference taskBucket, Key logUid, Key backupUid, /*Key destUid,*/ Key keyAddPrefix, Key keyRemovePrefix, Key keyConfigBackupRanges, Key tagName, TaskCompletionKey completionKey, Reference waitFor = Reference(), bool databasesInSync=false) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(StartFullBackupTaskFunc::name, StartFullBackupTaskFunc::version, doneKey)); @@ -1492,7 +1624,8 @@ public: ACTOR static Future abortBackup(DatabaseBackupAgent* backupAgent, Database cx, Key tagName, bool partial) { state Reference tr(new ReadYourWritesTransaction(cx)); - state Key logUid; + state Key logUidValue, destUidValue; + state UID logUid, destUid; state Value backupUid; loop { @@ -1502,25 +1635,34 @@ public: tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); UID _logUid = wait(backupAgent->getLogUid(tr, tagName)); - logUid = BinaryWriter::toValue(_logUid, Unversioned()); + logUid = _logUid; + logUidValue = BinaryWriter::toValue(logUid, Unversioned()); - int status = wait(backupAgent->getStateValue(tr, _logUid)); + state Future statusFuture= backupAgent->getStateValue(tr, logUid); + state Future destUidFuture = backupAgent->getDestUid(tr, logUid); + Void _ = wait(success(statusFuture) && success(destUidFuture)); + + UID destUid = destUidFuture.get(); + if (destUid.isValid()) { + destUidValue = BinaryWriter::toValue(destUid, Unversioned()); + } + int status = statusFuture.get(); if (!backupAgent->isRunnable((BackupAgentBase::enumState)status)) { throw backup_unneeded(); } - Optional _backupUid = wait(tr->get(backupAgent->states.get(logUid).pack(DatabaseBackupAgent::keyFolderId))); + Optional _backupUid = wait(tr->get(backupAgent->states.get(logUidValue).pack(DatabaseBackupAgent::keyFolderId))); backupUid = _backupUid.get(); // Clearing the folder id will prevent future tasks from executing - tr->clear(backupAgent->config.get(logUid).range()); + tr->clear(backupAgent->config.get(logUidValue).range()); // Clearing the end version of apply mutation cancels ongoing apply work - tr->clear(logUid.withPrefix(applyMutationsEndRange.begin)); + tr->clear(logUidValue.withPrefix(applyMutationsEndRange.begin)); - tr->clear(prefixRange(logUid.withPrefix(applyLogKeys.begin))); + tr->clear(prefixRange(logUidValue.withPrefix(applyLogKeys.begin))); - tr->set(StringRef(backupAgent->states.get(logUid).pack(DatabaseBackupAgent::keyStateStatus)), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_PARTIALLY_ABORTED))); + tr->set(StringRef(backupAgent->states.get(logUidValue).pack(DatabaseBackupAgent::keyStateStatus)), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_PARTIALLY_ABORTED))); Void _ = wait(tr->commit()); TraceEvent("DBA_Abort").detail("commitVersion", tr->getCommittedVersion()); @@ -1545,7 
+1687,7 @@ public: tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); try { // Ensure that we're at a version higher than the data that we've written. - Optional lastApplied = wait(tr->get(logUid.withPrefix(applyMutationsBeginRange.begin))); + Optional lastApplied = wait(tr->get(logUidValue.withPrefix(applyMutationsBeginRange.begin))); if (lastApplied.present()) { Version current = tr->getReadVersion().get(); Version applied = BinaryReader::fromStringRef(lastApplied.get(), Unversioned()); @@ -1576,23 +1718,38 @@ public: return Void(); state Reference srcTr(new ReadYourWritesTransaction(backupAgent->taskBucket->src)); + state Version beginVersion; + state Version endVersion; + state bool clearSrcDb = true; loop { try { srcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); srcTr->setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait( srcTr->get( backupAgent->sourceStates.get(logUid).pack(DatabaseBackupAgent::keyFolderId) ) ); + Optional v = wait( srcTr->get( backupAgent->sourceStates.get(logUidValue).pack(DatabaseBackupAgent::keyFolderId) ) ); - if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(backupUid, Unversioned())) + if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(backupUid, Unversioned())) { + clearSrcDb = false; break; + } - srcTr->set( backupAgent->sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_ABORTED) )); - srcTr->set( backupAgent->sourceStates.get(logUid).pack(DatabaseBackupAgent::keyFolderId), backupUid ); + Key latestVersionKey = logUidValue.withPrefix(destUidValue.withPrefix(backupLatestVersionsPrefix)); - srcTr->clear(prefixRange(logUid.withPrefix(backupLogKeys.begin))); - srcTr->clear(prefixRange(logUid.withPrefix(logRangesRange.begin))); + Optional bVersion = wait(srcTr->get(latestVersionKey)); + if (bVersion.present()) { + beginVersion = BinaryReader::fromStringRef(bVersion.get(), Unversioned()); + } else { + clearSrcDb = false; + break; + } + + srcTr->set( backupAgent->sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_PARTIALLY_ABORTED) )); + srcTr->set( backupAgent->sourceStates.get(logUidValue).pack(DatabaseBackupAgent::keyFolderId), backupUid ); + srcTr->clear(prefixRange(logUidValue.withPrefix(logRangesRange.begin))); Void _ = wait(srcTr->commit()); + endVersion = srcTr->getCommittedVersion(); + break; } catch (Error &e) { @@ -1600,18 +1757,48 @@ public: } } + if (clearSrcDb) { + state bool clearVersionHistory = false; + state Version currBeginVersion = beginVersion; + state Version currEndVersion; + + while (currBeginVersion < endVersion) { + state Reference clearSrcTr(new ReadYourWritesTransaction(backupAgent->taskBucket->src)); + + loop{ + try { + clearSrcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + clearSrcTr->setOption(FDBTransactionOptions::LOCK_AWARE); + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); + + if (currEndVersion == endVersion) { + clearSrcTr->set( backupAgent->sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_ABORTED) )); + clearVersionHistory = true; + } + + Void _ = wait(clearLogRanges(clearSrcTr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); + Void _ = 
wait(clearSrcTr->commit()); + currBeginVersion = currEndVersion; + break; + } catch (Error &e) { + Void _ = wait(clearSrcTr->onError(e)); + } + } + } + } + tr = Reference(new ReadYourWritesTransaction(cx)); loop { try { tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); tr->setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait(tr->get(StringRef(backupAgent->config.get(logUid).pack(DatabaseBackupAgent::keyFolderId)))); + Optional v = wait(tr->get(StringRef(backupAgent->config.get(logUidValue).pack(DatabaseBackupAgent::keyFolderId)))); if(v.present()) { return Void(); } - tr->set(StringRef(backupAgent->states.get(logUid).pack(DatabaseBackupAgent::keyStateStatus)), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_ABORTED))); + tr->set(StringRef(backupAgent->states.get(logUidValue).pack(DatabaseBackupAgent::keyStateStatus)), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_ABORTED))); Void _ = wait(tr->commit()); @@ -1753,6 +1940,15 @@ public: return (!status.present()) ? DatabaseBackupAgent::STATE_NEVERRAN : BackupAgentBase::getState(status.get().toString()); } + ACTOR static Future getDestUid(DatabaseBackupAgent* backupAgent, Reference tr, UID logUid) { + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + state Key destUidKey = backupAgent->config.get(BinaryWriter::toValue(logUid, Unversioned())).pack(BackupAgentBase::destUid); + Optional destUid = wait(tr->get(destUidKey)); + + return (destUid.present()) ? BinaryReader::fromStringRef(destUid.get(), Unversioned()) : UID(); + } + ACTOR static Future getLogUid(DatabaseBackupAgent* backupAgent, Reference tr, Key tagName) { tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); tr->setOption(FDBTransactionOptions::LOCK_AWARE); @@ -1790,6 +1986,10 @@ Future DatabaseBackupAgent::getStateValue(Reference DatabaseBackupAgent::getDestUid(Reference tr, UID logUid) { + return DatabaseBackupAgentImpl::getDestUid(this, tr, logUid); +} + Future DatabaseBackupAgent::getLogUid(Reference tr, Key tagName) { return DatabaseBackupAgentImpl::getLogUid(this, tr, tagName); } diff --git a/fdbclient/FileBackupAgent.actor.cpp b/fdbclient/FileBackupAgent.actor.cpp index a79fe34d9f..b1a4530d36 100644 --- a/fdbclient/FileBackupAgent.actor.cpp +++ b/fdbclient/FileBackupAgent.actor.cpp @@ -807,7 +807,8 @@ namespace fileBackup { BackupConfig config, Reference waitFor = Reference(), std::function)> setupTaskFn = NOP_SETUP_TASK_FN, - int priority = 0) { + int priority = 0, + bool setValidation = true) { tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); tr->setOption(FDBTransactionOptions::LOCK_AWARE); @@ -815,7 +816,7 @@ namespace fileBackup { state Reference task(new Task(name, version, doneKey, priority)); // Bind backup config to new task - Void _ = wait(config.toTask(tr, task)); + Void _ = wait(config.toTask(tr, task, setValidation)); // Set task specific params setupTaskFn(task); @@ -1539,7 +1540,7 @@ namespace fileBackup { .detail("ScheduledVersion", scheduledVersion) .detail("BeginKey", range.begin.printable()) .detail("EndKey", range.end.printable()) - .suppressFor(2); + .suppressFor(2, true); } else { // This shouldn't happen because if the transaction was already done or if another execution @@ -1680,7 +1681,8 @@ namespace fileBackup { } } - state Standalone> ranges = getLogRanges(beginVersion, endVersion, config.getUidAsKey()); + Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); + state Standalone> ranges = 
getLogRanges(beginVersion, endVersion, destUidValue); if (ranges.size() > CLIENT_KNOBS->BACKUP_MAX_LOG_RANGES) { Params.addBackupLogRangeTasks().set(task, true); return Void(); @@ -1804,12 +1806,6 @@ namespace fileBackup { Void _ = wait(taskFuture->set(tr, taskBucket)); } - if(endVersion > beginVersion) { - Standalone> ranges = getLogRanges(beginVersion, endVersion, config.getUidAsKey()); - for (auto & rng : ranges) - tr->clear(rng); - } - Void _ = wait(taskBucket->finish(tr, task)); return Void(); } @@ -1819,11 +1815,146 @@ namespace fileBackup { const uint32_t BackupLogRangeTaskFunc::version = 1; REGISTER_TASKFUNC(BackupLogRangeTaskFunc); + struct EraseLogRangeTaskFunc : BackupTaskFuncBase { + static StringRef name; + static const uint32_t version; + StringRef getName() const { return name; }; + + static struct { + static TaskParam beginVersion() { + return LiteralStringRef(__FUNCTION__); + } + static TaskParam endVersion() { + return LiteralStringRef(__FUNCTION__); + } + static TaskParam backupDone() { + return LiteralStringRef(__FUNCTION__); + } + static TaskParam destUidValue() { + return LiteralStringRef(__FUNCTION__); + } + } Params; + + ACTOR static Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion) { + if (endVersion <= beginVersion) + return Void(); + + state Version currBeginVersion = beginVersion; + state Version currEndVersion; + state bool clearVersionHistory = false; + + while (currBeginVersion < endVersion) { + state Reference tr(new ReadYourWritesTransaction(cx)); + + loop{ + try { + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + if (backupDone && currEndVersion == endVersion) { + clearVersionHistory = true; + } + Void _ = wait(clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); + Void _ = wait(tr->commit()); + currBeginVersion = currEndVersion; + break; + } catch (Error &e) { + Void _ = wait(tr->onError(e)); + } + } + } + + return Void(); + } + + ACTOR static Future _execute(Database cx, Reference taskBucket, Reference futureBucket, Reference task) { + state Reference lock(new FlowLock(CLIENT_KNOBS->BACKUP_LOCK_BYTES)); + Void _ = wait(checkTaskVersion(cx, task, EraseLogRangeTaskFunc::name, EraseLogRangeTaskFunc::version)); + + state Version beginVersion = Params.beginVersion().get(task); + state Version endVersion = Params.endVersion().get(task); + state bool backupDone = Params.backupDone().get(task); + state Key destUidValue = Params.destUidValue().get(task); + + state BackupConfig config(task); + state Key logUidValue = config.getUidAsKey(); + + state Reference tr(new ReadYourWritesTransaction(cx)); + + loop { + try { + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + if (beginVersion == 0) { + Key latestVersionKey = logUidValue.withPrefix(destUidValue.withPrefix(backupLatestVersionsPrefix)); + + Optional bVersion = wait(tr->get(latestVersionKey)); + if (bVersion.present()) { + beginVersion = BinaryReader::fromStringRef(bVersion.get(), Unversioned()); + } else { + return Void(); + } + + Version eVersion = wait(tr->getReadVersion()); + endVersion = eVersion; + } + + break; + } catch (Error &e) { + Void _ = wait(tr->onError(e)); + } + } + + Void _ = wait(eraseLogData(cx, logUidValue, 
destUidValue, backupDone, beginVersion, endVersion)); + + return Void(); + } + + ACTOR static Future addTask(Reference tr, Reference taskBucket, UID logUid, TaskCompletionKey completionKey, bool backupDone, Key destUidValue, Version beginVersion = 0, Version endVersion = 0, Reference waitFor = Reference()) { + Key key = wait(addBackupTask(EraseLogRangeTaskFunc::name, + EraseLogRangeTaskFunc::version, + tr, taskBucket, completionKey, + BackupConfig(logUid), + waitFor, + [=](Reference task) { + Params.beginVersion().set(task, beginVersion); + Params.endVersion().set(task, endVersion); + Params.backupDone().set(task, backupDone); + Params.destUidValue().set(task, destUidValue); + }, + 0, false)); + + return key; + } + + + ACTOR static Future _finish(Reference tr, Reference taskBucket, Reference futureBucket, Reference task) { + state Reference taskFuture = futureBucket->unpack(task->params[Task::reservedTaskParamKeyDone]); + + Void _ = wait(taskFuture->set(tr, taskBucket) && taskBucket->finish(tr, task)); + + return Void(); + } + + Future execute(Database cx, Reference tb, Reference fb, Reference task) { return _execute(cx, tb, fb, task); }; + Future finish(Reference tr, Reference tb, Reference fb, Reference task) { return _finish(tr, tb, fb, task); }; + }; + StringRef EraseLogRangeTaskFunc::name = LiteralStringRef("file_backup_erase_logs"); + const uint32_t EraseLogRangeTaskFunc::version = 1; + REGISTER_TASKFUNC(EraseLogRangeTaskFunc); + + + struct BackupLogsDispatchTask : BackupTaskFuncBase { static StringRef name; static const uint32_t version; static struct { + static TaskParam prevBeginVersion() { + return LiteralStringRef(__FUNCTION__); + } static TaskParam beginVersion() { return LiteralStringRef(__FUNCTION__); } @@ -1836,6 +1967,7 @@ namespace fileBackup { tr->setOption(FDBTransactionOptions::LOCK_AWARE); state Reference onDone = task->getDoneFuture(futureBucket); + state Version prevBeginVersion = Params.prevBeginVersion().get(task); state Version beginVersion = Params.beginVersion().get(task); state BackupConfig config(task); config.latestLogEndVersion().set(tr, beginVersion); @@ -1883,7 +2015,13 @@ namespace fileBackup { // Add the next logs dispatch task which will run after this batch is done Key _ = wait(BackupLogRangeTaskFunc::addTask(tr, taskBucket, task, beginVersion, endVersion, TaskCompletionKey::joinWith(logDispatchBatchFuture))); - Key _ = wait(BackupLogsDispatchTask::addTask(tr, taskBucket, task, endVersion, TaskCompletionKey::signal(onDone), logDispatchBatchFuture)); + + // Do not erase at the first time + if (prevBeginVersion > 0) { + state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); + Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, config.getUid(), TaskCompletionKey::joinWith(logDispatchBatchFuture), false, destUidValue, prevBeginVersion, beginVersion)); + } + Key _ = wait(BackupLogsDispatchTask::addTask(tr, taskBucket, task, beginVersion, endVersion, TaskCompletionKey::signal(onDone), logDispatchBatchFuture)); Void _ = wait(taskBucket->finish(tr, task)); @@ -1896,13 +2034,14 @@ namespace fileBackup { return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version prevBeginVersion, Version beginVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { Key key = 
wait(addBackupTask(BackupLogsDispatchTask::name, BackupLogsDispatchTask::version, tr, taskBucket, completionKey, BackupConfig(parentTask), waitFor, [=](Reference task) { + Params.prevBeginVersion().set(task, prevBeginVersion); Params.beginVersion().set(task, beginVersion); })); return key; @@ -1930,11 +2069,12 @@ namespace fileBackup { state UID uid = backup.getUid(); state Key configPath = uidPrefixKey(logRangesRange.begin, uid); - state Key logsPath = uidPrefixKey(backupLogKeys.begin, uid); tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); tr->clear(KeyRangeRef(configPath, strinc(configPath))); - tr->clear(KeyRangeRef(logsPath, strinc(logsPath))); + state Key destUidValue = wait(backup.destUidValue().getOrThrow(tr)); + Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, backup.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); + backup.stateEnum().set(tr, EBackupState::STATE_COMPLETED); Void _ = wait(taskBucket->finish(tr, task)); @@ -2153,11 +2293,15 @@ namespace fileBackup { state BackupConfig config(task); state Version beginVersion = Params.beginVersion().get(task); - state std::vector backupRanges = wait(config.backupRanges().getOrThrow(tr)); + state Future> backupRangesFuture = config.backupRanges().getOrThrow(tr); + state Future destUidValueFuture = config.destUidValue().getOrThrow(tr); + Void _ = wait(success(backupRangesFuture) && success(destUidValueFuture)); + std::vector backupRanges = backupRangesFuture.get(); + Key destUidValue = destUidValueFuture.get(); // Start logging the mutations for the specified ranges of the tag for (auto &backupRange : backupRanges) { - config.startMutationLogs(tr, backupRange); + config.startMutationLogs(tr, backupRange, destUidValue); } config.stateEnum().set(tr, EBackupState::STATE_BACKUP); @@ -2168,7 +2312,7 @@ namespace fileBackup { // The initial snapshot has a desired duration of 0, meaning go as fast as possible. Void _ = wait(config.initNewSnapshot(tr, 0)); Key _ = wait(BackupSnapshotDispatchTask::addTask(tr, taskBucket, task, TaskCompletionKey::joinWith(backupFinished))); - Key _ = wait(BackupLogsDispatchTask::addTask(tr, taskBucket, task, beginVersion, TaskCompletionKey::joinWith(backupFinished))); + Key _ = wait(BackupLogsDispatchTask::addTask(tr, taskBucket, task, 0, beginVersion, TaskCompletionKey::joinWith(backupFinished))); // If a clean stop is requested, the log and snapshot tasks will quit after the backup is restorable, then the following // task will clean up and set the completed state. 
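The prevBeginVersion plumbing above is easier to follow outside the actor framework. Below is a minimal stand-alone C++ model, not FoundationDB code; `copyLogs`, `eraseLogs`, and the version constants are illustrative stand-ins, and a fixed window stands in for whatever versions each dispatch actually covers. It shows how each log dispatch round copies one version window and, except on the very first round, queues an erase of the window handled by the previous round, so already-copied mutation log data is trimmed incrementally instead of all at once when the backup stops.

```
// Minimal stand-alone model of the prevBeginVersion chaining (illustrative
// names; copyLogs/eraseLogs and the constants are stand-ins, not FDB code).
#include <cstdint>
#include <cstdio>

using Version = int64_t;

static void copyLogs(Version begin, Version end)  { std::printf("copy  [%lld, %lld)\n", (long long)begin, (long long)end); }
static void eraseLogs(Version begin, Version end) { std::printf("erase [%lld, %lld)\n", (long long)begin, (long long)end); }

int main() {
    const Version window = 100;        // stand-in for one dispatch batch of versions
    Version prevBeginVersion = 0;      // 0 on the first dispatch: nothing to erase yet
    Version beginVersion = 1000;       // stand-in for the backup's start version

    for (int round = 0; round < 4; ++round) {
        Version endVersion = beginVersion + window;
        copyLogs(beginVersion, endVersion);            // what BackupLogRangeTaskFunc covers
        if (prevBeginVersion > 0)                      // "Do not erase at the first time"
            eraseLogs(prevBeginVersion, beginVersion); // what EraseLogRangeTaskFunc covers
        prevBeginVersion = beginVersion;               // carried into the next dispatch task
        beginVersion = endVersion;
    }
    return 0;
}
```

In the patch itself this corresponds to the `EraseLogRangeTaskFunc::addTask(..., prevBeginVersion, beginVersion)` call joined with `logDispatchBatchFuture`, so the erase of the previous window only completes together with the batch that follows it.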
@@ -3307,6 +3451,21 @@ public: config.clear(tr); + state Key destUidValue(BinaryWriter::toValue(uid, Unversioned())); + if (normalizedRanges.size() == 1) { + state Key destUidLookupPath = BinaryWriter::toValue(normalizedRanges[0], IncludeVersion()).withPrefix(destUidLookupPrefix); + Optional existingDestUidValue = wait(tr->get(destUidLookupPath)); + if (existingDestUidValue.present()) { + destUidValue = existingDestUidValue.get(); + } else { + destUidValue = BinaryWriter::toValue(g_random->randomUniqueID(), Unversioned()); + tr->set(destUidLookupPath, destUidValue); + } + } + Version initVersion = 1; + tr->set(config.getUidAsKey().withPrefix(destUidValue).withPrefix(backupLatestVersionsPrefix), BinaryWriter::toValue(initVersion, Unversioned())); + config.destUidValue().set(tr, destUidValue); + // Point the tag to this new uid tag.set(tr, {uid, false}); @@ -3464,11 +3623,14 @@ public: Void _ = wait(tag.cancel(tr)); Key configPath = uidPrefixKey(logRangesRange.begin, config.getUid()); - Key logsPath = uidPrefixKey(backupLogKeys.begin, config.getUid()); tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); tr->clear(KeyRangeRef(configPath, strinc(configPath))); - tr->clear(KeyRangeRef(logsPath, strinc(logsPath))); + + state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); + state Version endVersion = wait(tr->getReadVersion()); + + Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); config.stateEnum().set(tr, EBackupState::STATE_COMPLETED); @@ -3494,6 +3656,7 @@ public: state UidAndAbortedFlagT current = wait(tag.getOrThrow(tr, false, backup_unneeded())); state BackupConfig config(current.first); + state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); EBackupState status = wait(config.stateEnum().getD(tr, EBackupState::STATE_NEVERRAN)); if (!backupAgent->isRunnable((BackupAgentBase::enumState)status)) { @@ -3508,10 +3671,9 @@ public: Void _ = wait(tag.cancel(tr)); Key configPath = uidPrefixKey(logRangesRange.begin, config.getUid()); - Key logsPath = uidPrefixKey(backupLogKeys.begin, config.getUid()); tr->clear(KeyRangeRef(configPath, strinc(configPath))); - tr->clear(KeyRangeRef(logsPath, strinc(logsPath))); + Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); config.stateEnum().set(tr, EBackupState::STATE_ABORTED); diff --git a/fdbclient/Knobs.cpp b/fdbclient/Knobs.cpp index ce959dfbfa..72514a4a3d 100644 --- a/fdbclient/Knobs.cpp +++ b/fdbclient/Knobs.cpp @@ -132,6 +132,7 @@ ClientKnobs::ClientKnobs(bool randomize) { init( BACKUP_ERROR_DELAY, 10.0 ); init( BACKUP_STATUS_DELAY, 40.0 ); init( BACKUP_STATUS_JITTER, 0.05 ); + init( CLEAR_LOG_RANGE_COUNT, 1500); // transaction size / (size of '\xff\x02/blog/' + size of UID + size of hash result) = 200,000 / (8 + 16 + 8) // Configuration init( DEFAULT_AUTO_PROXIES, 3 ); diff --git a/fdbclient/Knobs.h b/fdbclient/Knobs.h index e7a002a97f..f7a9424ba1 100644 --- a/fdbclient/Knobs.h +++ b/fdbclient/Knobs.h @@ -118,6 +118,7 @@ public: int BACKUP_COPY_TASKS; int BACKUP_BLOCK_SIZE; int BACKUP_TASKS_PER_AGENT; + int CLEAR_LOG_RANGE_COUNT; int SIM_BACKUP_TASKS_PER_AGENT; int BACKUP_RANGEFILE_BLOCK_SIZE; int BACKUP_LOGFILE_BLOCK_SIZE; diff --git a/fdbclient/SystemData.cpp b/fdbclient/SystemData.cpp index 6b7adb97df..cf73d00e8b 100644 --- a/fdbclient/SystemData.cpp +++ b/fdbclient/SystemData.cpp @@ -451,6 +451,11 @@ 
const KeyRangeRef fileBackupPrefixRange(LiteralStringRef("\xff\x02/backup-agent/ // DR Agent configuration constant variables const KeyRangeRef databaseBackupPrefixRange(LiteralStringRef("\xff\x02/db-backup-agent/"), LiteralStringRef("\xff\x02/db-backup-agent0")); +// \xff\x02/sharedLogRangesConfig/destUidLookup/[keyRange] +const KeyRef destUidLookupPrefix = LiteralStringRef("\xff\x02/sharedLogRangesConfig/destUidLookup/"); +// \xff\x02/sharedLogRangesConfig/backuplatestVersions/[destUid]/[logUid] +const KeyRef backupLatestVersionsPrefix = LiteralStringRef("\xff\x02/sharedLogRangesConfig/backupLatestVersions/"); + // Returns the encoded key comprised of begin key and log uid Key logRangesEncodeKey(KeyRef keyBegin, UID logUid) { return keyBegin.withPrefix(uidPrefixKey(logRangesRange.begin, logUid)); @@ -470,10 +475,10 @@ KeyRef logRangesDecodeKey(KeyRef key, UID* logUid) { return key.substr(logRangesRange.begin.size() + sizeof(UID)); } -// Returns the encoded key value comprised of the end key and destination prefix -Key logRangesEncodeValue(KeyRef keyEnd, KeyRef destKeyPrefix) { +// Returns the encoded key value comprised of the end key and destination path +Key logRangesEncodeValue(KeyRef keyEnd, KeyRef destPath) { BinaryWriter wr(IncludeVersion()); - wr << std::make_pair(keyEnd, destKeyPrefix); + wr << std::make_pair(keyEnd, destPath); return wr.toStringRef(); } diff --git a/fdbclient/SystemData.h b/fdbclient/SystemData.h index 6b8c27991c..a8497d14dc 100644 --- a/fdbclient/SystemData.h +++ b/fdbclient/SystemData.h @@ -166,7 +166,7 @@ KeyRef logRangesDecodeKey(KeyRef key, UID* logUid = NULL); Key logRangesDecodeValue(KeyRef keyValue, Key* destKeyPrefix = NULL); // Returns the encoded key value comprised of the end key and destination prefix -Key logRangesEncodeValue(KeyRef keyEnd, KeyRef destKeyPrefix); +Key logRangesEncodeValue(KeyRef keyEnd, KeyRef destPath); // Returns a key prefixed with the specified key with // the given uid encoded at the end @@ -219,6 +219,9 @@ extern const KeyRangeRef fileRestorePrefixRange; // Key range reserved by database backup agent to storing configuration and state information extern const KeyRangeRef databaseBackupPrefixRange; +extern const KeyRef destUidLookupPrefix; +extern const KeyRef backupLatestVersionsPrefix; + // Key range reserved by backup agent to storing mutations extern const KeyRangeRef backupLogKeys; extern const KeyRangeRef applyLogKeys; diff --git a/fdbrpc/simulator.h b/fdbrpc/simulator.h index 2971f769a9..42bd3a4b44 100644 --- a/fdbrpc/simulator.h +++ b/fdbrpc/simulator.h @@ -34,7 +34,7 @@ enum ClogMode { ClogDefault, ClogAll, ClogSend, ClogReceive }; class ISimulator : public INetwork { public: - ISimulator() : desiredCoordinators(1), physicalDatacenters(1), processesPerMachine(0), isStopped(false), lastConnectionFailure(0), connectionFailuresDisableDuration(0), speedUpSimulation(false), allSwapsDisabled(false), backupAgents(WaitForType), extraDB(NULL) {} + ISimulator() : desiredCoordinators(1), physicalDatacenters(1), processesPerMachine(0), isStopped(false), lastConnectionFailure(0), connectionFailuresDisableDuration(0), speedUpSimulation(false), allSwapsDisabled(false), backupAgents(WaitForType), drAgents(WaitForType), extraDB(NULL) {} // Order matters! 
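The two new prefixes above are the core of the shared log range scheme: `destUidLookup/[keyRange]` maps a backed-up key range to a shared destination uid, and `backupLatestVersions/[destUid]/[logUid]` records, roughly speaking, the begin version below which a given backup no longer needs the shared log. The sketch below models only the lookup side as `submitBackup` uses it; it is illustrative, not FoundationDB code, with a `std::map` standing in for the `destUidLookup` subspace and `newRandomUid` standing in for `g_random->randomUniqueID()`. Per the hunk above, only backups over a single normalized range participate in the lookup.

```
// Minimal stand-alone sketch of the destUid lookup done in submitBackup
// (illustrative names; a std::map stands in for the destUidLookup subspace).
#include <iostream>
#include <map>
#include <random>
#include <string>

static std::string newRandomUid() {                    // stand-in for a random UID
    static std::mt19937_64 rng{42};
    return "uid-" + std::to_string(rng());
}

// serialized key range -> shared destination uid
std::string resolveDestUid(std::map<std::string, std::string>& destUidLookup,
                           const std::string& normalizedRange) {
    auto it = destUidLookup.find(normalizedRange);
    if (it != destUidLookup.end())
        return it->second;                             // another backup of this range exists: share its uid
    std::string uid = newRandomUid();                  // first backup of this range creates the uid
    destUidLookup[normalizedRange] = uid;
    return uid;
}

int main() {
    std::map<std::string, std::string> destUidLookup;
    std::string fileBackup = resolveDestUid(destUidLookup, "normalKeys");  // e.g. a file backup of '' .. \xff
    std::string drBackup   = resolveDestUid(destUidLookup, "normalKeys");  // a DR over the same range
    std::cout << "shared destUid: " << (fileBackup == drBackup ? "yes" : "no") << "\n";
}
```

Because two consumers over the same range resolve to the same destUid, mutations are written once to the log keyspace prefixed by that uid (`destUidValue.withPrefix(backupLogKeys.begin)` in the patch), and each consumer's retention requirement lives under its own `backupLatestVersions/[destUid]/[logUid]` entry.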
enum KillType { KillInstantly, InjectFaults, RebootAndDelete, RebootProcessAndDelete, Reboot, RebootProcess, None }; @@ -298,6 +298,7 @@ public: double connectionFailuresDisableDuration; bool speedUpSimulation; BackupAgentType backupAgents; + BackupAgentType drAgents; virtual flowGlobalType global(int id) { return getCurrentProcess()->global(id); }; virtual void setGlobal(size_t id, flowGlobalType v) { getCurrentProcess()->setGlobal(id,v); }; diff --git a/fdbserver/SimulatedCluster.actor.cpp b/fdbserver/SimulatedCluster.actor.cpp index 95e93ea88d..e0db03a805 100644 --- a/fdbserver/SimulatedCluster.actor.cpp +++ b/fdbserver/SimulatedCluster.actor.cpp @@ -146,7 +146,19 @@ ACTOR Future runBackup( Reference connFile ) { it.cancel(); } } - else if (g_simulator.backupAgents == ISimulator::BackupToDB) { + + Void _= wait(Future(Never())); + throw internal_error(); +} + +ACTOR Future runDr( Reference connFile ) { + state std::vector> agentFutures; + + while (g_simulator.drAgents == ISimulator::WaitForType) { + Void _ = wait(delay(1.0)); + } + + if (g_simulator.drAgents == ISimulator::BackupToDB) { Reference cluster = Cluster::createCluster(connFile, -1); Database cx = cluster->createDatabase(LiteralStringRef("DB")).get(); @@ -154,7 +166,7 @@ ACTOR Future runBackup( Reference connFile ) { Reference extraCluster = Cluster::createCluster(extraFile, -1); state Database extraDB = extraCluster->createDatabase(LiteralStringRef("DB")).get(); - TraceEvent("StartingBackupAgents").detail("connFile", connFile->getConnectionString().toString()).detail("extraString", extraFile->getConnectionString().toString()); + TraceEvent("StartingDrAgents").detail("connFile", connFile->getConnectionString().toString()).detail("extraString", extraFile->getConnectionString().toString()); state DatabaseBackupAgent dbAgent = DatabaseBackupAgent(cx); state DatabaseBackupAgent extraAgent = DatabaseBackupAgent(extraDB); @@ -165,11 +177,11 @@ ACTOR Future runBackup( Reference connFile ) { agentFutures.push_back(extraAgent.run(cx, &dr1PollDelay, CLIENT_KNOBS->SIM_BACKUP_TASKS_PER_AGENT)); agentFutures.push_back(dbAgent.run(extraDB, &dr2PollDelay, CLIENT_KNOBS->SIM_BACKUP_TASKS_PER_AGENT)); - while (g_simulator.backupAgents == ISimulator::BackupToDB) { + while (g_simulator.drAgents == ISimulator::BackupToDB) { Void _ = wait(delay(1.0)); } - TraceEvent("StoppingBackupAgents"); + TraceEvent("StoppingDrAgents"); for(auto it : agentFutures) { it.cancel(); @@ -243,8 +255,9 @@ ACTOR Future simulatedFDBDRebooter( Future listen = FlowTransport::transport().bind( n, n ); Future fd = fdbd( connFile, localities, processClass, *dataFolder, *coordFolder, 500e6, "", ""); Future backup = runBackupAgents ? runBackup(connFile) : Future(Never()); + Future dr = runBackupAgents ? runDr(connFile) : Future(Never()); - Void _ = wait(listen || fd || success(onShutdown) || backup); + Void _ = wait(listen || fd || success(onShutdown) || backup || dr); } catch (Error& e) { // If in simulation, if we make it here with an error other than io_timeout but enASIOTimedOut is set then somewhere an io_timeout was converted to a different error. 
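With `drAgents` split out from `backupAgents`, a single simulated cluster can drive file backup agents and database (DR) agents independently, and the tester changes just below map one `simBackupAgents` test option onto both switches. The snippet below is a plain C++ restatement of that mapping with illustrative enum and function names; the real code sets `spec.simBackupAgents` and `spec.simDrAgents`.

```
// Plain C++ restatement of the simBackupAgents fan-out (illustrative names).
#include <cassert>
#include <string>
#include <utility>

enum class AgentType { NoBackupAgents, WaitForType, BackupToFile, BackupToDB };

// returns {simBackupAgents, simDrAgents} for one test option value
std::pair<AgentType, AgentType> parseSimBackupAgents(const std::string& value) {
    AgentType backup = (value == "BackupToFile" || value == "BackupToFileAndDB")
                           ? AgentType::BackupToFile : AgentType::NoBackupAgents;
    AgentType dr     = (value == "BackupToDB" || value == "BackupToFileAndDB")
                           ? AgentType::BackupToDB : AgentType::NoBackupAgents;
    return {backup, dr};
}

int main() {
    auto both = parseSimBackupAgents("BackupToFileAndDB");
    assert(both.first == AgentType::BackupToFile && both.second == AgentType::BackupToDB);

    auto drOnly = parseSimBackupAgents("BackupToDB");
    assert(drOnly.first == AgentType::NoBackupAgents && drOnly.second == AgentType::BackupToDB);
    return 0;
}
```

This is what lets the new shared-log tests request `simBackupAgents=BackupToFileAndDB` and exercise a file backup and a DR over overlapping ranges in the same simulation run.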
if(g_network->isSimulated() && e.code() != error_code_io_timeout && (bool)g_network->global(INetwork::enASIOTimedOut)) diff --git a/fdbserver/tester.actor.cpp b/fdbserver/tester.actor.cpp index ce619d172e..f8b952617d 100644 --- a/fdbserver/tester.actor.cpp +++ b/fdbserver/tester.actor.cpp @@ -944,13 +944,17 @@ vector readTests( ifstream& ifs ) { g_simulator.connectionFailuresDisableDuration = spec.simConnectionFailuresDisableDuration; TraceEvent("TestParserTest").detail("ParsedSimConnectionFailuresDisableDuration", spec.simConnectionFailuresDisableDuration); } else if( attrib == "simBackupAgents" ) { - if (value == "BackupToFile") + if (value == "BackupToFile" || value == "BackupToFileAndDB") spec.simBackupAgents = ISimulator::BackupToFile; - else if (value == "BackupToDB") - spec.simBackupAgents = ISimulator::BackupToDB; else spec.simBackupAgents = ISimulator::NoBackupAgents; TraceEvent("TestParserTest").detail("ParsedSimBackupAgents", spec.simBackupAgents); + + if (value == "BackupToDB" || value == "BackupToFileAndDB") + spec.simDrAgents = ISimulator::BackupToDB; + else + spec.simDrAgents = ISimulator::NoBackupAgents; + TraceEvent("TestParserTest").detail("ParsedSimDrAgents", spec.simDrAgents); } else if( attrib == "extraDB" ) { TraceEvent("TestParserTest").detail("ParsedExtraDB", ""); } else if( attrib == "minimumReplication" ) { @@ -1023,6 +1027,7 @@ ACTOR Future runTests( ReferenceuseDB ) useDB = true; @@ -1031,10 +1036,15 @@ ACTOR Future runTests( ReferencestartDelay ); databasePingDelay = std::min( databasePingDelay, iter->databasePingDelay ); if (iter->simBackupAgents != ISimulator::NoBackupAgents) simBackupAgents = iter->simBackupAgents; + + if (iter->simDrAgents != ISimulator::NoBackupAgents) { + simDrAgents = iter->simDrAgents; + } } if (g_network->isSimulated()) { g_simulator.backupAgents = simBackupAgents; + g_simulator.drAgents = simDrAgents; } // turn off the database ping functionality if the suite of tests are not going to be using the database diff --git a/fdbserver/workloads/AtomicSwitchover.actor.cpp b/fdbserver/workloads/AtomicSwitchover.actor.cpp index 60bae20c29..9020954855 100644 --- a/fdbserver/workloads/AtomicSwitchover.actor.cpp +++ b/fdbserver/workloads/AtomicSwitchover.actor.cpp @@ -174,8 +174,8 @@ struct AtomicSwitchoverWorkload : TestWorkload { TraceEvent("AS_Done"); // SOMEDAY: Remove after backup agents can exist quiescently - if (g_simulator.backupAgents == ISimulator::BackupToDB) { - g_simulator.backupAgents = ISimulator::NoBackupAgents; + if (g_simulator.drAgents == ISimulator::BackupToDB) { + g_simulator.drAgents = ISimulator::NoBackupAgents; } return Void(); diff --git a/fdbserver/workloads/BackupCorrectness.actor.cpp b/fdbserver/workloads/BackupCorrectness.actor.cpp index b2992cb0e0..705f3ce69f 100644 --- a/fdbserver/workloads/BackupCorrectness.actor.cpp +++ b/fdbserver/workloads/BackupCorrectness.actor.cpp @@ -37,6 +37,7 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { static int backupAgentRequests; bool locked; bool allowPauses; + bool shareLogRange; BackupAndRestoreCorrectnessWorkload(WorkloadContext const& wcx) : TestWorkload(wcx) { @@ -53,12 +54,21 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { differentialBackup ? 
g_random->random01() * (restoreAfter - std::max(abortAndRestartAfter,backupAfter)) + std::max(abortAndRestartAfter,backupAfter) : 0.0); agentRequest = getOption(options, LiteralStringRef("simBackupAgents"), true); allowPauses = getOption(options, LiteralStringRef("allowPauses"), true); + shareLogRange = getOption(options, LiteralStringRef("shareLogRange"), false); KeyRef beginRange; KeyRef endRange; UID randomID = g_nondeterministic_random->randomUniqueID(); - if(backupRangesCount <= 0) { + if (shareLogRange) { + if (g_random->random01() < 0.5) { + backupRanges.push_back_deep(backupRanges.arena(), normalKeys); + } else if (g_random->random01() < 0.75) { + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\x7f"))); + } else { + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(LiteralStringRef("\x7f"), normalKeys.end)); + } + } else if (backupRangesCount <= 0) { backupRanges.push_back_deep(backupRanges.arena(), normalKeys); } else { // Add backup ranges @@ -341,6 +351,7 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { state KeyBackedTag keyBackedTag = makeBackupTag(self->backupTag.toString()); UidAndAbortedFlagT uidFlag = wait(keyBackedTag.getOrThrow(cx)); state UID logUid = uidFlag.first; + state Key destUidValue = wait(BackupConfig(logUid).destUidValue().getD(cx)); state Reference lastBackupContainer = wait(BackupConfig(logUid).backupContainer().getD(cx)); // Occasionally start yet another backup that might still be running when we restore @@ -430,8 +441,10 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { } } - state Key backupAgentKey = uidPrefixKey(logRangesRange.begin, logUid); - state Key backupLogValuesKey = uidPrefixKey(backupLogKeys.begin, logUid); + state Key backupAgentKey = uidPrefixKey(logRangesRange.begin, logUid); + state Key backupLogValuesKey = destUidValue.withPrefix(backupLogKeys.begin); + state Key backupLatestVersionsPath = destUidValue.withPrefix(backupLatestVersionsPrefix); + state Key backupLatestVersionsKey = uidPrefixKey(backupLatestVersionsPath, logUid); state int displaySystemKeys = 0; // Ensure that there is no left over key within the backup subspace @@ -443,39 +456,7 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { try { tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - Standalone agentValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupAgentKey, strinc(backupAgentKey))), 100)); - // Error if the system keyspace for the backup tag is not empty - if (agentValues.size() > 0) { - displaySystemKeys ++; - printf("BackupCorrectnessLeftOverMutationKeys: (%d) %s\n", agentValues.size(), printable(backupAgentKey).c_str()); - TraceEvent(SevError, "BackupCorrectnessLeftOverMutationKeys", randomID).detail("backupTag", printable(self->backupTag)) - .detail("LeftOverKeys", agentValues.size()).detail("keySpace", printable(backupAgentKey)); - for (auto & s : agentValues) { - TraceEvent("BARW_LeftOverKey", randomID).detail("key", printable(StringRef(s.key.toString()))).detail("value", printable(StringRef(s.value.toString()))); - printf(" Key: %-50s Value: %s\n", printable(StringRef(s.key.toString())).c_str(), printable(StringRef(s.value.toString())).c_str()); - } - } - else { - printf("No left over backup agent configuration keys\n"); - } - - Standalone logValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupLogValuesKey, strinc(backupLogValuesKey))), 100)); - - // Error if the log/mutation keyspace for the backup tag is not empty - if (logValues.size() > 0) { - 
displaySystemKeys ++; - printf("BackupCorrectnessLeftOverLogKeys: (%d) %s\n", logValues.size(), printable(backupLogValuesKey).c_str()); - TraceEvent(SevError, "BackupCorrectnessLeftOverLogKeys", randomID).detail("backupTag", printable(self->backupTag)) - .detail("LeftOverKeys", logValues.size()).detail("keySpace", printable(backupLogValuesKey)); - for (auto & s : logValues) { - TraceEvent("BARW_LeftOverKey", randomID).detail("key", printable(StringRef(s.key.toString()))).detail("value", printable(StringRef(s.value.toString()))); - printf(" Key: %-50s Value: %s\n", printable(StringRef(s.key.toString())).c_str(), printable(StringRef(s.value.toString())).c_str()); - } - } - else { - printf("No left over backup log keys\n"); - } // Check the left over tasks // We have to wait for the list to empty since an abort and get status @@ -513,6 +494,48 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { printf("BackupCorrectnessLeftOverLogTasks: %ld\n", (long) taskCount); } + + + Standalone agentValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupAgentKey, strinc(backupAgentKey))), 100)); + + // Error if the system keyspace for the backup tag is not empty + if (agentValues.size() > 0) { + displaySystemKeys ++; + printf("BackupCorrectnessLeftOverMutationKeys: (%d) %s\n", agentValues.size(), printable(backupAgentKey).c_str()); + TraceEvent(SevError, "BackupCorrectnessLeftOverMutationKeys", randomID).detail("backupTag", printable(self->backupTag)) + .detail("LeftOverKeys", agentValues.size()).detail("keySpace", printable(backupAgentKey)); + for (auto & s : agentValues) { + TraceEvent("BARW_LeftOverKey", randomID).detail("key", printable(StringRef(s.key.toString()))).detail("value", printable(StringRef(s.value.toString()))); + printf(" Key: %-50s Value: %s\n", printable(StringRef(s.key.toString())).c_str(), printable(StringRef(s.value.toString())).c_str()); + } + } + else { + printf("No left over backup agent configuration keys\n"); + } + + Optional latestVersion = wait(tr->get(backupLatestVersionsKey)); + if (latestVersion.present()) { + TraceEvent(SevError, "BackupCorrectnessLeftOverVersionKey", randomID).detail("backupTag", printable(self->backupTag)).detail("backupLatestVersionsKey", backupLatestVersionsKey.printable()).detail("destUidValue", destUidValue.printable()); + } else { + printf("No left over backup version key\n"); + } + + Standalone versions = wait(tr->getRange(KeyRange(KeyRangeRef(backupLatestVersionsPath, strinc(backupLatestVersionsPath))), 1)); + if (!self->shareLogRange || !versions.size()) { + Standalone logValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupLogValuesKey, strinc(backupLogValuesKey))), 100)); + + // Error if the log/mutation keyspace for the backup tag is not empty + if (logValues.size() > 0) { + displaySystemKeys ++; + printf("BackupCorrectnessLeftOverLogKeys: (%d) %s\n", logValues.size(), printable(backupLogValuesKey).c_str()); + TraceEvent(SevError, "BackupCorrectnessLeftOverLogKeys", randomID).detail("backupTag", printable(self->backupTag)) + .detail("LeftOverKeys", logValues.size()).detail("keySpace", printable(backupLogValuesKey)); + } + else { + printf("No left over backup log keys\n"); + } + } + break; } catch (Error &e) { diff --git a/fdbserver/workloads/BackupToDBAbort.actor.cpp b/fdbserver/workloads/BackupToDBAbort.actor.cpp index d5531d7e22..df8640de6e 100644 --- a/fdbserver/workloads/BackupToDBAbort.actor.cpp +++ b/fdbserver/workloads/BackupToDBAbort.actor.cpp @@ -86,8 +86,8 @@ struct BackupToDBAbort : TestWorkload { TraceEvent("BDBA_End"); 
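The reordered correctness check above reflects the shared log keyspace: after an abort or completion, the backup's own `backupLatestVersions/[destUid]/[logUid]` entry must be gone, but raw keys under the shared mutation log prefix may legitimately remain if another backup still holds a version entry for the same destUid. Below is a minimal stand-alone sketch of that decision, with illustrative names and a `std::map` in place of the versions subspace.

```
// Stand-alone sketch of the reordered leftover-key check (illustrative names;
// a std::map stands in for the backupLatestVersions/[destUid] subspace).
#include <iostream>
#include <map>
#include <string>

using LatestVersions = std::map<std::string, long long>;   // logUid -> begin version for one destUid

// Mirrors: only scan blog/[destUid] for leftovers when shareLogRange is off
// or no other backup still registers a begin version under this destUid.
bool shouldCheckLeftoverLogKeys(const LatestVersions& latestVersions, bool shareLogRange) {
    return !shareLogRange || latestVersions.empty();
}

int main() {
    LatestVersions latest = { {"otherBackupLogUid", 12345} };       // another backup still shares the logs
    std::cout << shouldCheckLeftoverLogKeys(latest, true) << "\n";  // 0: leftovers are expected, skip
    latest.clear();
    std::cout << shouldCheckLeftoverLogKeys(latest, true) << "\n";  // 1: logs must be fully cleared
}
```

In the workload this corresponds to reading a single key from `backupLatestVersionsPath` and skipping the `backupLogValuesKey` scan when `shareLogRange` is set and that read comes back non-empty.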
// SOMEDAY: Remove after backup agents can exist quiescently - if (g_simulator.backupAgents == ISimulator::BackupToDB) { - g_simulator.backupAgents = ISimulator::NoBackupAgents; + if (g_simulator.drAgents == ISimulator::BackupToDB) { + g_simulator.drAgents = ISimulator::NoBackupAgents; } return Void(); diff --git a/fdbserver/workloads/BackupToDBCorrectness.actor.cpp b/fdbserver/workloads/BackupToDBCorrectness.actor.cpp index 0368a30645..f9cb1bfc1e 100644 --- a/fdbserver/workloads/BackupToDBCorrectness.actor.cpp +++ b/fdbserver/workloads/BackupToDBCorrectness.actor.cpp @@ -34,9 +34,11 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { int backupRangesCount, backupRangeLengthMax; bool differentialBackup, performRestore, agentRequest; Standalone> backupRanges; - static int backupAgentRequests; + static int drAgentRequests; Database extraDB; bool locked; + bool shareLogRange; + UID destUid; BackupToDBCorrectnessWorkload(WorkloadContext const& wcx) : TestWorkload(wcx) { @@ -53,7 +55,8 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { differentialBackup = getOption(options, LiteralStringRef("differentialBackup"), g_random->random01() < 0.5 ? true : false); stopDifferentialAfter = getOption(options, LiteralStringRef("stopDifferentialAfter"), differentialBackup ? g_random->random01() * (restoreAfter - std::max(abortAndRestartAfter,backupAfter)) + std::max(abortAndRestartAfter,backupAfter) : 0.0); - agentRequest = getOption(options, LiteralStringRef("simBackupAgents"), true); + agentRequest = getOption(options, LiteralStringRef("simDrAgents"), true); + shareLogRange = getOption(options, LiteralStringRef("shareLogRange"), false); beforePrefix = g_random->random01() < 0.5; if (beforePrefix) { @@ -72,7 +75,15 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { KeyRef endRange; UID randomID = g_nondeterministic_random->randomUniqueID(); - if(backupRangesCount <= 0) { + if (shareLogRange) { + if (g_random->random01() < 0.5) { + backupRanges.push_back_deep(backupRanges.arena(), normalKeys); + } else if (g_random->random01() < 0.75) { + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\x7f"))); + } else { + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(LiteralStringRef("\x7f"), normalKeys.end)); + } + } else if(backupRangesCount <= 0) { if (beforePrefix) backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, std::min(backupPrefix, extraPrefix))); else @@ -249,6 +260,8 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { submitted.send(Void()); + state UID logUid = wait(backupAgent->getLogUid(cx, tag)); + // Stop the differential backup, if enabled if (stopDifferentialDelay) { TEST(!stopDifferentialFuture.isReady()); //Restore starts at specified time @@ -261,7 +274,6 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { TraceEvent("BARW_doBackup waitForRestorable", randomID).detail("tag", printable(tag)); // Wait until the backup is in a restorable state state int resultWait = wait(backupAgent->waitBackup(cx, tag, false)); - state UID logUid = wait(backupAgent->getLogUid(cx, tag)); TraceEvent("BARW_lastBackupFolder", randomID).detail("backupTag", printable(tag)) .detail("logUid", logUid).detail("waitStatus", resultWait); @@ -296,6 +308,10 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { // Wait for the backup to complete TraceEvent("BARW_doBackup waitBackup", randomID).detail("tag", printable(tag)); + + UID _destUid = wait(backupAgent->getDestUid(cx, logUid)); + self->destUid = 
_destUid; + state int statusValue = wait(backupAgent->waitBackup(cx, tag, true)); Void _ = wait(backupAgent->unlockBackup(cx, tag)); @@ -311,9 +327,11 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { return Void(); } - ACTOR static Future checkData(Database cx, UID logUid, UID randomID, Key tag, DatabaseBackupAgent* backupAgent) { + ACTOR static Future checkData(Database cx, UID logUid, UID destUid, UID randomID, Key tag, DatabaseBackupAgent* backupAgent, bool shareLogRange) { state Key backupAgentKey = uidPrefixKey(logRangesRange.begin, logUid); - state Key backupLogValuesKey = uidPrefixKey(backupLogKeys.begin, logUid); + state Key backupLogValuesKey = uidPrefixKey(backupLogKeys.begin, destUid); + state Key backupLatestVersionsPath = uidPrefixKey(backupLatestVersionsPrefix, destUid); + state Key backupLatestVersionsKey = uidPrefixKey(backupLatestVersionsPath, logUid); state int displaySystemKeys = 0; // Ensure that there is no left over key within the backup subspace @@ -378,21 +396,31 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { printf("No left over backup agent configuration keys\n"); } - Standalone logValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupLogValuesKey, strinc(backupLogValuesKey))), 100)); - - // Error if the log/mutation keyspace for the backup tag is not empty - if (logValues.size() > 0) { - displaySystemKeys++; - printf("BackupCorrectnessLeftOverLogKeys: (%d) %s\n", logValues.size(), printable(backupLogValuesKey).c_str()); - TraceEvent(SevError, "BackupCorrectnessLeftOverLogKeys", randomID).detail("backupTag", printable(tag)) - .detail("LeftOverKeys", logValues.size()).detail("keySpace", printable(backupLogValuesKey)).detail("version", decodeBKMutationLogKey(logValues[0].key).first); - for (auto & s : logValues) { - TraceEvent("BARW_LeftOverKey", randomID).detail("key", printable(StringRef(s.key.toString()))).detail("value", printable(StringRef(s.value.toString()))); - printf(" Key: %-50s Value: %s\n", printable(StringRef(s.key.toString())).c_str(), printable(StringRef(s.value.toString())).c_str()); - } + Optional latestVersion = wait(tr->get(backupLatestVersionsKey)); + if (latestVersion.present()) { + TraceEvent(SevError, "BackupCorrectnessLeftOverVersionKey", randomID).detail("backupTag", printable(tag)).detail("key", backupLatestVersionsKey.printable()).detail("value", BinaryReader::fromStringRef(latestVersion.get(), Unversioned())); + } else { + printf("No left over backup version key\n"); } - else { - printf("No left over backup log keys\n"); + + Standalone versions = wait(tr->getRange(KeyRange(KeyRangeRef(backupLatestVersionsPath, strinc(backupLatestVersionsPath))), 1)); + if (!shareLogRange || !versions.size()) { + Standalone logValues = wait(tr->getRange(KeyRange(KeyRangeRef(backupLogValuesKey, strinc(backupLogValuesKey))), 100)); + + // Error if the log/mutation keyspace for the backup tag is not empty + if (logValues.size() > 0) { + displaySystemKeys++; + printf("BackupCorrectnessLeftOverLogKeys: (%d) %s\n", logValues.size(), printable(backupLogValuesKey).c_str()); + TraceEvent(SevError, "BackupCorrectnessLeftOverLogKeys", randomID).detail("backupTag", printable(tag)) + .detail("LeftOverKeys", logValues.size()).detail("keySpace", printable(backupLogValuesKey)).detail("version", decodeBKMutationLogKey(logValues[0].key).first); + for (auto & s : logValues) { + TraceEvent("BARW_LeftOverKey", randomID).detail("key", printable(StringRef(s.key.toString()))).detail("value", printable(StringRef(s.value.toString()))); + printf(" Key: %-50s Value: 
%s\n", printable(StringRef(s.key.toString())).c_str(), printable(StringRef(s.value.toString())).c_str()); + } + } + else { + printf("No left over backup log keys\n"); + } } break; @@ -421,7 +449,7 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { // Increment the backup agent requets if (self->agentRequest) { - BackupToDBCorrectnessWorkload::backupAgentRequests++; + BackupToDBCorrectnessWorkload::drAgentRequests++; } try{ @@ -526,23 +554,23 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { } } - Void _ = wait( checkData(self->extraDB, logUid, randomID, self->backupTag, &backupAgent) ); + Void _ = wait( checkData(self->extraDB, logUid, self->destUid, randomID, self->backupTag, &backupAgent, self->shareLogRange) ); if (self->performRestore) { state UID restoreUid = wait(backupAgent.getLogUid(self->extraDB, self->restoreTag)); - Void _ = wait( checkData(cx, restoreUid, randomID, self->restoreTag, &restoreAgent) ); + Void _ = wait( checkData(cx, restoreUid, restoreUid, randomID, self->restoreTag, &restoreAgent, self->shareLogRange) ); } TraceEvent("BARW_complete", randomID).detail("backupTag", printable(self->backupTag)); // Decrement the backup agent requets if (self->agentRequest) { - BackupToDBCorrectnessWorkload::backupAgentRequests--; + BackupToDBCorrectnessWorkload::drAgentRequests--; } // SOMEDAY: Remove after backup agents can exist quiescently - if ((g_simulator.backupAgents == ISimulator::BackupToDB) && (!BackupToDBCorrectnessWorkload::backupAgentRequests)) { - g_simulator.backupAgents = ISimulator::NoBackupAgents; + if ((g_simulator.drAgents == ISimulator::BackupToDB) && (!BackupToDBCorrectnessWorkload::drAgentRequests)) { + g_simulator.drAgents = ISimulator::NoBackupAgents; } } catch (Error& e) { @@ -554,6 +582,6 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { } }; -int BackupToDBCorrectnessWorkload::backupAgentRequests = 0; +int BackupToDBCorrectnessWorkload::drAgentRequests = 0; WorkloadFactory BackupToDBCorrectnessWorkloadFactory("BackupToDBCorrectness"); diff --git a/fdbserver/workloads/workloads.h b/fdbserver/workloads/workloads.h index 6a1db46d71..3544329377 100644 --- a/fdbserver/workloads/workloads.h +++ b/fdbserver/workloads/workloads.h @@ -164,6 +164,7 @@ public: simCheckRelocationDuration = false; simConnectionFailuresDisableDuration = 0; simBackupAgents = ISimulator::NoBackupAgents; + simDrAgents = ISimulator::NoBackupAgents; } TestSpec( StringRef title, bool dump, bool clear, double startDelay = 30.0, bool useDB = true, double databasePingDelay = -1.0 ) : title( title ), dumpAfterTest( dump ), @@ -171,7 +172,7 @@ public: useDB( useDB ), timeout( 600 ), databasePingDelay( databasePingDelay ), runConsistencyCheck( g_network->isSimulated() ), waitForQuiescenceBegin( true ), waitForQuiescenceEnd( true ), simCheckRelocationDuration( false ), - simConnectionFailuresDisableDuration( 0 ), simBackupAgents( ISimulator::NoBackupAgents ) { + simConnectionFailuresDisableDuration( 0 ), simBackupAgents( ISimulator::NoBackupAgents ), simDrAgents( ISimulator::NoBackupAgents ) { phases = TestWorkload::SETUP | TestWorkload::EXECUTION | TestWorkload::CHECK | TestWorkload::METRICS; if( databasePingDelay < 0 ) databasePingDelay = g_network->isSimulated() ? 0.0 : 15.0; @@ -193,6 +194,7 @@ public: bool simCheckRelocationDuration; //If set to true, then long duration relocations generate SevWarnAlways messages. Once any workload sets this to true, it will be true for the duration of the program. Can only be used in simulation. 
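The rename from `backupAgentRequests` to `drAgentRequests` keeps the existing quiescence workaround pointed at the new `drAgents` switch: each DR workload that asked for agents bumps the counter, and the last one to finish turns the simulated DR agents off. A small stand-alone model of that gate follows; the names are illustrative and no simulator is involved.

```
// Toy model of the drAgentRequests gate (illustrative names).
#include <iostream>

enum class AgentType { NoBackupAgents, BackupToDB };
struct Sim { AgentType drAgents = AgentType::BackupToDB; };

int drAgentRequests = 0;

void workloadStart() { ++drAgentRequests; }

void workloadFinish(Sim& sim) {
    --drAgentRequests;
    // SOMEDAY in the source: remove once agents can exist quiescently
    if (sim.drAgents == AgentType::BackupToDB && drAgentRequests == 0)
        sim.drAgents = AgentType::NoBackupAgents;
}

int main() {
    Sim sim;
    workloadStart(); workloadStart();                              // two concurrent DR workloads
    workloadFinish(sim);
    std::cout << (sim.drAgents == AgentType::BackupToDB) << "\n";  // 1: agents stay on
    workloadFinish(sim);
    std::cout << (sim.drAgents == AgentType::BackupToDB) << "\n";  // 0: last workload switched them off
}
```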
double simConnectionFailuresDisableDuration; ISimulator::BackupAgentType simBackupAgents; //If set to true, then the simulation runs backup agents on the workers. Can only be used in simulation. + ISimulator::BackupAgentType simDrAgents; }; Future runWorkload( diff --git a/tests/slow/SharedBackupCorrectness.txt b/tests/slow/SharedBackupCorrectness.txt new file mode 100644 index 0000000000..035f40ad96 --- /dev/null +++ b/tests/slow/SharedBackupCorrectness.txt @@ -0,0 +1,25 @@ +testTitle=BackupAndRestore + testName=Cycle + nodeCount=3000 + transactionsPerSecond=500.0 + testDuration=30.0 + expectedRate=0 + clearAfterTest=false + + testName=BackupAndRestoreCorrectness + backupTag=backup2 + backupAfter=20.0 + clearAfterTest=false + simBackupAgents=BackupToFileAndDB + shareLogRange=true + performRestore=false + + testName=BackupToDBCorrectness + backupTag=backup3 + backupPrefix=b1 + backupAfter=15.0 + restoreAfter=60.0 + performRestore=false + clearAfterTest=false + simBackupAgents=BackupToFileAndDB + shareLogRange=true \ No newline at end of file From f12c1d811c2fd90222394b9bfeb3a745c2c0eb0c Mon Sep 17 00:00:00 2001 From: Yichi Chiang Date: Tue, 13 Mar 2018 11:21:24 -0700 Subject: [PATCH 021/127] Fix all review comments --- fdbclient/BackupAgent.h | 5 +- fdbclient/BackupAgentBase.actor.cpp | 90 ++++++++-- fdbclient/DatabaseBackupAgent.actor.cpp | 155 +++++------------- fdbclient/FileBackupAgent.actor.cpp | 44 +---- .../workloads/BackupCorrectness.actor.cpp | 12 +- .../workloads/BackupToDBCorrectness.actor.cpp | 16 +- tests/slow/SharedBackupCorrectness.txt | 13 +- tests/slow/SharedBackupToDBCorrectness.txt | 27 +++ 8 files changed, 170 insertions(+), 192 deletions(-) create mode 100644 tests/slow/SharedBackupToDBCorrectness.txt diff --git a/fdbclient/BackupAgent.h b/fdbclient/BackupAgent.h index 54e04ce7cb..8429fa55d0 100644 --- a/fdbclient/BackupAgent.h +++ b/fdbclient/BackupAgent.h @@ -58,6 +58,7 @@ public: static const Key keyEndKey; static const Key destUid; static const Key backupDone; + static const Key backupStartVersion; static const Key keyTagName; static const Key keyStates; @@ -420,7 +421,7 @@ bool copyParameter(Reference source, Reference dest, Key key); Version getVersionFromString(std::string const& value); Standalone> getLogRanges(Version beginVersion, Version endVersion, Key destUidValue, int blockSize = CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE); Standalone> getApplyRanges(Version beginVersion, Version endVersion, Key backupUid); -Future clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion); +Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid = false, Version backupUid = 0); Key getApplyKey( Version version, Key backupUid ); std::pair decodeBKMutationLogKey(Key key); Standalone> decodeBackupLogValue(StringRef value); @@ -740,7 +741,7 @@ public: void startMutationLogs(Reference tr, KeyRangeRef backupRange, Key destUidValue) { Key mutationLogsDestKey = destUidValue.withPrefix(backupLogKeys.begin); - tr->set(logRangesEncodeKey(backupRange.begin, getUid()), logRangesEncodeValue(backupRange.end, mutationLogsDestKey)); + tr->set(logRangesEncodeKey(backupRange.begin, BinaryReader::fromStringRef(destUidValue, Unversioned())), logRangesEncodeValue(backupRange.end, mutationLogsDestKey)); } Future logError(Database cx, Error e, std::string details, void *taskInstance = nullptr) { diff --git 
a/fdbclient/BackupAgentBase.actor.cpp b/fdbclient/BackupAgentBase.actor.cpp index c33d131101..c0f71cd3bf 100644 --- a/fdbclient/BackupAgentBase.actor.cpp +++ b/fdbclient/BackupAgentBase.actor.cpp @@ -37,6 +37,7 @@ const Key BackupAgentBase::keyBeginKey = LiteralStringRef("beginKey"); const Key BackupAgentBase::keyEndKey = LiteralStringRef("endKey"); const Key BackupAgentBase::destUid = LiteralStringRef("destUid"); const Key BackupAgentBase::backupDone = LiteralStringRef("backupDone"); +const Key BackupAgentBase::backupStartVersion = LiteralStringRef("backupStartVersion"); const Key BackupAgentBase::keyTagName = LiteralStringRef("tagname"); const Key BackupAgentBase::keyStates = LiteralStringRef("state"); @@ -633,28 +634,41 @@ ACTOR Future _clearLogRanges(Reference tr, bool tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); tr->setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait(tr->get(backupLatestVersionsKey)); - if (!v.present()) { + state Standalone backupVersions = wait(tr->getRange(KeyRangeRef(backupLatestVersionsPath, strinc(backupLatestVersionsPath)), CLIENT_KNOBS->TOO_MANY)); + + // Make sure version history key does exist and lower the beginVersion if needed + bool foundSelf = false; + for (auto backupVersion : backupVersions) { + Key currLogUidValue = backupVersion.key.removePrefix(backupLatestVersionsPrefix).removePrefix(destUidValue); + + if (currLogUidValue == logUidValue) { + foundSelf = true; + beginVersion = std::min(beginVersion, BinaryReader::fromStringRef(backupVersion.value, Unversioned())); + } + } + + // Do not clear anything if version history key cannot be found + if (!foundSelf) { return Void(); } - state Standalone backupVersions = wait(tr->getRange(KeyRangeRef(backupLatestVersionsPath, strinc(backupLatestVersionsPath)), CLIENT_KNOBS->TOO_MANY)); - Version nextSmallestVersion = endVersion; + // If clear version history is required, then we need to clear log ranges up to next smallest version which might be greater than endVersion + // If size of backupVersions is greater than 1, we can definitely find a version less than INTMAX_MAX, otherwise we clear all log ranges without calling getLogRanges() + Version nextSmallestVersion = clearVersionHistory ? 
INTMAX_MAX : endVersion; bool clearLogRangesRequired = true; // More than one backup/DR with the same range if (backupVersions.size() > 1) { - bool countSelf = false; - for (auto backupVersion : backupVersions) { + Key currLogUidValue = backupVersion.key.removePrefix(backupLatestVersionsPrefix).removePrefix(destUidValue); Version currVersion = BinaryReader::fromStringRef(backupVersion.value, Unversioned()); - if (currVersion > beginVersion) { - if (currVersion < nextSmallestVersion) { - nextSmallestVersion = currVersion; - } - } else if (currVersion == beginVersion && !countSelf) { - countSelf = true; + + if (currLogUidValue == logUidValue) { + continue; + } else if (currVersion > beginVersion) { + nextSmallestVersion = std::min(currVersion, nextSmallestVersion); } else { + // If we can find a version less than or equal to beginVersion, clearing log ranges is not required clearLogRangesRequired = false; break; } @@ -662,8 +676,14 @@ ACTOR Future _clearLogRanges(Reference tr, bool } if (clearVersionHistory && backupVersions.size() == 1) { + // Clear version history tr->clear(prefixRange(backupLatestVersionsPath)); + + // Clear everything under blog/[destUid] tr->clear(prefixRange(destUidValue.withPrefix(backupLogKeys.begin))); + + // Disable committing mutations into blog + tr->clear(prefixRange(destUidValue.withPrefix(logRangesRange.begin))); } else { if (clearVersionHistory) { // Clear current backup version history @@ -685,6 +705,52 @@ ACTOR Future _clearLogRanges(Reference tr, bool return Void(); } +// The difference between beginVersion and endVersion should not be too large Future clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion) { return _clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, beginVersion, endVersion); +} + +ACTOR static Future _eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid, Version backupUid) { + if (endVersion <= beginVersion) + return Void(); + + state Version currBeginVersion = beginVersion; + state Version currEndVersion; + state bool clearVersionHistory = false; + + while (currBeginVersion < endVersion) { + state Reference tr(new ReadYourWritesTransaction(cx)); + + loop{ + try { + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); + tr->setOption(FDBTransactionOptions::LOCK_AWARE); + tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); + + if (checkBackupUid) { + Subspace sourceStates = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceStates).get(logUidValue); + Optional v = wait( tr->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); + if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > backupUid) + return Void(); + } + + if (backupDone && currEndVersion == endVersion) { + clearVersionHistory = true; + } + + Void _ = wait(clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); + Void _ = wait(tr->commit()); + currBeginVersion = currEndVersion; + break; + } catch (Error &e) { + Void _ = wait(tr->onError(e)); + } + } + } + + return Void(); +} + +Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid, Version backupUid) { + return _eraseLogData(cx, logUidValue, destUidValue, backupDone, beginVersion, 
endVersion, checkBackupUid, backupUid); } \ No newline at end of file diff --git a/fdbclient/DatabaseBackupAgent.actor.cpp b/fdbclient/DatabaseBackupAgent.actor.cpp index 71120f76e9..ed6f20d52b 100644 --- a/fdbclient/DatabaseBackupAgent.actor.cpp +++ b/fdbclient/DatabaseBackupAgent.actor.cpp @@ -470,40 +470,6 @@ namespace dbBackup { Future execute(Database cx, Reference tb, Reference fb, Reference task) { return _execute(cx, tb, fb, task); }; Future finish(Reference tr, Reference tb, Reference fb, Reference task) { return _finish(tr, tb, fb, task); }; - ACTOR static Future eraseLogData(Database cx, Reference task, bool backupDone, Version beginVersion, Version endVersion) { - if (endVersion <= beginVersion) - return Void(); - - state Version currBeginVersion = beginVersion; - state Version currEndVersion; - state bool clearVersionHistory = false; - - while (currBeginVersion < endVersion) { - state Reference tr(new ReadYourWritesTransaction(cx)); - - loop{ - try { - currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); - tr->setOption(FDBTransactionOptions::LOCK_AWARE); - tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - - if (backupDone && currEndVersion == endVersion) { - clearVersionHistory = true; - } - - Void _ = wait(clearLogRanges(tr, clearVersionHistory, task->params[BackupAgentBase::keyConfigLogUid], task->params[BackupAgentBase::destUid], currBeginVersion, currEndVersion)); - Void _ = wait(tr->commit()); - currBeginVersion = currEndVersion; - break; - } catch (Error &e) { - Void _ = wait(tr->onError(e)); - } - } - } - - return Void(); - } - ACTOR static Future _execute(Database cx, Reference taskBucket, Reference futureBucket, Reference task) { state FlowLock lock(CLIENT_KNOBS->BACKUP_LOCK_BYTES); @@ -513,7 +479,7 @@ namespace dbBackup { Version endVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyEndVersion], Unversioned()); bool backupDone = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::backupDone], Unversioned()); - Void _ = wait(eraseLogData(taskBucket->src, task, backupDone, beginVersion, endVersion)); + Void _ = wait(eraseLogData(taskBucket->src, task->params[BackupAgentBase::keyConfigLogUid], task->params[BackupAgentBase::destUid], backupDone, beginVersion, endVersion, false, BinaryReader::fromStringRef(task->params[BackupAgentBase::keyFolderId], Unversioned()))); return Void(); } @@ -875,9 +841,6 @@ namespace dbBackup { if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) return Void(); - Key configPath = logUidValue.withPrefix(logRangesRange.begin); - tr->clear(KeyRangeRef(configPath, strinc(configPath))); - state Key latestVersionKey = logUidValue.withPrefix(task->params[BackupAgentBase::destUid].withPrefix(backupLatestVersionsPrefix)); state Optional bVersion = wait(tr->get(latestVersionKey)); @@ -886,48 +849,15 @@ namespace dbBackup { } beginVersion = BinaryReader::fromStringRef(bVersion.get(), Unversioned()); - Void _ = wait(tr->commit()); - endVersion = tr->getCommittedVersion(); + endVersion = tr->getReadVersion().get(); break; } catch(Error &e) { Void _ = wait(tr->onError(e)); } } - state bool clearVersionHistory = false; - state Version currBeginVersion = beginVersion; - state Version currEndVersion; - - while (currBeginVersion < endVersion) { - state Reference clearSrcTr(new ReadYourWritesTransaction(taskBucket->src)); - - 
loop{ - try { - clearSrcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - clearSrcTr->setOption(FDBTransactionOptions::LOCK_AWARE); - Optional v = wait( clearSrcTr->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); - if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) > BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) - return Void(); - - currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); - - if (currEndVersion == endVersion) { - clearVersionHistory = true; - } - - - - - Void _ = wait(clearLogRanges(clearSrcTr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); - Void _ = wait(clearSrcTr->commit()); - currBeginVersion = currEndVersion; - break; - } catch (Error &e) { - Void _ = wait(clearSrcTr->onError(e)); - } - } - } - + Version backupUid = BinaryReader::fromStringRef(task->params[BackupAgentBase::keyFolderId], Unversioned()); + Void _ = wait(eraseLogData(taskBucket->src, logUidValue, destUidValue, true, beginVersion, endVersion, true, backupUid)); return Void(); } @@ -1072,6 +1002,15 @@ namespace dbBackup { tr.setOption(FDBTransactionOptions::LOCK_AWARE); tr.addReadConflictRange(singleKeyRange(sourceStates.pack(DatabaseBackupAgent::keyStateStatus))); tr.set(sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_DIFFERENTIAL))); + + Key versionKey = task->params[DatabaseBackupAgent::keyConfigLogUid].withPrefix(task->params[BackupAgentBase::destUid]).withPrefix(backupLatestVersionsPrefix); + Optional prevBeginVersion = wait(tr.get(versionKey)); + if (!prevBeginVersion.present()) { + return Void(); + } + + task->params[DatabaseBackupAgent::keyPrevBeginVersion] = prevBeginVersion.get(); + Void _ = wait(tr.commit()); return Void(); } @@ -1107,7 +1046,9 @@ namespace dbBackup { tr->set(states.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_DIFFERENTIAL))); allPartsDone = futureBucket->future(tr); - Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, 0, restoreVersion, TaskCompletionKey::joinWith(allPartsDone))); + + Version prevBeginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyPrevBeginVersion], Unversioned()); + Key _ = wait(CopyDiffLogsTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, restoreVersion, TaskCompletionKey::joinWith(allPartsDone))); // After the Backup completes, clear the backup subspace and update the status Key _ = wait(FinishedFullBackupTaskFunc::addTask(tr, taskBucket, task, TaskCompletionKey::noSignal(), allPartsDone)); @@ -1153,8 +1094,9 @@ namespace dbBackup { state UID logUid = BinaryReader::fromStringRef(logUidValue, Unversioned()); state Standalone> backupRanges = BinaryReader::fromStringRef>>(task->params[DatabaseBackupAgent::keyConfigBackupRanges], IncludeVersion()); + state Key beginVersionKey; - state Reference srcTr(new ReadYourWritesTransaction(taskBucket->src)); + state Reference srcTr(new ReadYourWritesTransaction(taskBucket->src)); loop { try { srcTr->setOption(FDBTransactionOptions::LOCK_AWARE); @@ -1172,6 +1114,9 @@ namespace dbBackup { } } + Version bVersion = wait(srcTr->getReadVersion()); + beginVersionKey = BinaryWriter::toValue(bVersion, Unversioned()); + task->params[BackupAgentBase::destUid] = destUidValue; Void _ = wait(srcTr->commit()); @@ -1189,12 +1134,24 @@ namespace dbBackup { state 
Future verified = taskBucket->keepRunning(tr, task); Void _ = wait(verified); - Subspace config = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyConfig).get(logUidValue); - tr->set(config.get(logUidValue).pack(BackupAgentBase::destUid), task->params[BackupAgentBase::destUid]); + // Set destUid at destination side + state Subspace config = Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keyConfig).get(logUidValue); + tr->set(config.pack(BackupAgentBase::destUid), task->params[BackupAgentBase::destUid]); + + // Use existing beginVersion if we already have one + Optional backupStartVersion = wait(tr->get(config.pack(BackupAgentBase::backupStartVersion))); + if (backupStartVersion.present()) { + beginVersionKey = backupStartVersion.get(); + } else { + tr->set(config.pack(BackupAgentBase::backupStartVersion), beginVersionKey); + } + + task->params[BackupAgentBase::keyBeginVersion] = beginVersionKey; + Void _ = wait(tr->commit()); break; } catch (Error &e) { - Void _ = wait(srcTr->onError(e)); + Void _ = wait(tr->onError(e)); } } @@ -1206,14 +1163,11 @@ namespace dbBackup { state Optional v = wait( srcTr2->get( sourceStates.pack(DatabaseBackupAgent::keyFolderId) ) ); - state Standalone beginVersion = BinaryWriter::toValue(srcTr2->getReadVersion().get(), Unversioned()); - task->params[BackupAgentBase::keyBeginVersion] = beginVersion; - if(v.present() && BinaryReader::fromStringRef(v.get(), Unversioned()) >= BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyFolderId], Unversioned())) return Void(); Key versionKey = logUidValue.withPrefix(destUidValue).withPrefix(backupLatestVersionsPrefix); - srcTr2->set(versionKey, beginVersion); + srcTr2->set(versionKey, beginVersionKey); srcTr2->set( Subspace(databaseBackupPrefixRange.begin).get(BackupAgentBase::keySourceTagName).pack(task->params[BackupAgentBase::keyTagName]), logUidValue ); srcTr2->set( sourceStates.pack(DatabaseBackupAgent::keyFolderId), task->params[DatabaseBackupAgent::keyFolderId] ); @@ -1222,7 +1176,7 @@ namespace dbBackup { state Key destPath = destUidValue.withPrefix(backupLogKeys.begin); // Start logging the mutations for the specified ranges of the tag for (auto &backupRange : backupRanges) { - srcTr2->set(logRangesEncodeKey(backupRange.begin, logUid), logRangesEncodeValue(backupRange.end, destPath)); + srcTr2->set(logRangesEncodeKey(backupRange.begin, BinaryReader::fromStringRef(destUidValue, Unversioned())), logRangesEncodeValue(backupRange.end, destPath)); } Void _ = wait(srcTr2->commit()); @@ -1273,7 +1227,7 @@ namespace dbBackup { return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Key logUid, Key backupUid, /*Key destUid,*/ Key keyAddPrefix, Key keyRemovePrefix, Key keyConfigBackupRanges, Key tagName, TaskCompletionKey completionKey, Reference waitFor = Reference(), bool databasesInSync=false) + ACTOR static Future addTask(Reference tr, Reference taskBucket, Key logUid, Key backupUid, Key keyAddPrefix, Key keyRemovePrefix, Key keyConfigBackupRanges, Key tagName, TaskCompletionKey completionKey, Reference waitFor = Reference(), bool databasesInSync=false) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(StartFullBackupTaskFunc::name, StartFullBackupTaskFunc::version, doneKey)); @@ -1745,10 +1699,9 @@ public: srcTr->set( backupAgent->sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(DatabaseBackupAgent::getStateText(BackupAgentBase::STATE_PARTIALLY_ABORTED) )); srcTr->set( 
backupAgent->sourceStates.get(logUidValue).pack(DatabaseBackupAgent::keyFolderId), backupUid ); - srcTr->clear(prefixRange(logUidValue.withPrefix(logRangesRange.begin))); Void _ = wait(srcTr->commit()); - endVersion = srcTr->getCommittedVersion(); + endVersion = srcTr->getCommittedVersion() + 1; break; } @@ -1758,33 +1711,7 @@ public: } if (clearSrcDb) { - state bool clearVersionHistory = false; - state Version currBeginVersion = beginVersion; - state Version currEndVersion; - - while (currBeginVersion < endVersion) { - state Reference clearSrcTr(new ReadYourWritesTransaction(backupAgent->taskBucket->src)); - - loop{ - try { - clearSrcTr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - clearSrcTr->setOption(FDBTransactionOptions::LOCK_AWARE); - currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); - - if (currEndVersion == endVersion) { - clearSrcTr->set( backupAgent->sourceStates.pack(DatabaseBackupAgent::keyStateStatus), StringRef(BackupAgentBase::getStateText(BackupAgentBase::STATE_ABORTED) )); - clearVersionHistory = true; - } - - Void _ = wait(clearLogRanges(clearSrcTr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); - Void _ = wait(clearSrcTr->commit()); - currBeginVersion = currEndVersion; - break; - } catch (Error &e) { - Void _ = wait(clearSrcTr->onError(e)); - } - } - } + Void _ = wait(eraseLogData(backupAgent->taskBucket->src, logUidValue, destUidValue, true, beginVersion, endVersion)); } tr = Reference(new ReadYourWritesTransaction(cx)); diff --git a/fdbclient/FileBackupAgent.actor.cpp b/fdbclient/FileBackupAgent.actor.cpp index b1a4530d36..ed4dfe3104 100644 --- a/fdbclient/FileBackupAgent.actor.cpp +++ b/fdbclient/FileBackupAgent.actor.cpp @@ -1540,7 +1540,7 @@ namespace fileBackup { .detail("ScheduledVersion", scheduledVersion) .detail("BeginKey", range.begin.printable()) .detail("EndKey", range.end.printable()) - .suppressFor(2, true); + .suppressFor(2); } else { // This shouldn't happen because if the transaction was already done or if another execution @@ -1835,39 +1835,6 @@ namespace fileBackup { } } Params; - ACTOR static Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion) { - if (endVersion <= beginVersion) - return Void(); - - state Version currBeginVersion = beginVersion; - state Version currEndVersion; - state bool clearVersionHistory = false; - - while (currBeginVersion < endVersion) { - state Reference tr(new ReadYourWritesTransaction(cx)); - - loop{ - try { - currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); - tr->setOption(FDBTransactionOptions::LOCK_AWARE); - tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - - if (backupDone && currEndVersion == endVersion) { - clearVersionHistory = true; - } - Void _ = wait(clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); - Void _ = wait(tr->commit()); - currBeginVersion = currEndVersion; - break; - } catch (Error &e) { - Void _ = wait(tr->onError(e)); - } - } - } - - return Void(); - } - ACTOR static Future _execute(Database cx, Reference taskBucket, Reference futureBucket, Reference task) { state Reference lock(new FlowLock(CLIENT_KNOBS->BACKUP_LOCK_BYTES)); Void _ = wait(checkTaskVersion(cx, task, EraseLogRangeTaskFunc::name, EraseLogRangeTaskFunc::version)); @@ -2068,10 +2035,7 @@ 
namespace fileBackup { state BackupConfig backup(task); state UID uid = backup.getUid(); - state Key configPath = uidPrefixKey(logRangesRange.begin, uid); - tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); - tr->clear(KeyRangeRef(configPath, strinc(configPath))); state Key destUidValue = wait(backup.destUidValue().getOrThrow(tr)); Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, backup.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); @@ -3622,10 +3586,7 @@ public: // Cancel all backup tasks through tag Void _ = wait(tag.cancel(tr)); - Key configPath = uidPrefixKey(logRangesRange.begin, config.getUid()); - tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); - tr->clear(KeyRangeRef(configPath, strinc(configPath))); state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); state Version endVersion = wait(tr->getReadVersion()); @@ -3670,9 +3631,6 @@ public: // Cancel backup task through tag Void _ = wait(tag.cancel(tr)); - Key configPath = uidPrefixKey(logRangesRange.begin, config.getUid()); - - tr->clear(KeyRangeRef(configPath, strinc(configPath))); Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); config.stateEnum().set(tr, EBackupState::STATE_ABORTED); diff --git a/fdbserver/workloads/BackupCorrectness.actor.cpp b/fdbserver/workloads/BackupCorrectness.actor.cpp index 705f3ce69f..9f968d1aa5 100644 --- a/fdbserver/workloads/BackupCorrectness.actor.cpp +++ b/fdbserver/workloads/BackupCorrectness.actor.cpp @@ -61,13 +61,11 @@ struct BackupAndRestoreCorrectnessWorkload : TestWorkload { UID randomID = g_nondeterministic_random->randomUniqueID(); if (shareLogRange) { - if (g_random->random01() < 0.5) { - backupRanges.push_back_deep(backupRanges.arena(), normalKeys); - } else if (g_random->random01() < 0.75) { - backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\x7f"))); - } else { - backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(LiteralStringRef("\x7f"), normalKeys.end)); - } + bool beforePrefix = sharedRandomNumber & 1; + if (beforePrefix) + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\xfe\xff\xfe"))); + else + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(strinc(LiteralStringRef("\x00\x00\x01")), normalKeys.end)); } else if (backupRangesCount <= 0) { backupRanges.push_back_deep(backupRanges.arena(), normalKeys); } else { diff --git a/fdbserver/workloads/BackupToDBCorrectness.actor.cpp b/fdbserver/workloads/BackupToDBCorrectness.actor.cpp index f9cb1bfc1e..5b3144436d 100644 --- a/fdbserver/workloads/BackupToDBCorrectness.actor.cpp +++ b/fdbserver/workloads/BackupToDBCorrectness.actor.cpp @@ -58,11 +58,12 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { agentRequest = getOption(options, LiteralStringRef("simDrAgents"), true); shareLogRange = getOption(options, LiteralStringRef("shareLogRange"), false); - beforePrefix = g_random->random01() < 0.5; + // Use sharedRandomNumber if shareLogRange is true so that we can ensure backup and DR both backup the same range + beforePrefix = shareLogRange ? 
(sharedRandomNumber & 1) : (g_random->random01() < 0.5); + if (beforePrefix) { extraPrefix = backupPrefix.withPrefix(LiteralStringRef("\xfe\xff\xfe")); backupPrefix = backupPrefix.withPrefix(LiteralStringRef("\xfe\xff\xff")); - } else { extraPrefix = backupPrefix.withPrefix(LiteralStringRef("\x00\x00\x01")); @@ -76,13 +77,10 @@ struct BackupToDBCorrectnessWorkload : TestWorkload { UID randomID = g_nondeterministic_random->randomUniqueID(); if (shareLogRange) { - if (g_random->random01() < 0.5) { - backupRanges.push_back_deep(backupRanges.arena(), normalKeys); - } else if (g_random->random01() < 0.75) { - backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\x7f"))); - } else { - backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(LiteralStringRef("\x7f"), normalKeys.end)); - } + if (beforePrefix) + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, LiteralStringRef("\xfe\xff\xfe"))); + else + backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(strinc(LiteralStringRef("\x00\x00\x01")), normalKeys.end)); } else if(backupRangesCount <= 0) { if (beforePrefix) backupRanges.push_back_deep(backupRanges.arena(), KeyRangeRef(normalKeys.begin, std::min(backupPrefix, extraPrefix))); diff --git a/tests/slow/SharedBackupCorrectness.txt b/tests/slow/SharedBackupCorrectness.txt index 035f40ad96..24d0afff74 100644 --- a/tests/slow/SharedBackupCorrectness.txt +++ b/tests/slow/SharedBackupCorrectness.txt @@ -7,19 +7,22 @@ testTitle=BackupAndRestore clearAfterTest=false testName=BackupAndRestoreCorrectness - backupTag=backup2 - backupAfter=20.0 + backupTag=backup1 + backupAfter=10.0 + restoreAfter=60.0 clearAfterTest=false simBackupAgents=BackupToFileAndDB shareLogRange=true - performRestore=false + performRestore=true + allowPauses=false testName=BackupToDBCorrectness - backupTag=backup3 + backupTag=backup2 backupPrefix=b1 backupAfter=15.0 restoreAfter=60.0 performRestore=false clearAfterTest=false simBackupAgents=BackupToFileAndDB - shareLogRange=true \ No newline at end of file + shareLogRange=true + extraDB=1 \ No newline at end of file diff --git a/tests/slow/SharedBackupToDBCorrectness.txt b/tests/slow/SharedBackupToDBCorrectness.txt new file mode 100644 index 0000000000..e0d2770a77 --- /dev/null +++ b/tests/slow/SharedBackupToDBCorrectness.txt @@ -0,0 +1,27 @@ +testTitle=BackupAndRestore + testName=Cycle + nodeCount=3000 + transactionsPerSecond=500.0 + testDuration=30.0 + expectedRate=0 + clearAfterTest=false + + testName=BackupAndRestoreCorrectness + backupTag=backup1 + backupAfter=10.0 + clearAfterTest=false + simBackupAgents=BackupToFileAndDB + shareLogRange=true + performRestore=false + allowPauses=false + + testName=BackupToDBCorrectness + backupTag=backup2 + backupPrefix=b2 + backupAfter=15.0 + restoreAfter=60.0 + performRestore=true + clearAfterTest=false + simBackupAgents=BackupToFileAndDB + shareLogRange=true + extraDB=1 \ No newline at end of file From 582b875a05bf030f873cb42b676c389919937a09 Mon Sep 17 00:00:00 2001 From: Yichi Chiang Date: Fri, 16 Mar 2018 15:40:59 -0700 Subject: [PATCH 022/127] Refactor EraseLogData() --- fdbclient/BackupAgent.h | 3 +- fdbclient/BackupAgentBase.actor.cpp | 48 +++++++++++++++---------- fdbclient/DatabaseBackupAgent.actor.cpp | 14 ++++---- fdbclient/FileBackupAgent.actor.cpp | 46 +++++------------------- 4 files changed, 46 insertions(+), 65 deletions(-) diff --git a/fdbclient/BackupAgent.h b/fdbclient/BackupAgent.h index 8429fa55d0..788fa58510 100644 --- 
a/fdbclient/BackupAgent.h +++ b/fdbclient/BackupAgent.h @@ -57,7 +57,6 @@ public: static const Key keyBeginKey; static const Key keyEndKey; static const Key destUid; - static const Key backupDone; static const Key backupStartVersion; static const Key keyTagName; @@ -421,7 +420,7 @@ bool copyParameter(Reference source, Reference dest, Key key); Version getVersionFromString(std::string const& value); Standalone> getLogRanges(Version beginVersion, Version endVersion, Key destUidValue, int blockSize = CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE); Standalone> getApplyRanges(Version beginVersion, Version endVersion, Key backupUid); -Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid = false, Version backupUid = 0); +Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, Optional beginVersion = Optional(), Optional endVersion = Optional(), bool checkBackupUid = false, Version backupUid = 0); Key getApplyKey( Version version, Key backupUid ); std::pair decodeBKMutationLogKey(Key key); Standalone> decodeBackupLogValue(StringRef value); diff --git a/fdbclient/BackupAgentBase.actor.cpp b/fdbclient/BackupAgentBase.actor.cpp index c0f71cd3bf..867c3a3a1e 100644 --- a/fdbclient/BackupAgentBase.actor.cpp +++ b/fdbclient/BackupAgentBase.actor.cpp @@ -36,7 +36,6 @@ const Key BackupAgentBase::keyLastUid = LiteralStringRef("last_uid"); const Key BackupAgentBase::keyBeginKey = LiteralStringRef("beginKey"); const Key BackupAgentBase::keyEndKey = LiteralStringRef("endKey"); const Key BackupAgentBase::destUid = LiteralStringRef("destUid"); -const Key BackupAgentBase::backupDone = LiteralStringRef("backupDone"); const Key BackupAgentBase::backupStartVersion = LiteralStringRef("backupStartVersion"); const Key BackupAgentBase::keyTagName = LiteralStringRef("tagname"); @@ -625,10 +624,6 @@ ACTOR Future applyMutations(Database cx, Key uid, Key addPrefix, Key remov } ACTOR Future _clearLogRanges(Reference tr, bool clearVersionHistory, Key logUidValue, Key destUidValue, Version beginVersion, Version endVersion) { - if (!destUidValue.size()) { - return Void(); - } - state Key backupLatestVersionsPath = destUidValue.withPrefix(backupLatestVersionsPrefix); state Key backupLatestVersionsKey = logUidValue.withPrefix(backupLatestVersionsPath); tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); @@ -652,9 +647,7 @@ ACTOR Future _clearLogRanges(Reference tr, bool return Void(); } - // If clear version history is required, then we need to clear log ranges up to next smallest version which might be greater than endVersion - // If size of backupVersions is greater than 1, we can definitely find a version less than INTMAX_MAX, otherwise we clear all log ranges without calling getLogRanges() - Version nextSmallestVersion = clearVersionHistory ? 
INTMAX_MAX : endVersion; + Version nextSmallestVersion = endVersion; bool clearLogRangesRequired = true; // More than one backup/DR with the same range @@ -710,20 +703,34 @@ Future clearLogRanges(Reference tr, bool clearV return _clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, beginVersion, endVersion); } -ACTOR static Future _eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid, Version backupUid) { - if (endVersion <= beginVersion) +ACTOR static Future _eraseLogData(Database cx, Key logUidValue, Key destUidValue, Optional beginVersion, Optional endVersion, bool checkBackupUid, Version backupUid) { + if ((beginVersion.present() && endVersion.present() && endVersion.get() <= beginVersion.get()) || !destUidValue.size()) return Void(); - state Version currBeginVersion = beginVersion; + state Version currBeginVersion; + state Version endVersionValue; state Version currEndVersion; - state bool clearVersionHistory = false; + state bool clearVersionHistory; - while (currBeginVersion < endVersion) { + ASSERT(beginVersion.present() == endVersion.present()); + if (beginVersion.present()) { + currBeginVersion = beginVersion.get(); + endVersionValue = endVersion.get(); + clearVersionHistory = false; + } else { + // If beginVersion and endVersion are not presented, it means backup is done and we need to clear version history. + // Set currBeginVersion to INTMAX_MAX and it will be set to the correct version in clearLogRanges(). + // Set endVersionValue to INTMAX_MAX since we need to clear log ranges up to next smallest version. + currBeginVersion = endVersionValue = currEndVersion = INTMAX_MAX; + clearVersionHistory = true; + } + + + while (currBeginVersion < endVersionValue || clearVersionHistory) { state Reference tr(new ReadYourWritesTransaction(cx)); loop{ try { - currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersion); tr->setOption(FDBTransactionOptions::LOCK_AWARE); tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); @@ -734,12 +741,17 @@ ACTOR static Future _eraseLogData(Database cx, Key logUidValue, Key destUi return Void(); } - if (backupDone && currEndVersion == endVersion) { - clearVersionHistory = true; + if (!clearVersionHistory) { + currEndVersion = std::min(currBeginVersion + CLIENT_KNOBS->CLEAR_LOG_RANGE_COUNT * CLIENT_KNOBS->LOG_RANGE_BLOCK_SIZE, endVersionValue); } Void _ = wait(clearLogRanges(tr, clearVersionHistory, logUidValue, destUidValue, currBeginVersion, currEndVersion)); Void _ = wait(tr->commit()); + + if (clearVersionHistory) { + return Void(); + } + currBeginVersion = currEndVersion; break; } catch (Error &e) { @@ -751,6 +763,6 @@ ACTOR static Future _eraseLogData(Database cx, Key logUidValue, Key destUi return Void(); } -Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, bool backupDone, Version beginVersion, Version endVersion, bool checkBackupUid, Version backupUid) { - return _eraseLogData(cx, logUidValue, destUidValue, backupDone, beginVersion, endVersion, checkBackupUid, backupUid); +Future eraseLogData(Database cx, Key logUidValue, Key destUidValue, Optional beginVersion, Optional endVersion, bool checkBackupUid, Version backupUid) { + return _eraseLogData(cx, logUidValue, destUidValue, beginVersion, endVersion, checkBackupUid, backupUid); } \ No newline at end of file diff --git a/fdbclient/DatabaseBackupAgent.actor.cpp b/fdbclient/DatabaseBackupAgent.actor.cpp 
index ed6f20d52b..523fb04f10 100644 --- a/fdbclient/DatabaseBackupAgent.actor.cpp +++ b/fdbclient/DatabaseBackupAgent.actor.cpp @@ -477,14 +477,13 @@ namespace dbBackup { Version beginVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyBeginVersion], Unversioned()); Version endVersion = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::keyEndVersion], Unversioned()); - bool backupDone = BinaryReader::fromStringRef(task->params[DatabaseBackupAgent::backupDone], Unversioned()); - Void _ = wait(eraseLogData(taskBucket->src, task->params[BackupAgentBase::keyConfigLogUid], task->params[BackupAgentBase::destUid], backupDone, beginVersion, endVersion, false, BinaryReader::fromStringRef(task->params[BackupAgentBase::keyFolderId], Unversioned()))); + Void _ = wait(eraseLogData(taskBucket->src, task->params[BackupAgentBase::keyConfigLogUid], task->params[BackupAgentBase::destUid], Optional(beginVersion), Optional(endVersion), true, BinaryReader::fromStringRef(task->params[BackupAgentBase::keyFolderId], Unversioned()))); return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, Version endVersion, bool backupDone, TaskCompletionKey completionKey, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, Reference parentTask, Version beginVersion, Version endVersion, TaskCompletionKey completionKey, Reference waitFor = Reference()) { Key doneKey = wait(completionKey.get(tr, taskBucket)); Reference task(new Task(EraseLogRangeTaskFunc::name, EraseLogRangeTaskFunc::version, doneKey, 1)); @@ -492,7 +491,6 @@ namespace dbBackup { task->params[DatabaseBackupAgent::keyBeginVersion] = BinaryWriter::toValue(beginVersion, Unversioned()); task->params[DatabaseBackupAgent::keyEndVersion] = BinaryWriter::toValue(endVersion, Unversioned()); - task->params[DatabaseBackupAgent::backupDone] = BinaryWriter::toValue(backupDone, Unversioned()); if (!waitFor) { return taskBucket->addTask(tr, task, parentTask->params[Task::reservedTaskParamValidKey], task->params[BackupAgentBase::keyFolderId]); @@ -750,7 +748,7 @@ namespace dbBackup { // Do not erase at the first time if (prevBeginVersion > 0) { - addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, false, TaskCompletionKey::joinWith(allPartsDone))); + addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, TaskCompletionKey::joinWith(allPartsDone))); } Void _ = wait(waitForAll(addTaskVector) && taskBucket->finish(tr, task)); @@ -857,7 +855,7 @@ namespace dbBackup { } Version backupUid = BinaryReader::fromStringRef(task->params[BackupAgentBase::keyFolderId], Unversioned()); - Void _ = wait(eraseLogData(taskBucket->src, logUidValue, destUidValue, true, beginVersion, endVersion, true, backupUid)); + Void _ = wait(eraseLogData(taskBucket->src, logUidValue, destUidValue, Optional(), Optional(), true, backupUid)); return Void(); } @@ -953,7 +951,7 @@ namespace dbBackup { } if (prevBeginVersion > 0) { - addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, false, TaskCompletionKey::joinWith(allPartsDone))); + addTaskVector.push_back(EraseLogRangeTaskFunc::addTask(tr, taskBucket, task, prevBeginVersion, beginVersion, TaskCompletionKey::joinWith(allPartsDone))); } Void _ = wait(waitForAll(addTaskVector) && taskBucket->finish(tr, task)); @@ -1711,7 +1709,7 @@ public: } if 
(clearSrcDb) { - Void _ = wait(eraseLogData(backupAgent->taskBucket->src, logUidValue, destUidValue, true, beginVersion, endVersion)); + Void _ = wait(eraseLogData(backupAgent->taskBucket->src, logUidValue, destUidValue)); } tr = Reference(new ReadYourWritesTransaction(cx)); diff --git a/fdbclient/FileBackupAgent.actor.cpp b/fdbclient/FileBackupAgent.actor.cpp index ed4dfe3104..13d20403df 100644 --- a/fdbclient/FileBackupAgent.actor.cpp +++ b/fdbclient/FileBackupAgent.actor.cpp @@ -1827,9 +1827,6 @@ namespace fileBackup { static TaskParam endVersion() { return LiteralStringRef(__FUNCTION__); } - static TaskParam backupDone() { - return LiteralStringRef(__FUNCTION__); - } static TaskParam destUidValue() { return LiteralStringRef(__FUNCTION__); } @@ -1841,45 +1838,21 @@ namespace fileBackup { state Version beginVersion = Params.beginVersion().get(task); state Version endVersion = Params.endVersion().get(task); - state bool backupDone = Params.backupDone().get(task); state Key destUidValue = Params.destUidValue().get(task); state BackupConfig config(task); state Key logUidValue = config.getUidAsKey(); - state Reference tr(new ReadYourWritesTransaction(cx)); - - loop { - try { - tr->setOption(FDBTransactionOptions::LOCK_AWARE); - tr->setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); - - if (beginVersion == 0) { - Key latestVersionKey = logUidValue.withPrefix(destUidValue.withPrefix(backupLatestVersionsPrefix)); - - Optional bVersion = wait(tr->get(latestVersionKey)); - if (bVersion.present()) { - beginVersion = BinaryReader::fromStringRef(bVersion.get(), Unversioned()); - } else { - return Void(); - } - - Version eVersion = wait(tr->getReadVersion()); - endVersion = eVersion; - } - - break; - } catch (Error &e) { - Void _ = wait(tr->onError(e)); - } + if (beginVersion == 0) { + Void _ = wait(eraseLogData(cx, logUidValue, destUidValue)); + } else { + Void _ = wait(eraseLogData(cx, logUidValue, destUidValue, Optional(beginVersion), Optional(endVersion))); } - Void _ = wait(eraseLogData(cx, logUidValue, destUidValue, backupDone, beginVersion, endVersion)); - return Void(); } - ACTOR static Future addTask(Reference tr, Reference taskBucket, UID logUid, TaskCompletionKey completionKey, bool backupDone, Key destUidValue, Version beginVersion = 0, Version endVersion = 0, Reference waitFor = Reference()) { + ACTOR static Future addTask(Reference tr, Reference taskBucket, UID logUid, TaskCompletionKey completionKey, Key destUidValue, Version beginVersion = 0, Version endVersion = 0, Reference waitFor = Reference()) { Key key = wait(addBackupTask(EraseLogRangeTaskFunc::name, EraseLogRangeTaskFunc::version, tr, taskBucket, completionKey, @@ -1888,7 +1861,6 @@ namespace fileBackup { [=](Reference task) { Params.beginVersion().set(task, beginVersion); Params.endVersion().set(task, endVersion); - Params.backupDone().set(task, backupDone); Params.destUidValue().set(task, destUidValue); }, 0, false)); @@ -1986,7 +1958,7 @@ namespace fileBackup { // Do not erase at the first time if (prevBeginVersion > 0) { state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); - Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, config.getUid(), TaskCompletionKey::joinWith(logDispatchBatchFuture), false, destUidValue, prevBeginVersion, beginVersion)); + Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, config.getUid(), TaskCompletionKey::joinWith(logDispatchBatchFuture), destUidValue, prevBeginVersion, beginVersion)); } Key _ = wait(BackupLogsDispatchTask::addTask(tr, taskBucket, task, 
beginVersion, endVersion, TaskCompletionKey::signal(onDone), logDispatchBatchFuture)); @@ -2037,7 +2009,7 @@ namespace fileBackup { tr->setOption(FDBTransactionOptions::COMMIT_ON_FIRST_PROXY); state Key destUidValue = wait(backup.destUidValue().getOrThrow(tr)); - Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, backup.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); + Key _ = wait(EraseLogRangeTaskFunc::addTask(tr, taskBucket, backup.getUid(), TaskCompletionKey::noSignal(), destUidValue)); backup.stateEnum().set(tr, EBackupState::STATE_COMPLETED); @@ -3591,7 +3563,7 @@ public: state Key destUidValue = wait(config.destUidValue().getOrThrow(tr)); state Version endVersion = wait(tr->getReadVersion()); - Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); + Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), destUidValue)); config.stateEnum().set(tr, EBackupState::STATE_COMPLETED); @@ -3631,7 +3603,7 @@ public: // Cancel backup task through tag Void _ = wait(tag.cancel(tr)); - Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), true, destUidValue)); + Key _ = wait(fileBackup::EraseLogRangeTaskFunc::addTask(tr, backupAgent->taskBucket, config.getUid(), TaskCompletionKey::noSignal(), destUidValue)); config.stateEnum().set(tr, EBackupState::STATE_ABORTED); From fecfea0f7dffdc9ba0ef0cb0d706e2620362f3e4 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Sat, 17 Mar 2018 10:24:44 -0700 Subject: [PATCH 023/127] fix: messages vector was not cleared --- fdbserver/TLogServer.actor.cpp | 1 + 1 file changed, 1 insertion(+) diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index a0bd0f146e..edae2d95b1 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -1387,6 +1387,7 @@ ACTOR Future pullAsyncData( TLogData* self, Reference logData, Ta } lastVer = ver; ver = r->version().version; + messages.clear(); if (!foundMessage) { ver--; From 9c8cb445d6f979a471aa4499338f551706806383 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Sat, 17 Mar 2018 10:36:19 -0700 Subject: [PATCH 024/127] optimized the tlog to use a vector for tags instead of a map --- fdbclient/FDBTypes.h | 2 +- fdbserver/Knobs.cpp | 3 +- fdbserver/Knobs.h | 1 - fdbserver/MoveKeys.actor.cpp | 2 +- fdbserver/TLogServer.actor.cpp | 229 ++++++++++++++++++++++----------- 5 files changed, 155 insertions(+), 82 deletions(-) diff --git a/fdbclient/FDBTypes.h b/fdbclient/FDBTypes.h index 62154caf8d..75cf3c5bfc 100644 --- a/fdbclient/FDBTypes.h +++ b/fdbclient/FDBTypes.h @@ -33,7 +33,7 @@ typedef StringRef KeyRef; typedef StringRef ValueRef; typedef int64_t Generation; -enum { tagLocalitySpecial = -100, tagLocalityLogRouter = -1, tagLocalityRemoteLog = -2, tagLocalityUpgraded = -3}; +enum { tagLocalitySpecial = -1, tagLocalityLogRouter = -2, tagLocalityRemoteLog = -3, tagLocalityUpgraded = -4}; //The TLog and LogRouter require these number to be as compact as possible #pragma pack(push, 1) struct Tag { diff --git a/fdbserver/Knobs.cpp b/fdbserver/Knobs.cpp index 689ac7462c..371112fb35 100644 --- a/fdbserver/Knobs.cpp +++ b/fdbserver/Knobs.cpp @@ -283,8 +283,7 @@ ServerKnobs::ServerKnobs(bool randomize, ClientKnobs* clientKnobs) { init( REMOVE_RETRY_DELAY, 1.0 ); init( MOVE_KEYS_KRM_LIMIT, 2000 ); if( randomize && 
BUGGIFY ) MOVE_KEYS_KRM_LIMIT = 2; init( MOVE_KEYS_KRM_LIMIT_BYTES, 1e5 ); if( randomize && BUGGIFY ) MOVE_KEYS_KRM_LIMIT_BYTES = 5e4; //This must be sufficiently larger than CLIENT_KNOBS->KEY_SIZE_LIMIT (fdbclient/Knobs.h) to ensure that at least two entries will be returned from an attempt to read a key range map - init( SKIP_TAGS_GROWTH_RATE, 2.0 ); - init( MAX_SKIP_TAGS, 100 ); + init( MAX_SKIP_TAGS, 1 ); //The TLogs require tags to be densely packed to be memory efficient, so be careful increasing this knob //FdbServer bool longReboots = randomize && BUGGIFY; diff --git a/fdbserver/Knobs.h b/fdbserver/Knobs.h index 3e9301f3ca..74d95d1596 100644 --- a/fdbserver/Knobs.h +++ b/fdbserver/Knobs.h @@ -227,7 +227,6 @@ public: double REMOVE_RETRY_DELAY; int MOVE_KEYS_KRM_LIMIT; int MOVE_KEYS_KRM_LIMIT_BYTES; //This must be sufficiently larger than CLIENT_KNOBS->KEY_SIZE_LIMIT (fdbclient/Knobs.h) to ensure that at least two entries will be returned from an attempt to read a key range map - double SKIP_TAGS_GROWTH_RATE; int MAX_SKIP_TAGS; //FdbServer diff --git a/fdbserver/MoveKeys.actor.cpp b/fdbserver/MoveKeys.actor.cpp index 8a96dd1d2e..25b29616bc 100644 --- a/fdbserver/MoveKeys.actor.cpp +++ b/fdbserver/MoveKeys.actor.cpp @@ -720,7 +720,7 @@ ACTOR Future> addStorageServer( Database cx, StorageServ throw recruitment_failed(); // There is a remote possibility that we successfully added ourselves and then someone removed us, so we have to fail if(e.code() == error_code_not_committed) { - maxSkipTags = std::min(maxSkipTags * SERVER_KNOBS->SKIP_TAGS_GROWTH_RATE, SERVER_KNOBS->MAX_SKIP_TAGS); + maxSkipTags = SERVER_KNOBS->MAX_SKIP_TAGS; } Void _ = wait( tr.onError(e) ); diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index edae2d95b1..557edb62b7 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -302,22 +302,24 @@ struct TLogData : NonCopyable { }; struct LogData : NonCopyable, public ReferenceCounted { - struct TagData { + struct TagData : NonCopyable, public ReferenceCounted { std::deque> version_messages; bool nothing_persistent; // true means tag is *known* to have no messages in persistentData. false means nothing. 
bool popped_recently; // `popped` has changed since last updatePersistentData Version popped; // see popped version tracking contract below bool update_version_sizes; + Tag tag; - TagData( Version popped, bool nothing_persistent, bool popped_recently, Tag tag ) : nothing_persistent(nothing_persistent), popped(popped), popped_recently(popped_recently), update_version_sizes(tag != txsTag) {} + TagData( Tag tag, Version popped, bool nothing_persistent, bool popped_recently ) : tag(tag), nothing_persistent(nothing_persistent), popped(popped), popped_recently(popped_recently), update_version_sizes(tag != txsTag) {} - TagData(TagData&& r) noexcept(true) : version_messages(std::move(r.version_messages)), nothing_persistent(r.nothing_persistent), popped_recently(r.popped_recently), popped(r.popped), update_version_sizes(r.update_version_sizes) {} + TagData(TagData&& r) noexcept(true) : version_messages(std::move(r.version_messages)), nothing_persistent(r.nothing_persistent), popped_recently(r.popped_recently), popped(r.popped), update_version_sizes(r.update_version_sizes), tag(r.tag) {} void operator= (TagData&& r) noexcept(true) { version_messages = std::move(r.version_messages); nothing_persistent = r.nothing_persistent; popped_recently = r.popped_recently; popped = r.popped; update_version_sizes = r.update_version_sizes; + tag = r.tag; } // Erase messages not needed to update *from* versions >= before (thus, messages with toversion <= before) @@ -376,7 +378,35 @@ struct LogData : NonCopyable, public ReferenceCounted { Version knownCommittedVersion; Deque>>> messageBlocks; - Map< Tag, TagData > tag_data; + std::vector>>> tag_data; //negative or positive tag.locality | abs(tag.locality) | tag.id + + Reference getTagData(int8_t locality, uint16_t id, std::vector>>& data) { + if(locality >= data.size()) { + data.resize(locality+1); + } + if(id >= data[locality].size()) { + data[locality].resize(id+1); + } + return data[locality][id]; + } + + Reference getTagData(Tag tag) { + if(tag.locality < 0) { + return getTagData(-(1+tag.locality), tag.id, tag_data[1]); + } + return getTagData(tag.locality, tag.id, tag_data[0]); + } + + //only callable after getTagData returns a null reference + Reference createTagData(Tag tag, Version popped, bool nothing_persistent, bool popped_recently) { + Reference newTagData = Reference( new TagData(tag, popped, nothing_persistent, popped_recently) ); + if(tag.locality < 0) { + tag_data[1][-(1+tag.locality)][tag.id] = newTagData; + } else { + tag_data[0][tag.locality][tag.id] = newTagData; + } + return newTagData; + } Map> version_sizes; @@ -400,6 +430,7 @@ struct LogData : NonCopyable, public ReferenceCounted { // These are initialized differently on init() or recovery recoveryCount(), stopped(false), initialized(false), queueCommittingVersion(0), newPersistentDataVersion(invalidVersion), unrecoveredBefore(0) { + tag_data.resize(2); startRole(interf.id(), UID(), "TLog"); persistentDataVersion.init(LiteralStringRef("TLog.PersistentDataVersion"), cc.id); @@ -464,28 +495,35 @@ ACTOR Future tLogLock( TLogData* self, ReplyPromise< TLogLockResult > repl TLogLockResult result; result.end = stopVersion; result.knownCommittedVersion = logData->knownCommittedVersion; - for( auto & tag : logData->tag_data ) - result.tags.push_back( tag.key ); + + for(int tag_special = 0; tag_special < 2; tag_special++) { + for(int tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { + for(int tag_id = 0; tag_id < 
logData->tag_data[tag_special][tag_locality].size(); tag_id++) { + if(logData->tag_data[tag_special][tag_locality][tag_id]) { + result.tags.push_back(logData->tag_data[tag_special][tag_locality][tag_id]->tag); + } + } + } + } TraceEvent("TLogStop2", self->dbgid).detail("logId", logData->logId).detail("Ver", stopVersion).detail("isStopped", logData->stopped).detail("queueCommitted", logData->queueCommittedVersion.get()).detail("tags", describe(result.tags)); - reply.send( result ); return Void(); } -void updatePersistentPopped( TLogData* self, Reference logData, Tag tag, LogData::TagData& data ) { - if (!data.popped_recently) return; - self->persistentData->set(KeyValueRef( persistTagPoppedKey(logData->logId, tag), persistTagPoppedValue(data.popped) )); - data.popped_recently = false; +void updatePersistentPopped( TLogData* self, Reference logData, Reference data ) { + if (!data->popped_recently) return; + self->persistentData->set(KeyValueRef( persistTagPoppedKey(logData->logId, data->tag), persistTagPoppedValue(data->popped) )); + data->popped_recently = false; - if (data.nothing_persistent) return; + if (data->nothing_persistent) return; self->persistentData->clear( KeyRangeRef( - persistTagMessagesKey( logData->logId, tag, Version(0) ), - persistTagMessagesKey( logData->logId, tag, data.popped ) ) ); - if (data.popped > logData->persistentDataVersion) - data.nothing_persistent = true; + persistTagMessagesKey( logData->logId, data->tag, Version(0) ), + persistTagMessagesKey( logData->logId, data->tag, data->popped ) ) ); + if (data->popped > logData->persistentDataVersion) + data->nothing_persistent = true; } ACTOR Future updatePersistentData( TLogData* self, Reference logData, Version newPersistentDataVersion ) { @@ -498,33 +536,44 @@ ACTOR Future updatePersistentData( TLogData* self, Reference logD //TraceEvent("updatePersistentData", self->dbgid).detail("seq", newPersistentDataSeq); state bool anyData = false; - state Map::iterator tag; + // For all existing tags - for(tag = logData->tag_data.begin(); tag != logData->tag_data.end(); ++tag) { - state Version currentVersion = 0; - // Clear recently popped versions from persistentData if necessary - updatePersistentPopped( self, logData, tag->key, tag->value ); - // Transfer unpopped messages with version numbers less than newPersistentDataVersion to persistentData - state std::deque>::iterator msg = tag->value.version_messages.begin(); - while(msg != tag->value.version_messages.end() && msg->first <= newPersistentDataVersion) { - currentVersion = msg->first; - anyData = true; - tag->value.nothing_persistent = false; - BinaryWriter wr( Unversioned() ); + state int tag_special = 0; + state int tag_locality = 0; + state int tag_id = 0; - for(; msg != tag->value.version_messages.end() && msg->first == currentVersion; ++msg) - wr << msg->second.toStringRef(); + for(tag_special = 0; tag_special < 2; tag_special++) { + for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { + state Reference tagData = logData->tag_data[tag_special][tag_locality][tag_id]; + if(tagData) { + state Version currentVersion = 0; + // Clear recently popped versions from persistentData if necessary + updatePersistentPopped( self, logData, tagData ); + // Transfer unpopped messages with version numbers less than newPersistentDataVersion to persistentData + state std::deque>::iterator msg = tagData->version_messages.begin(); + while(msg != 
tagData->version_messages.end() && msg->first <= newPersistentDataVersion) { + currentVersion = msg->first; + anyData = true; + tagData->nothing_persistent = false; + BinaryWriter wr( Unversioned() ); - self->persistentData->set( KeyValueRef( persistTagMessagesKey( logData->logId, tag->key, currentVersion ), wr.toStringRef() ) ); + for(; msg != tagData->version_messages.end() && msg->first == currentVersion; ++msg) + wr << msg->second.toStringRef(); - Future f = yield(TaskUpdateStorage); - if(!f.isReady()) { - Void _ = wait(f); - msg = std::upper_bound(tag->value.version_messages.begin(), tag->value.version_messages.end(), std::make_pair(currentVersion, LengthPrefixedStringRef()), CompareFirst>()); + self->persistentData->set( KeyValueRef( persistTagMessagesKey( logData->logId, tagData->tag, currentVersion ), wr.toStringRef() ) ); + + Future f = yield(TaskUpdateStorage); + if(!f.isReady()) { + Void _ = wait(f); + msg = std::upper_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(currentVersion, LengthPrefixedStringRef()), CompareFirst>()); + } + } + + Void _ = wait(yield(TaskUpdateStorage)); + } } } - - Void _ = wait(yield(TaskUpdateStorage)); } self->persistentData->set( KeyValueRef( BinaryWriter::toValue(logData->logId,Unversioned()).withPrefix(persistCurrentVersionKeys.begin), BinaryWriter::toValue(newPersistentDataVersion, Unversioned()) ) ); @@ -538,9 +587,15 @@ ACTOR Future updatePersistentData( TLogData* self, Reference logD TEST(anyData); // TLog moved data to persistentData logData->persistentDataDurableVersion = newPersistentDataVersion; - for(tag = logData->tag_data.begin(); tag != logData->tag_data.end(); ++tag) { - Void _ = wait(tag->value.eraseMessagesBefore( newPersistentDataVersion+1, &self->bytesDurable, logData, TaskUpdateStorage )); - Void _ = wait(yield(TaskUpdateStorage)); + for(tag_special = 0; tag_special < 2; tag_special++) { + for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { + if(logData->tag_data[tag_special][tag_locality][tag_id]) { + Void _ = wait(logData->tag_data[tag_special][tag_locality][tag_id]->eraseMessagesBefore( newPersistentDataVersion+1, &self->bytesDurable, logData, TaskUpdateStorage )); + Void _ = wait(yield(TaskUpdateStorage)); + } + } + } } logData->version_sizes.erase(logData->version_sizes.begin(), logData->version_sizes.lower_bound(logData->persistentDataDurableVersion)); @@ -586,12 +641,26 @@ ACTOR Future updateStorage( TLogData* self ) { state Version nextVersion = 0; state int totalSize = 0; + state int tag_special = 0; + state int tag_locality = 0; + state int tag_id = 0; + state Reference tagData; + if(logData->stopped) { if (self->bytesInput - self->bytesDurable >= SERVER_KNOBS->TLOG_SPILL_THRESHOLD) { while(logData->persistentDataDurableVersion != logData->version.get()) { std::vector>::iterator, std::deque>::iterator>> iters; - for(auto tag = logData->tag_data.begin(); tag != logData->tag_data.end(); ++tag) - iters.push_back(std::make_pair(tag->value.version_messages.begin(), tag->value.version_messages.end())); + + for(tag_special = 0; tag_special < 2; tag_special++) { + for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { + tagData = logData->tag_data[tag_special][tag_locality][tag_id]; + if(tagData) { + 
iters.push_back(std::make_pair(tagData->version_messages.begin(), tagData->version_messages.end())); + } + } + } + } nextVersion = 0; while( totalSize < SERVER_KNOBS->UPDATE_STORAGE_BYTE_LIMIT || nextVersion <= logData->persistentDataVersion ) { @@ -646,14 +715,20 @@ ACTOR Future updateStorage( TLogData* self ) { ++sizeItr; nextVersion = sizeItr == logData->version_sizes.end() ? logData->version.get() : sizeItr->key; - state Map::iterator tag; - for(tag = logData->tag_data.begin(); tag != logData->tag_data.end(); ++tag) { - auto it = std::lower_bound(tag->value.version_messages.begin(), tag->value.version_messages.end(), std::make_pair(prevVersion, LengthPrefixedStringRef()), CompareFirst>()); - for(; it != tag->value.version_messages.end() && it->first < nextVersion; ++it) { - totalSize += it->second.expectedSize(); - } + for(tag_special = 0; tag_special < 2; tag_special++) { + for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { + tagData = logData->tag_data[tag_special][tag_locality][tag_id]; + if(tagData) { + auto it = std::lower_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(prevVersion, LengthPrefixedStringRef()), CompareFirst>()); + for(; it != tagData->version_messages.end() && it->first < nextVersion; ++it) { + totalSize += it->second.expectedSize(); + } - Void _ = wait(yield(TaskUpdateStorage)); + Void _ = wait(yield(TaskUpdateStorage)); + } + } + } } prevVersion = nextVersion; @@ -732,18 +807,18 @@ void commitMessages( Reference self, Version version, const std::vector block.append(block.arena(), msg.message.begin(), msg.message.size()); for(auto& tag : msg.tags) { - auto tsm = self->tag_data.find(tag); - if (tsm == self->tag_data.end()) { - tsm = self->tag_data.insert( mapPair(std::move(Tag(tag)), LogData::TagData(Version(0), true, true, tag) ), false ); + Reference tagData = self->getTagData(tag); + if(!tagData) { + tagData = self->createTagData(tag, 0, true, true); } - if (version >= tsm->value.popped) { - tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); - if(tsm->value.version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { - TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tsm->value.version_messages.back().second.expectedSize()); + if (version >= tagData->popped) { + tagData->version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); + if(tagData->version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { + TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tagData->version_messages.back().second.expectedSize()); } if (tag != txsTag) { - expectedBytes += tsm->value.version_messages.back().second.expectedSize(); + expectedBytes += tagData->version_messages.back().second.expectedSize(); } // The factor of VERSION_MESSAGES_OVERHEAD is intended to be an overestimate of the actual memory used to store this data in a std::deque. 
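
[Editor's note, not part of the patch] The hunks above replace `LogData`'s per-tag `Map` with nested vectors indexed by `tag.locality` and `tag.id`; together with the compact locality values introduced in `FDBTypes.h`, every tag maps to a small dense slot, so the per-commit tag lookups become plain vector indexing rather than map searches. Below is a minimal standalone sketch of that lookup scheme, with `std::shared_ptr` standing in for FoundationDB's `Reference<T>` and the Flow-specific machinery omitted; the names are illustrative, not the real types.

```cpp
// Editor's sketch of the dense tag -> slot lookup; not the actual LogData code.
#include <cstdint>
#include <memory>
#include <vector>

struct Tag { int8_t locality; uint16_t id; };

struct TagData {
    Tag tag;
    int64_t popped;
    bool nothingPersistent = true;
    bool poppedRecently = true;
    TagData(Tag t, int64_t p) : tag(t), popped(p) {}
};

struct TagStore {
    // planes[0]: locality >= 0; planes[1]: locality < 0 stored as -(1 + locality).
    // Each plane is indexed [locality][id] and resized on demand, so a lookup is
    // two vector indexings instead of a std::map search.
    std::vector<std::vector<std::shared_ptr<TagData>>> planes[2];

    std::shared_ptr<TagData>& slot(Tag tag) {
        int plane = tag.locality < 0 ? 1 : 0;
        size_t loc = tag.locality < 0 ? -(1 + tag.locality) : tag.locality;
        auto& p = planes[plane];
        if (loc >= p.size()) p.resize(loc + 1);
        if (tag.id >= p[loc].size()) p[loc].resize(tag.id + 1);
        return p[loc][tag.id];
    }

    // Mirrors getTagData()/createTagData(): the slot is null until first use,
    // then created with the given popped version.
    std::shared_ptr<TagData> getOrCreate(Tag tag, int64_t popped) {
        auto& s = slot(tag);
        if (!s) s = std::make_shared<TagData>(tag, popped);
        return s;
    }
};
```

A later commit in this series ("optimized tag lookups on the tlog by removing one level of vectors") flattens the two outer levels into one; the sketch keeps the two-plane form this commit uses.
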
@@ -789,31 +864,30 @@ void commitMessages( Reference self, Version version, Arena arena, Stri } Version poppedVersion( Reference self, Tag tag) { - auto mapIt = self->tag_data.find(tag); - if (mapIt == self->tag_data.end()) + auto tagData = self->getTagData(tag); + if (!tagData) return Version(0); - return mapIt->value.popped; + return tagData->popped; } std::deque> & get_version_messages( Reference self, Tag tag ) { - auto mapIt = self->tag_data.find(tag); - if (mapIt == self->tag_data.end()) { + auto tagData = self->getTagData(tag); + if (!tagData) { static std::deque> empty; return empty; } - return mapIt->value.version_messages; + return tagData->version_messages; }; ACTOR Future tLogPop( TLogData* self, TLogPopRequest req, Reference logData ) { - auto ti = logData->tag_data.find(req.tag); - if (ti == logData->tag_data.end()) { - ti = logData->tag_data.insert( mapPair(std::move(Tag(req.tag)), LogData::TagData(req.to, true, true, req.tag)) ); - } else if (req.to > ti->value.popped) { - ti->value.popped = req.to; - ti->value.popped_recently = true; - //if (to.epoch == self->epoch()) + auto tagData = logData->getTagData(req.tag); + if (!tagData) { + tagData = logData->createTagData(req.tag, req.to, true, true); + } else if (req.to > tagData->popped) { + tagData->popped = req.to; + tagData->popped_recently = true; if ( req.to > logData->persistentDataDurableVersion ) - Void _ = wait(ti->value.eraseMessagesBefore( req.to, &self->bytesDurable, logData, TaskTLogPop )); + Void _ = wait(tagData->eraseMessagesBefore( req.to, &self->bytesDurable, logData, TaskTLogPop )); //TraceEvent("TLogPop", self->dbgid).detail("Tag", req.tag).detail("To", req.to); } @@ -1595,8 +1669,9 @@ ACTOR Future restorePersistentState( TLogData* self, LocalityData locality Tag tag = decodeTagPoppedKey(rawId, kv.key); Version popped = decodeTagPoppedValue(kv.value); TraceEvent("TLogRestorePop", logData->logId).detail("Tag", tag.toString()).detail("To", popped); - ASSERT( logData->tag_data.find(tag) == logData->tag_data.end() ); - logData->tag_data.insert( mapPair( std::move(Tag(tag)), LogData::TagData( popped, false, false, tag )) ); + auto tagData = logData->getTagData(tag); + ASSERT( !tagData ); + logData->createTagData(tag, popped, false, false); } } } @@ -1765,14 +1840,14 @@ ACTOR Future recoverTagFromLogSystem( TLogData* self, Reference l } if(r) tagPopped = std::max(tagPopped, r->popped()); - auto tsm = logData->tag_data.find(tag); - if (tsm == logData->tag_data.end()) { - logData->tag_data.insert( mapPair(std::move(Tag(tag)), LogData::TagData(tagPopped, false, true, tag)) ); + auto tagData = logData->getTagData(tag); + if(!tagData) { + tagData = logData->createTagData(tag, tagPopped, false, true); } Void _ = wait(tLogPop( self, TLogPopRequest(tagPopped, tag), logData )); - updatePersistentPopped( self, logData, tag, logData->tag_data.find(tag)->value ); + updatePersistentPopped( self, logData, logData->getTagData(tag) ); TraceEvent("LogRecoveringTagComplete", logData->logId).detail("Tag", tag.toString()).detail("recoverAt", endVersion); return Void(); From 4dcef08260b4243ea0d50d0c13503801567e6ef5 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Sat, 17 Mar 2018 11:08:37 -0700 Subject: [PATCH 025/127] optimized the log router to use a vector instead of a map for tag data --- fdbserver/LogRouter.actor.cpp | 71 ++++++++++++++++++++++------------- 1 file changed, 45 insertions(+), 26 deletions(-) diff --git a/fdbserver/LogRouter.actor.cpp b/fdbserver/LogRouter.actor.cpp index 995927bfb4..6cb787f039 100644 --- 
a/fdbserver/LogRouter.actor.cpp +++ b/fdbserver/LogRouter.actor.cpp @@ -34,16 +34,17 @@ #include "flow/Stats.h" struct LogRouterData { - struct TagData { + struct TagData : NonCopyable, public ReferenceCounted { std::deque> version_messages; Version popped; - Tag t; + Tag tag; - TagData( Version popped, Tag tag ) : popped(popped), t(tag) {} + TagData( Tag tag, Version popped ) : tag(tag), popped(popped) {} - TagData(TagData&& r) noexcept(true) : version_messages(std::move(r.version_messages)), popped(r.popped) {} + TagData(TagData&& r) noexcept(true) : version_messages(std::move(r.version_messages)), tag(r.tag), popped(r.popped) {} void operator= (TagData&& r) noexcept(true) { version_messages = std::move(r.version_messages); + tag = r.tag; popped = r.popped; } @@ -76,10 +77,26 @@ struct LogRouterData { NotifiedVersion version; Version minPopped; Deque>>> messageBlocks; - Map< Tag, TagData > tag_data; Tag routerTag; int logSet; + std::vector> tag_data; //we only store data for the remote tag locality + + Reference getTagData(Tag tag) { + ASSERT(tag.locality == tagLocalityRemoteLog); + if(tag.id >= tag_data.size()) { + tag_data.resize(tag.id+1); + } + return tag_data[tag.id]; + } + + //only callable after getTagData returns a null reference + Reference createTagData(Tag tag, Version popped) { + Reference newTagData = Reference( new TagData(tag, popped) ); + tag_data[tag.id] = newTagData; + return newTagData; + } + LogRouterData(UID dbgid, Tag routerTag, int logSet) : dbgid(dbgid), routerTag(routerTag), logSet(logSet), logSystem(new AsyncVar>()) {} }; @@ -115,15 +132,15 @@ void commitMessages( LogRouterData* self, Version version, const std::vectortag_data.find(tag); - if (tsm == self->tag_data.end()) { - tsm = self->tag_data.insert( mapPair(std::move(Tag(tag)), LogRouterData::TagData(Version(0), tag) ), false ); + auto tagData = self->getTagData(tag); + if(!tagData) { + tagData = self->createTagData(tag, 0); } - if (version >= tsm->value.popped) { - tsm->value.version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); - if(tsm->value.version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { - TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tsm->value.version_messages.back().second.expectedSize()); + if (version >= tagData->popped) { + tagData->version_messages.push_back(std::make_pair(version, LengthPrefixedStringRef((uint32_t*)(block.end() - msg.message.size())))); + if(tagData->version_messages.back().second.expectedSize() > SERVER_KNOBS->MAX_MESSAGE_SIZE) { + TraceEvent(SevWarnAlways, "LargeMessage").detail("Size", tagData->version_messages.back().second.expectedSize()); } } } @@ -199,12 +216,12 @@ ACTOR Future pullAsyncData( LogRouterData *self, Tag tag ) { } std::deque> & get_version_messages( LogRouterData* self, Tag tag ) { - auto mapIt = self->tag_data.find(tag); - if (mapIt == self->tag_data.end()) { + auto tagData = self->getTagData(tag); + if (!tagData) { static std::deque> empty; return empty; } - return mapIt->value.version_messages; + return tagData->version_messages; }; void peekMessagesFromMemory( LogRouterData* self, TLogPeekRequest const& req, BinaryWriter& messages, Version& endVersion ) { @@ -233,10 +250,10 @@ void peekMessagesFromMemory( LogRouterData* self, TLogPeekRequest const& req, Bi } Version poppedVersion( LogRouterData* self, Tag tag) { - auto mapIt = self->tag_data.find(tag); - if (mapIt == self->tag_data.end()) + auto tagData = self->getTagData(tag); + if 
(!tagData) return Version(0); - return mapIt->value.popped; + return tagData->popped; } ACTOR Future logRouterPeekMessages( LogRouterData* self, TLogPeekRequest req ) { @@ -280,17 +297,19 @@ ACTOR Future logRouterPeekMessages( LogRouterData* self, TLogPeekRequest r } ACTOR Future logRouterPop( LogRouterData* self, TLogPopRequest req ) { - auto ti = self->tag_data.find(req.tag); - if (ti == self->tag_data.end()) { - ti = self->tag_data.insert( mapPair(std::move(Tag(req.tag)), LogRouterData::TagData(req.to, req.tag)) ); - } else if (req.to > ti->value.popped) { - ti->value.popped = req.to; - Void _ = wait(ti->value.eraseMessagesBefore( req.to, self, TaskTLogPop )); + auto tagData = self->getTagData(req.tag); + if (!tagData) { + tagData = self->createTagData(req.tag, req.to); + } else if (req.to > tagData->popped) { + tagData->popped = req.to; + Void _ = wait(tagData->eraseMessagesBefore( req.to, self, TaskTLogPop )); } state Version minPopped = std::numeric_limits::max(); - for( auto& it : self->tag_data ) { - minPopped = std::min( it.value.popped, minPopped ); + for( auto it : self->tag_data ) { + if(it) { + minPopped = std::min( it->popped, minPopped ); + } } while(!self->messageBlocks.empty() && self->messageBlocks.front().first <= minPopped) { From 54be14000d5086c106db3b4f1e2bf1b106cdc21a Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Sat, 17 Mar 2018 11:24:18 -0700 Subject: [PATCH 026/127] do not deserialize tags --- fdbserver/TLogServer.actor.cpp | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index 557edb62b7..e1c5ee66b3 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -1818,13 +1818,8 @@ ACTOR Future recoverTagFromLogSystem( TLogData* self, Reference l break; } - // FIXME: This logic duplicates stuff in LogPushData::addMessage(), and really would be better in PeekResults or somewhere else. Also unnecessary copying. - StringRef msg = r->getMessage(); - auto tags = r->getTags(); - wr << uint32_t( msg.size() + sizeof(uint32_t) + sizeof(uint16_t) + tags.size()*sizeof(Tag) ) << r->version().sub << uint16_t(tags.size()); - for(auto t : tags) { - wr << t; - } + // FIXME: Unnecessary copying. + StringRef msg = r->getMessageWithTags(); wr.serializeBytes( msg ); r->nextMessage(); } From d8e064d8bb277c8467beab9449a751e2591a6b5b Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Mon, 19 Mar 2018 17:48:28 -0700 Subject: [PATCH 027/127] fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance --- fdbserver/TLogServer.actor.cpp | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index e1c5ee66b3..dbdd6018b8 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -369,6 +369,7 @@ struct LogData : NonCopyable, public ReferenceCounted { Impl: Check tag_data->popped (after all waits) */ + AsyncTrigger stopCommit; bool stopped, initialized; DBRecoveryCount recoveryCount; @@ -1188,7 +1189,14 @@ ACTOR Future tLogCommit( g_traceBatch.addEvent("CommitDebug", tlogDebugID.get().first(), "TLog.tLogCommit.AfterTLogCommit"); } // Send replies only once all prior messages have been received and committed. 
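For readers skimming the hunks, a minimal stand-alone sketch of the behavior PATCH 027 adds may help: an in-flight commit waits either for its queue-committed version or for a stop notification, and answers with an error once the log generation has been stopped. Everything below uses plain C++ stand-ins (a mutex/condition-variable pair instead of flow's `NotifiedVersion`/`AsyncTrigger`); only the names `queueCommittedVersion`, `stopped`, and the stop-then-error flow are taken from the patch.

```cpp
// Illustrative sketch only -- not part of the patch. Plain C++ stand-ins for
// flow's NotifiedVersion (queueCommittedVersion) and AsyncTrigger (stopCommit).
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct LogGeneration {
    std::mutex m;
    std::condition_variable cv;
    uint64_t queueCommittedVersion = 0;
    bool stopped = false;

    // Commit path: data up to `version` is durable in the disk queue.
    void advanceQueueCommitted(uint64_t version) {
        {
            std::lock_guard<std::mutex> g(m);
            queueCommittedVersion = std::max(queueCommittedVersion, version);
        }
        cv.notify_all();
    }

    // Recruitment path: a new log was recruited on this shared log, so this
    // generation stops and queueCommittedVersion may never advance again.
    void triggerStopCommit() {
        {
            std::lock_guard<std::mutex> g(m);
            stopped = true;
        }
        cv.notify_all();
    }

    // Reply path: true  -> acknowledge the commit,
    //             false -> the log stopped first; reply with a tlog_stopped-style error.
    bool waitForCommitOrStop(uint64_t reqVersion) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return stopped || queueCommittedVersion >= reqVersion; });
        return !stopped;
    }
};

int main() {
    LogGeneration log;
    log.advanceQueueCommitted(5);
    bool acked = log.waitForCommitOrStop(3);     // already durable -> acknowledged
    log.triggerStopCommit();
    bool acked2 = log.waitForCommitOrStop(10);   // never durable, but the stop wakes us -> error path
    return (acked && !acked2) ? 0 : 1;
}
```

The essential property is the one named in the commit message: once the generation is stopped, every waiter is woken even though `queueCommittedVersion` may never reach `req.version`, so no commit reply can hang forever.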
- Void _ = wait( timeoutWarning( logData->queueCommittedVersion.whenAtLeast( req.version ), 0.1, warningCollectorInput ) ); + state Future stopped = logData->stopCommit.onTrigger(); + Void _ = wait( timeoutWarning( logData->queueCommittedVersion.whenAtLeast( req.version ) || stopped, 0.1, warningCollectorInput ) ); + + if(stopped.isReady()) { + ASSERT(logData->stopped); + req.reply.sendError( tlog_stopped() ); + return Void(); + } if(req.debugID.present()) g_traceBatch.addEvent("CommitDebug", tlogDebugID.get().first(), "TLog.tLogCommit.After"); @@ -1971,6 +1979,7 @@ ACTOR Future tLogStart( TLogData* self, InitializeTLogRequest req, Localit if(!it.second->recoveryComplete.isSet()) { it.second->recoveryComplete.sendError(end_of_stream()); } + it.second->stopCommit.trigger(); } state Reference logData = Reference( new LogData(self, recruited, req.remoteTag) ); From 0746fe4d561a08358d1736ec593279241093acc7 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Tue, 20 Mar 2018 10:41:42 -0700 Subject: [PATCH 028/127] optimized tag lookups on the tlog by removing one level of vectors --- fdbserver/TLogServer.actor.cpp | 134 ++++++++++++++------------------- 1 file changed, 56 insertions(+), 78 deletions(-) diff --git a/fdbserver/TLogServer.actor.cpp b/fdbserver/TLogServer.actor.cpp index dbdd6018b8..dca87018b6 100644 --- a/fdbserver/TLogServer.actor.cpp +++ b/fdbserver/TLogServer.actor.cpp @@ -379,33 +379,24 @@ struct LogData : NonCopyable, public ReferenceCounted { Version knownCommittedVersion; Deque>>> messageBlocks; - std::vector>>> tag_data; //negative or positive tag.locality | abs(tag.locality) | tag.id - - Reference getTagData(int8_t locality, uint16_t id, std::vector>>& data) { - if(locality >= data.size()) { - data.resize(locality+1); - } - if(id >= data[locality].size()) { - data[locality].resize(id+1); - } - return data[locality][id]; - } + std::vector>> tag_data; //tag.locality | tag.id Reference getTagData(Tag tag) { - if(tag.locality < 0) { - return getTagData(-(1+tag.locality), tag.id, tag_data[1]); + int idx = tag.locality >= 0 ? 2*tag.locality : 1-(2*tag.locality); + if(idx >= tag_data.size()) { + tag_data.resize(idx+1); } - return getTagData(tag.locality, tag.id, tag_data[0]); + if(tag.id >= tag_data[idx].size()) { + tag_data[idx].resize(tag.id+1); + } + return tag_data[idx][tag.id]; } //only callable after getTagData returns a null reference Reference createTagData(Tag tag, Version popped, bool nothing_persistent, bool popped_recently) { Reference newTagData = Reference( new TagData(tag, popped, nothing_persistent, popped_recently) ); - if(tag.locality < 0) { - tag_data[1][-(1+tag.locality)][tag.id] = newTagData; - } else { - tag_data[0][tag.locality][tag.id] = newTagData; - } + int idx = tag.locality >= 0 ? 
2*tag.locality : 1-(2*tag.locality); + tag_data[idx][tag.id] = newTagData; return newTagData; } @@ -431,7 +422,6 @@ struct LogData : NonCopyable, public ReferenceCounted { // These are initialized differently on init() or recovery recoveryCount(), stopped(false), initialized(false), queueCommittingVersion(0), newPersistentDataVersion(invalidVersion), unrecoveredBefore(0) { - tag_data.resize(2); startRole(interf.id(), UID(), "TLog"); persistentDataVersion.init(LiteralStringRef("TLog.PersistentDataVersion"), cc.id); @@ -497,12 +487,10 @@ ACTOR Future tLogLock( TLogData* self, ReplyPromise< TLogLockResult > repl result.end = stopVersion; result.knownCommittedVersion = logData->knownCommittedVersion; - for(int tag_special = 0; tag_special < 2; tag_special++) { - for(int tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { - for(int tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { - if(logData->tag_data[tag_special][tag_locality][tag_id]) { - result.tags.push_back(logData->tag_data[tag_special][tag_locality][tag_id]->tag); - } + for(int tag_locality = 0; tag_locality < logData->tag_data.size(); tag_locality++) { + for(int tag_id = 0; tag_id < logData->tag_data[tag_locality].size(); tag_id++) { + if(logData->tag_data[tag_locality][tag_id]) { + result.tags.push_back(logData->tag_data[tag_locality][tag_id]->tag); } } } @@ -539,40 +527,37 @@ ACTOR Future updatePersistentData( TLogData* self, Reference logD state bool anyData = false; // For all existing tags - state int tag_special = 0; state int tag_locality = 0; state int tag_id = 0; - for(tag_special = 0; tag_special < 2; tag_special++) { - for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { - for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { - state Reference tagData = logData->tag_data[tag_special][tag_locality][tag_id]; - if(tagData) { - state Version currentVersion = 0; - // Clear recently popped versions from persistentData if necessary - updatePersistentPopped( self, logData, tagData ); - // Transfer unpopped messages with version numbers less than newPersistentDataVersion to persistentData - state std::deque>::iterator msg = tagData->version_messages.begin(); - while(msg != tagData->version_messages.end() && msg->first <= newPersistentDataVersion) { - currentVersion = msg->first; - anyData = true; - tagData->nothing_persistent = false; - BinaryWriter wr( Unversioned() ); + for(tag_locality = 0; tag_locality < logData->tag_data.size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_locality].size(); tag_id++) { + state Reference tagData = logData->tag_data[tag_locality][tag_id]; + if(tagData) { + state Version currentVersion = 0; + // Clear recently popped versions from persistentData if necessary + updatePersistentPopped( self, logData, tagData ); + // Transfer unpopped messages with version numbers less than newPersistentDataVersion to persistentData + state std::deque>::iterator msg = tagData->version_messages.begin(); + while(msg != tagData->version_messages.end() && msg->first <= newPersistentDataVersion) { + currentVersion = msg->first; + anyData = true; + tagData->nothing_persistent = false; + BinaryWriter wr( Unversioned() ); - for(; msg != tagData->version_messages.end() && msg->first == currentVersion; ++msg) - wr << msg->second.toStringRef(); + for(; msg != tagData->version_messages.end() && msg->first == currentVersion; ++msg) + wr << 
msg->second.toStringRef(); - self->persistentData->set( KeyValueRef( persistTagMessagesKey( logData->logId, tagData->tag, currentVersion ), wr.toStringRef() ) ); + self->persistentData->set( KeyValueRef( persistTagMessagesKey( logData->logId, tagData->tag, currentVersion ), wr.toStringRef() ) ); - Future f = yield(TaskUpdateStorage); - if(!f.isReady()) { - Void _ = wait(f); - msg = std::upper_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(currentVersion, LengthPrefixedStringRef()), CompareFirst>()); - } + Future f = yield(TaskUpdateStorage); + if(!f.isReady()) { + Void _ = wait(f); + msg = std::upper_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(currentVersion, LengthPrefixedStringRef()), CompareFirst>()); } - - Void _ = wait(yield(TaskUpdateStorage)); } + + Void _ = wait(yield(TaskUpdateStorage)); } } } @@ -588,13 +573,11 @@ ACTOR Future updatePersistentData( TLogData* self, Reference logD TEST(anyData); // TLog moved data to persistentData logData->persistentDataDurableVersion = newPersistentDataVersion; - for(tag_special = 0; tag_special < 2; tag_special++) { - for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { - for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { - if(logData->tag_data[tag_special][tag_locality][tag_id]) { - Void _ = wait(logData->tag_data[tag_special][tag_locality][tag_id]->eraseMessagesBefore( newPersistentDataVersion+1, &self->bytesDurable, logData, TaskUpdateStorage )); - Void _ = wait(yield(TaskUpdateStorage)); - } + for(tag_locality = 0; tag_locality < logData->tag_data.size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_locality].size(); tag_id++) { + if(logData->tag_data[tag_locality][tag_id]) { + Void _ = wait(logData->tag_data[tag_locality][tag_id]->eraseMessagesBefore( newPersistentDataVersion+1, &self->bytesDurable, logData, TaskUpdateStorage )); + Void _ = wait(yield(TaskUpdateStorage)); } } } @@ -642,7 +625,6 @@ ACTOR Future updateStorage( TLogData* self ) { state Version nextVersion = 0; state int totalSize = 0; - state int tag_special = 0; state int tag_locality = 0; state int tag_id = 0; state Reference tagData; @@ -652,13 +634,11 @@ ACTOR Future updateStorage( TLogData* self ) { while(logData->persistentDataDurableVersion != logData->version.get()) { std::vector>::iterator, std::deque>::iterator>> iters; - for(tag_special = 0; tag_special < 2; tag_special++) { - for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { - for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { - tagData = logData->tag_data[tag_special][tag_locality][tag_id]; - if(tagData) { - iters.push_back(std::make_pair(tagData->version_messages.begin(), tagData->version_messages.end())); - } + for(tag_locality = 0; tag_locality < logData->tag_data.size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_locality].size(); tag_id++) { + tagData = logData->tag_data[tag_locality][tag_id]; + if(tagData) { + iters.push_back(std::make_pair(tagData->version_messages.begin(), tagData->version_messages.end())); } } } @@ -716,18 +696,16 @@ ACTOR Future updateStorage( TLogData* self ) { ++sizeItr; nextVersion = sizeItr == logData->version_sizes.end() ? 
logData->version.get() : sizeItr->key; - for(tag_special = 0; tag_special < 2; tag_special++) { - for(tag_locality = 0; tag_locality < logData->tag_data[tag_special].size(); tag_locality++) { - for(tag_id = 0; tag_id < logData->tag_data[tag_special][tag_locality].size(); tag_id++) { - tagData = logData->tag_data[tag_special][tag_locality][tag_id]; - if(tagData) { - auto it = std::lower_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(prevVersion, LengthPrefixedStringRef()), CompareFirst>()); - for(; it != tagData->version_messages.end() && it->first < nextVersion; ++it) { - totalSize += it->second.expectedSize(); - } - - Void _ = wait(yield(TaskUpdateStorage)); + for(tag_locality = 0; tag_locality < logData->tag_data.size(); tag_locality++) { + for(tag_id = 0; tag_id < logData->tag_data[tag_locality].size(); tag_id++) { + tagData = logData->tag_data[tag_locality][tag_id]; + if(tagData) { + auto it = std::lower_bound(tagData->version_messages.begin(), tagData->version_messages.end(), std::make_pair(prevVersion, LengthPrefixedStringRef()), CompareFirst>()); + for(; it != tagData->version_messages.end() && it->first < nextVersion; ++it) { + totalSize += it->second.expectedSize(); } + + Void _ = wait(yield(TaskUpdateStorage)); } } } From e5c9940f1bc732a3659a23bb4105e239ca71c746 Mon Sep 17 00:00:00 2001 From: Dave Lester Date: Thu, 22 Mar 2018 11:59:40 -0700 Subject: [PATCH 029/127] Adds Code of Conduct, based on Contributor Covenant Code of Conduct. --- CODE_OF_CONDUCT.md | 71 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) create mode 100644 CODE_OF_CONDUCT.md diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000000..2d67240606 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,71 @@ +# Contributor Covenant Code of Conduct + +## Our Pledge + +In the interest of fostering an open and welcoming environment, we as +contributors and maintainers pledge to making participation in our project and +our community a harassment-free experience for everyone, regardless of age, body +size, disability, ethnicity, gender identity and expression, level of experience, +education, socio-economic status, nationality, personal appearance, race, +religion, or sexual identity and orientation. + +## Our Standards + +Examples of behavior that contributes to creating a positive environment +include: + +* Using welcoming and inclusive language +* Being respectful of differing viewpoints and experiences +* Gracefully accepting constructive criticism +* Focusing on what is best for the community +* Showing empathy towards other community members + +Examples of unacceptable behavior by participants include: + +* The use of sexualized language or imagery and unwelcome sexual attention or + advances +* Trolling, insulting/derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or electronic + address, without explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Our Responsibilities + +Project maintainers are responsible for clarifying the standards of acceptable +behavior and are expected to take appropriate and fair corrective action in +response to any instances of unacceptable behavior. 
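A stand-alone sketch of the flattened tag lookup introduced in PATCH 028 above (and mirrored by PATCH 025 in the log router) may also be useful: one flat vector indexed by a signed locality and a tag id, replacing the nested vectors and the earlier `Map<Tag, TagData>`. The index formula `locality >= 0 ? 2*locality : 1 - (2*locality)` is taken from the diff; the simplified `Tag` and `TagData` types and the use of `std::shared_ptr` in place of flow's `Reference` are assumptions for illustration.

```cpp
// Illustrative sketch only -- not part of the patches. Shows how a tag's signed
// locality and id map to positions in a single two-level vector: non-negative
// localities land on even outer indices (0, 2, 4, ...), negative localities on
// odd indices starting at 3 (-1 -> 3, -2 -> 5, ...); outer index 1 is unused.
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

struct Tag { int8_t locality; uint16_t id; };

struct TagData {
    Tag tag;
    uint64_t popped;
    TagData(Tag t, uint64_t p) : tag(t), popped(p) {}
};

struct LogData {
    std::vector<std::vector<std::shared_ptr<TagData>>> tag_data; // [localityIndex][tag.id]

    static size_t localityIndex(int8_t locality) {
        return locality >= 0 ? 2 * locality : 1 - (2 * locality);
    }

    // Grows the vectors as needed and returns the (possibly empty) slot.
    std::shared_ptr<TagData> getTagData(Tag tag) {
        size_t idx = localityIndex(tag.locality);
        if (idx >= tag_data.size()) tag_data.resize(idx + 1);
        if (tag.id >= tag_data[idx].size()) tag_data[idx].resize(tag.id + 1);
        return tag_data[idx][tag.id];
    }

    // Only valid after getTagData(tag) has returned an empty slot,
    // so the vectors are already sized.
    std::shared_ptr<TagData> createTagData(Tag tag, uint64_t popped) {
        auto newTagData = std::make_shared<TagData>(tag, popped);
        tag_data[localityIndex(tag.locality)][tag.id] = newTagData;
        return newTagData;
    }
};

int main() {
    LogData log;
    Tag remote{-2, 7};
    assert(!log.getTagData(remote));          // first lookup: slot exists but is empty
    auto td = log.createTagData(remote, 0);   // populate it
    assert(log.getTagData(remote) == td);     // later lookups are two vector indexings
    return 0;
}
```

Compared with a map keyed by `Tag`, the hot-path lookup becomes two vector indexings with no tree traversal or key comparison, which is the stated point of both optimization patches.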
+ +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct, or to ban temporarily or +permanently any contributor for other behaviors that they deem inappropriate, +threatening, offensive, or harmful. + +## Scope + +This Code of Conduct applies both within project spaces and in public spaces +when an individual is representing the project or its community. Examples of +representing a project or community include using an official project e-mail +address, posting via an official social media account, or acting as an appointed +representative at an online or offline event. Representation of a project may be +further defined and clarified by project maintainers. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by contacting the project team at fdb-conduct@group.apple.com. All +complaints will be reviewed and investigated and will result in a response that +is deemed necessary and appropriate to the circumstances. The project team is +obligated to maintain confidentiality with regard to the reporter of an incident. +Further details of specific enforcement policies may be posted separately. + +Project maintainers who do not follow or enforce the Code of Conduct in good +faith may face temporary or permanent repercussions as determined by other +members of the project's leadership. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, +available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html \ No newline at end of file From 156a5523bebb915ce92c5b6a96376d80d0cc5cf1 Mon Sep 17 00:00:00 2001 From: Evan Tschannen Date: Thu, 22 Mar 2018 15:19:55 -0700 Subject: [PATCH 030/127] hide options that are intended for internal use only --- fdbclient/vexillographer/fdb.options | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fdbclient/vexillographer/fdb.options b/fdbclient/vexillographer/fdb.options index 9714bc4cf8..69740f4eef 100644 --- a/fdbclient/vexillographer/fdb.options +++ b/fdbclient/vexillographer/fdb.options @@ -132,12 +132,14 @@ description is not currently required but encouraged.