StorageServerTracker: Fix OOM bug caused by server health status toggling infinitely

When there is only one healthy team, the bug sets a server's status to unhealthy, which drops the healthy team count to 0. That change triggers StorageServerTracker to loop back, which resets the server's status to healthy and makes the healthy team count non-zero again. This pattern repeats forever. The infinite loop prevents TraceEvent from flushing, so pending trace events consume most of the memory until the process runs out of memory. Kudos to JingYu Zhou (jingyu_zhou@apple.com), the main contributor who found the bug!
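The toggle described above can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not FoundationDB code: `TrackerModel` and `countStatusFlips` are hypothetical names, a single server stands in for the only healthy team, and each loop iteration stands in for one wakeup of the tracker. The `checkHealthyTeamCount` flag selects between the old condition (`optimalTeamCount > 0 && healthyTeamCount > 0`) and the new one (`optimalTeamCount > 0`).

```cpp
// Hypothetical, heavily simplified model of the status toggle; not FoundationDB code.
struct TrackerModel {
    int optimalTeamCount = 1;  // at least one team of better-fitting servers exists
    int healthyTeamCount = 1;  // the single healthy team contains the tracked server
    bool isUndesired = false;  // current status of the tracked server
};

// Re-evaluates the server's status until it stops changing or maxWakeups is reached.
// Returns the number of status changes that were observed.
int countStatusFlips(bool checkHealthyTeamCount, int maxWakeups) {
    TrackerModel m;
    int flips = 0;
    for (int i = 0; i < maxWakeups; ++i) {
        // Old condition when checkHealthyTeamCount is true, new condition otherwise.
        bool undesired = m.optimalTeamCount > 0 &&
                         (!checkHealthyTeamCount || m.healthyTeamCount > 0);
        if (undesired == m.isUndesired) {
            break;  // status is stable, so the tracker would simply keep waiting
        }
        m.isUndesired = undesired;
        ++flips;
        // Assumption: the only healthy team contains this server, so the team's
        // health follows the server's status directly.
        m.healthyTeamCount = undesired ? 0 : 1;
    }
    return flips;
}
```

In this model, `countStatusFlips(true, 1000)` uses up all 1000 wakeups because the status changes on every pass, while `countStatusFlips(false, 1000)` settles after a single change.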
2019-10-09 17:45:06 -07:00 · 2019-10-09 17:45:06 -07:00 · 26e1d565f6
parent c9097cca18
commit 26e1d565f6
1 changed file with 3 additions and 4 deletions
--- a/fdbserver/DataDistribution.actor.cpp
+++ b/fdbserver/DataDistribution.actor.cpp
@@ -3474,9 +3474,9 @@ ACTOR Future<Void> storageServerTracker(
 			}

 			if( server->lastKnownClass.machineClassFitness( ProcessClass::Storage ) > ProcessClass::UnsetFit ) {
-				// We saw a corner case in in 3 data_hall configuration
-				// when optimalTeamCount = 1, healthyTeamCount = 0.
-				if (self->optimalTeamCount > 0 && self->healthyTeamCount > 0) {
+				// NOTE: Should not use self->healthyTeamCount > 0 in if statement, which will cause status bouncing between
+				// healthy and unhealthy
+				if (self->optimalTeamCount > 0) {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 					    .detail("Server", server->id)
 					    .detail("OptimalTeamCount", self->optimalTeamCount)
@@ -3484,7 +3484,6 @@ ACTOR Future<Void> storageServerTracker(
 					status.isUndesired = true;
 				}
 				otherChanges.push_back( self->zeroOptimalTeams.onChange() );
-				otherChanges.push_back(self->zeroHealthyTeams->onChange());
 			}

 			//If this storage server has the wrong key-value store type, then mark it undesired so it will be replaced with a server having the correct type