StorageServerTracker: Fix OOM bug caused by server health status toggling infinitely

When there is only one healthy team, the bug marks a server in that team as unhealthy,
which drops the healthy team count to 0 and triggers StorageServerTracker to loop back.
The next pass resets the server's status to healthy, so the healthy team count becomes non-zero again.

This pattern repeats indefinitely, producing an infinite loop.

The infinite loop prevents TraceEvent from flushing, so buffered trace events
consume most of the memory until the process runs out of memory.
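
The toggling described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the FoundationDB code: the `Model` struct, the `countToggles` function, and the single-team, single-server setup are all hypothetical, and the real tracker is an asynchronous actor rather than a plain loop.

```cpp
#include <cassert>

// Hypothetical toy model: a single team that contains a single server.
struct Model {
    int optimalTeamCount = 1;      // assumed: one optimal team exists
    bool serverUndesired = false;  // status the tracker assigns to the server

    // Assumption: the only team is healthy exactly when its server is not undesired.
    int healthyTeamCount() const { return serverUndesired ? 0 : 1; }
};

// Re-evaluates the server status up to maxRounds times and returns how many
// times the status changed. guardOnHealthyTeams selects the old condition
// (optimalTeamCount > 0 && healthyTeamCount > 0) or the new one
// (optimalTeamCount > 0 only).
int countToggles(bool guardOnHealthyTeams, int maxRounds) {
    assert(maxRounds >= 0);
    Model m;
    int toggles = 0;
    for (int round = 0; round < maxRounds; ++round) {
        bool undesired = m.optimalTeamCount > 0 &&
                         (!guardOnHealthyTeams || m.healthyTeamCount() > 0);
        if (undesired == m.serverUndesired) {
            break;  // status is stable, nothing wakes the tracker again
        }
        m.serverUndesired = undesired;  // the change feeds back into healthyTeamCount()
        ++toggles;
    }
    return toggles;
}
```

In this model, the old condition changes the status on every round because each change flips the healthy team count it depends on, while the new condition settles after a single change.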

Kudos to JingYu Zhou (jingyu_zhou@apple.com), the main contributor who found the bug!
Meng Xu 2019-10-09 17:45:06 -07:00
parent c9097cca18
commit 26e1d565f6
1 changed files with 3 additions and 4 deletions

@@ -3474,9 +3474,9 @@ ACTOR Future<Void> storageServerTracker(
}
if( server->lastKnownClass.machineClassFitness( ProcessClass::Storage ) > ProcessClass::UnsetFit ) {
-// We saw a corner case in in 3 data_hall configuration
-// when optimalTeamCount = 1, healthyTeamCount = 0.
-if (self->optimalTeamCount > 0 && self->healthyTeamCount > 0) {
+// NOTE: Should not use self->healthyTeamCount > 0 in if statement, which will cause status bouncing between
+// healthy and unhealthy
+if (self->optimalTeamCount > 0) {
TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
.detail("Server", server->id)
.detail("OptimalTeamCount", self->optimalTeamCount)
@@ -3484,7 +3484,6 @@ ACTOR Future<Void> storageServerTracker(
status.isUndesired = true;
}
otherChanges.push_back( self->zeroOptimalTeams.onChange() );
-otherChanges.push_back(self->zeroHealthyTeams->onChange());
}
//If this storage server has the wrong key-value store type, then mark it undesired so it will be replaced with a server having the correct type