StorageServerTracker: Fix OOM bug caused by server health status toggling indefinitely
When there is only one healthy team, the bug marks a server as unhealthy, which drops the healthy team count to 0. That triggers StorageServerTracker to loop back and reset the server's status to healthy, which makes the healthy team count non-zero again. The two states then alternate forever. The infinite loop prevents TraceEvent from flushing, so buffered trace events consume most of the memory until the process runs out of memory. Kudos to JingYu Zhou (jingyu_zhou@apple.com), the main contributor who found the bug!
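The feedback loop described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyCollection` and `roundsUntilStable` are hypothetical names, not FoundationDB code, and the model assumes a single team whose health depends solely on whether its one server is marked undesired.

```cpp
#include <cassert>

// Hypothetical, simplified model of the tracker's decision; not FoundationDB code.
struct ToyCollection {
    int optimalTeamCount = 1;
    int healthyTeamCount = 1; // assumption: the only healthy team contains our server
};

// Re-evaluates the server status until the healthy team count stops changing.
// Returns the number of evaluations, or maxRounds if the status never settles.
int roundsUntilStable(bool checkHealthyTeams, int maxRounds) {
    assert(maxRounds > 0);
    ToyCollection self;
    for (int round = 1; round <= maxRounds; ++round) {
        // Old condition: optimalTeamCount > 0 && healthyTeamCount > 0.
        // New condition: optimalTeamCount > 0 only.
        bool undesired = self.optimalTeamCount > 0 &&
                         (!checkHealthyTeams || self.healthyTeamCount > 0);
        // Assumption: an undesired server makes its only team unhealthy.
        int newHealthy = undesired ? 0 : 1;
        if (newHealthy == self.healthyTeamCount) {
            return round; // nothing changed, so the tracker is not woken up again
        }
        self.healthyTeamCount = newHealthy; // change wakes the tracker for another round
    }
    return maxRounds; // status kept bouncing for the whole budget
}
```

In this model, `roundsUntilStable(true, 1000)` uses up all 1000 rounds because the status flips on every evaluation, while `roundsUntilStable(false, 1000)` settles on the second evaluation.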
parent c9097cca18
commit 26e1d565f6
@@ -3474,9 +3474,9 @@ ACTOR Future<Void> storageServerTracker(
 			}
 
 			if( server->lastKnownClass.machineClassFitness( ProcessClass::Storage ) > ProcessClass::UnsetFit ) {
-				// We saw a corner case in in 3 data_hall configuration
-				// when optimalTeamCount = 1, healthyTeamCount = 0.
-				if (self->optimalTeamCount > 0 && self->healthyTeamCount > 0) {
+				// NOTE: Should not use self->healthyTeamCount > 0 in if statement, which will cause status bouncing between
+				// healthy and unhealthy
+				if (self->optimalTeamCount > 0) {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 						.detail("Server", server->id)
 						.detail("OptimalTeamCount", self->optimalTeamCount)
@@ -3484,7 +3484,6 @@ ACTOR Future<Void> storageServerTracker(
 					status.isUndesired = true;
 				}
 				otherChanges.push_back( self->zeroOptimalTeams.onChange() );
-				otherChanges.push_back(self->zeroHealthyTeams->onChange());
 			}
 
 			//If this storage server has the wrong key-value store type, then mark it undesired so it will be replaced with a server having the correct type
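The second hunk stops the tracker from waiting on changes to `zeroHealthyTeams`. A rough sketch of the effect, assuming a plain callback-based observable: `ToyFlag` and `countWakeups` are hypothetical, and real Flow `onChange()` futures are one-shot rather than persistent callbacks as modeled here.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an observable boolean; not a FoundationDB type.
class ToyFlag {
public:
    explicit ToyFlag(bool v) : value_(v) {}
    void onChange(std::function<void()> cb) { listeners_.push_back(std::move(cb)); }
    void set(bool v) {
        if (v == value_) return; // only real changes notify listeners
        value_ = v;
        for (auto& cb : listeners_) cb();
    }
    bool get() const { return value_; }

private:
    bool value_;
    std::vector<std::function<void()>> listeners_;
};

// Counts how often a tracker would be woken while the healthy-team flag flips three times.
int countWakeups(bool subscribeToHealthy) {
    ToyFlag zeroOptimalTeams(false), zeroHealthyTeams(false);
    int wakeups = 0;
    zeroOptimalTeams.onChange([&] { ++wakeups; });
    if (subscribeToHealthy) {
        zeroHealthyTeams.onChange([&] { ++wakeups; }); // the subscription this commit removes
    }
    zeroHealthyTeams.set(true);
    zeroHealthyTeams.set(false);
    zeroHealthyTeams.set(true);
    assert(zeroHealthyTeams.get());
    return wakeups;
}
```

With the subscription in place the toy tracker is woken on every flip of the flag (`countWakeups(true)` returns 3); without it, flips of the healthy-team flag no longer re-run the status check (`countWakeups(false)` returns 0).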