If a cluster has not changed its storeType for a while, we do not need to
call the removeWrongStoreType actor periodically.
This is the same way the badTeamRemover actor is handled.
If the SS with the wrong storeType that was picked for removal fails
before it triggers the next round of checking whether an SS has the wrong
storeType, we should time out and invoke the check.
Otherwise, the removeWrongStoreType actor will never run again.
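
The "trigger or time out" rule above can be sketched in plain C++ as
follows. This is a minimal model, not the Flow actor itself: the struct
name, the timeout parameter, and the caller-supplied clock are all
assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical model of when the wrong-storeType check should run.
// Times are seconds on a caller-supplied clock, so the logic is deterministic.
struct WrongStoreTypeCheckScheduler {
    double timeoutSeconds;      // assumed setting, not a real knob name
    double lastCheckTime = 0.0; // time at which the check last ran
    bool triggered = false;     // set when a removal finishes

    explicit WrongStoreTypeCheckScheduler(double timeout) : timeoutSeconds(timeout) {
        assert(timeout > 0.0);
    }

    // Called when the SS picked for removal is actually gone.
    void trigger() { triggered = true; }

    // Returns true if the check should run at time `now`: either an explicit
    // trigger arrived, or the timeout elapsed because the trigger was lost.
    bool shouldRunCheck(double now) {
        const bool timedOut = (now - lastCheckTime) >= timeoutSeconds;
        if (triggered || timedOut) {
            triggered = false;
            lastCheckTime = now;
            return true;
        }
        return false;
    }
};
```

If the picked SS fails and trigger() is never called, shouldRunCheck()
still returns true once timeoutSeconds have passed since the last check,
so the check cannot stall forever.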
Multiple storage server recruitment requests may be buffered in the
cluster controller, in the hope that the cluster controller will soon
find an available worker for each request.
It is possible that many outstanding storage recruitment requests are
fulfilled by the cluster controller within a very short time interval.
When DD handles those requests, it blindly initializes a storage server
on each recruited worker and lets the storage server tracker remove
storage servers that share the same process (ip, port).
This is problematic because multiple SS on the same process can push
the process to OOM. Even in simulation, initializing too many SS causes
the simulator to OOM.
This commit limits the number of SS on a process to at most 2.
We cannot limit the number of SS on a process to 1 right now,
because current simulation tests may change the configuration in ways that
fail unless more than 1 SS is allowed on a process.
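
A minimal sketch of the per-process limit, assuming DD keeps a count of
storage servers keyed by process address. The class, method, and constant
names here are hypothetical; only the limit of 2 comes from this commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Limit described above; the constant name is illustrative.
constexpr int kMaxStorageServersPerProcess = 2;

// Hypothetical bookkeeping for storage servers per process ("ip:port").
class RecruitmentLimiter {
    std::map<std::string, int> countByAddress_;

public:
    // Returns true and records the new SS if the process still has room;
    // returns false if initializing another SS would exceed the limit.
    bool tryRecruit(const std::string& address) {
        assert(!address.empty());
        int& count = countByAddress_[address];
        if (count >= kMaxStorageServersPerProcess) {
            return false;
        }
        ++count;
        return true;
    }

    // Called when a storage server on `address` has been removed.
    void onServerRemoved(const std::string& address) {
        auto it = countByAddress_.find(address);
        if (it != countByAddress_.end() && --it->second <= 0) {
            countByAddress_.erase(it);
        }
    }
};
```

With this bookkeeping, a burst of fulfilled recruitment requests for the
same worker is rejected after the second SS instead of being cleaned up
later by the storage server tracker.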
Because the PhysicalDiskMetrics metric is quite stable over short periods of
time, we should suppress it with a 60-second interval to avoid spammy log
messages. Spammy warning logs can cause false positives in correctness tests.
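
Interval-based suppression can be sketched as below. This is a
standalone illustration, not the trace logging API used in the code base;
the class name and the explicit `now` argument are assumptions.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical suppressor: lets an event through at most once per interval.
class EventSuppressor {
    double intervalSeconds_;
    std::unordered_map<std::string, double> lastLogged_; // event name -> last log time

public:
    explicit EventSuppressor(double intervalSeconds) : intervalSeconds_(intervalSeconds) {
        assert(intervalSeconds > 0.0);
    }

    // Returns true if `eventName` should be logged at time `now` (seconds).
    bool shouldLog(const std::string& eventName, double now) {
        auto it = lastLogged_.find(eventName);
        if (it != lastLogged_.end() && now - it->second < intervalSeconds_) {
            return false; // logged too recently, suppress
        }
        lastLogged_[eventName] = now;
        return true;
    }
};
```

An EventSuppressor constructed with 60.0 logs PhysicalDiskMetrics at most
once per minute while leaving other event names unaffected.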
When too many outstanding requests cannot find a worker for the storage server
role, many identical errors are written to the trace log. One error is enough
to flag the problem.
Too many identical errors cause false positives in nightly tests and thus
should be suppressed.
If DD checks the storeType of storage servers before some storage servers are
fully available, DD may miss storage servers that need to be removed.
To ensure no storage server with the wrong storeType is missed, SS sets
doRemoveWrongStoreType to true.
To avoid removing multiple servers at the same time, the actor waits for a
configurable delay before checking and removing a storage server.
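
The flag and the one-at-a-time removal can be modeled roughly as
follows. Only the name doRemoveWrongStoreType comes from this commit; the
struct layout, the method names, and the point at which the flag is set
are assumptions, and the configurable delay is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct StorageServerInfo {
    std::string id;
    std::string storeType; // e.g. "ssd" or "memory"
};

// Hypothetical model of the removal rounds.
struct WrongStoreTypeRemover {
    std::string configuredStoreType;
    bool doRemoveWrongStoreType = false;

    explicit WrongStoreTypeRemover(const std::string& configured)
      : configuredStoreType(configured) {
        assert(!configured.empty());
    }

    // Assumed hook: called when a storage server becomes fully available.
    void onServerAvailable() { doRemoveWrongStoreType = true; }

    // One round, meant to run after the configurable delay. Returns the
    // index of the single server to remove, or -1 if there is none.
    int runRound(const std::vector<StorageServerInfo>& servers) {
        if (!doRemoveWrongStoreType) {
            return -1;
        }
        for (std::size_t i = 0; i < servers.size(); ++i) {
            if (servers[i].storeType != configuredStoreType) {
                return static_cast<int>(i); // flag stays set for the next round
            }
        }
        doRemoveWrongStoreType = false; // nothing left to remove
        return -1;
    }
};
```

Each round removes at most one server, and the flag is cleared only after
a round finds no server with the wrong storeType, so a server that
becomes available late is still picked up.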