Description
The registerWorker path maintains the 'id_worker' map to track worker interface
registrations; it also sets up a 'WorkerAvailabilityWatcher' actor to watch for
failures. However, if a worker is already registered and its interface gets
updated, the existing code doesn't actively cancel the old watcher. One possible
race condition: the old watcher detects a worker failure and removes the worker
from the id_worker map before the new watcher has started monitoring the new
interface. If this scenario is hit, it trips an assert in the rebootAndCheck
routine, which expects the worker to be present in the id_worker map.
The patch addresses the race condition by actively cancelling the existing
watcher actor before registering the new one, as sketched below.
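A minimal sketch of the pattern, under some stated assumptions: Flow cancels an actor by cancelling (or dropping) its Future, which a self-contained example can't show, so C++20 std::jthread/std::stop_token stand in for actor cancellation here. WorkerInfo, watchWorkerFailure, and the shape of registerWorker are illustrative, not the real ClusterController code, and cross-thread synchronization is elided because Flow actors are cooperatively scheduled on one thread.

```cpp
// Sketch only: plain C++20 standing in for Flow actors. All names here
// (WorkerInfo, watchWorkerFailure) are illustrative stand-ins.
#include <chrono>
#include <map>
#include <stop_token>
#include <string>
#include <thread>

struct WorkerInterface {
    std::string address;  // stand-in for the real worker interface
};

struct WorkerInfo {
    WorkerInterface interf;
    std::jthread watcher;  // availability watcher; jthread stops + joins on destruction
};

std::map<int, WorkerInfo> id_worker;  // worker ID -> registration state

// Stand-in for the WorkerAvailabilityWatcher actor: on detecting failure of
// the watched interface it would erase the worker from id_worker (failure
// detection and the erase are elided to keep the sketch race-free).
void watchWorkerFailure(std::stop_token st, int id) {
    while (!st.stop_requested()) {
        // ... wait for failure of id_worker[id].interf; on failure,
        //     remove the worker from id_worker and return ...
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

void registerWorker(int id, WorkerInterface interf) {
    auto it = id_worker.find(id);
    if (it != id_worker.end()) {
        // Interface updated for an already-registered worker: cancel the old
        // watcher *before* installing the new one, so it cannot race ahead
        // and erase the map entry the new watcher depends on.
        it->second.watcher.request_stop();
        it->second.watcher.join();
        it->second.interf = interf;
        it->second.watcher = std::jthread(watchWorkerFailure, id);
    } else {
        id_worker[id] = WorkerInfo{interf, std::jthread(watchWorkerFailure, id)};
    }
}

int main() {
    registerWorker(1, {"10.0.0.1:4500"});
    registerWorker(1, {"10.0.0.2:4500"});  // interface update: old watcher cancelled first
}
```

The essential ordering is stopping and joining the old watcher before the new one is installed; once that holds, the stale watcher can no longer erase the map entry out from under its replacement.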
Testing
devRunCorrectness - 100K
* Cleaned up BlobGranule TODOs + FIXMEs and addressed some of them
* popping the change feed at the correct version
* blob worker taking over a granule will pop from where the previous worker left off (see the takeover sketch after this list)
* addressed FIXME of blob worker not re-snapshotting from the old change feed
* formatting
* more change feed popped fixes after pop updates
* Getting rid of the change feed parallelism lock, since it can cause deadlocks in fetching, and relying on the full fetch lock instead (see the deadlock sketch after this list)
* New blob worker metric and fixing the old one
* server-side popped checking still doesn't work because of pops at non-mutation versions
* format
The intent of this test is to show that figureVersion does something
reasonable if undefined behavior is invoked, but just invoking it is
already problematic, so it's not an interesting test.
* Added purgeAtLatest to BlobGranuleVerifier
* Also checking for merge resnapshot/fully delete granule purge races
* error check and count for purgeAtLatest
* changing test defaults back
* adding final data check after final availability check in blob granule verifier
* Fixing merge boundary recovery
* fixing an edge case in blob manager repeat recruitment
* fixing a race between tenant loading and key alignment
* formatting
* doing a force purge at the end of the test if it didn't happen during the test
* better checking for granule metadata left over after purge
* retrying for in-flight granules that weren't known before purge
* letting the test pass for a hard-to-fix purge case
* changed post-test force purge to only be after final availability check
* Made force purge more robust to split+merge races
* fix flush/force purge change feed leak
* Better fix for assign while force purge races
* fixes for more merge/force purge races
* cleanup
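On the takeover-pop item above: a hedged sketch of the general pattern of persisting a feed's popped version so a successor worker resumes from it rather than from zero. persistedPoppedVersion, popChangeFeed, and takeoverStartVersion are invented for illustration and are not FDB APIs; FDB would persist this state transactionally in granule metadata, while a plain map keeps the sketch self-contained.

```cpp
// Illustration only: resume popping from the predecessor's persisted progress.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

using Version = int64_t;

// Stand-in for durable per-granule metadata.
std::map<std::string, Version> persistedPoppedVersion;

// Record progress whenever the feed is popped, so a successor can resume.
void popChangeFeed(const std::string& granuleId, Version popVersion) {
    Version& v = persistedPoppedVersion[granuleId];
    v = std::max(v, popVersion);  // pop versions only move forward
    // ... issue the actual change feed pop at popVersion ...
}

// A worker taking over the granule starts popping from wherever the
// previous worker left off, never from version 0.
Version takeoverStartVersion(const std::string& granuleId) {
    auto it = persistedPoppedVersion.find(granuleId);
    return it == persistedPoppedVersion.end() ? 0 : it->second;
}

int main() {
    popChangeFeed("granule-A", 1000);
    // New worker takes over and continues from version 1000, not 0.
    Version resumeAt = takeoverStartVersion("granule-A");
    (void)resumeAt;
}
```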
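On the parallelism-lock removal above: in FDB these are FlowLock semaphores awaited by actors, but the underlying hazard is the classic two-lock circular wait, shown here with plain std::mutex and hypothetical names (fetchLock, feedParallelismLock). This illustrates the failure mode, not the actual FDB code paths.

```cpp
// Illustration only: two locks acquired in opposite orders can deadlock;
// collapsing to one lock removes the circular wait. Names are hypothetical.
#include <mutex>

std::mutex fetchLock;            // serializes full granule fetches
std::mutex feedParallelismLock;  // formerly bounded concurrent feed reads

// Before: if one task holds fetchLock and waits for feedParallelismLock
// while another holds feedParallelismLock and waits for fetchLock,
// neither can ever proceed.
void fetchGranuleBefore() {
    std::lock_guard<std::mutex> fetch(fetchLock);
    std::lock_guard<std::mutex> feed(feedParallelismLock);
    // ... read snapshot + change feed ...
}
void readChangeFeedBefore() {
    std::lock_guard<std::mutex> feed(feedParallelismLock);
    std::lock_guard<std::mutex> fetch(fetchLock);  // opposite order: deadlock risk
    // ... read change feed ...
}

// After: only the full fetch lock remains, so no circular wait is possible.
void fetchGranuleAfter() {
    std::lock_guard<std::mutex> fetch(fetchLock);
    // ... read snapshot + change feed under the single lock ...
}

int main() {
    fetchGranuleAfter();
}
```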