Previously, when this module was unloaded via 'rmmod' with at least one
drive attached, the SCSI error handler thread would become stuck in an
infinite recovery loop and lockup the system, necessitating a reboot.
Once the SAS layer is detached, the driver will fail any subsequent
commands since the target devices are removed. However, removing the
SCSI host generates a SYNCHRONIZE CACHE (10) command, which was failed
and left the error handler no method of recovery.
This patch simply removes the SCSI host first so that no more commands
can come down, prior to cleaning up the SAS layer. Note that the stack
is built up with the SCSI host first, and then the SAS layer. Perhaps
it should be reversed for symmetry, so that commands cannot be sent to
the pm80xx driver prior to attaching the SAS layer?
What was really strange about this bug was that it was introduced at
commit cff549e486 ("[SCSI]: proper state checking and module refcount
handling in scsi_device_get"). This commit appears to tinker with how
the reference counting is performed for SCSI device objects. My theory
is that prior to this commit, the refcount for a device object was
blindly incremented at some point during the teardown process which
coincidentially made the device stick around during the procedure, which
also coincidentially made any commands sent to the driver not fail
(since the device was technically still "there"). After this commit was
applied, my theory is the refcount for the device object is not being
incremented at a specific point anymore, which makes the device go away,
and thus made the pm80xx driver fail any subsequent commands.
You may also want to see the following for more details:
[1] http://www.spinics.net/lists/linux-scsi/msg37208.html
[2] http://marc.info/?l=linux-scsi&m=144416476406993&w=2
Signed-off-by: Benjamin Rood <brood@attotech.com>
Acked-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>