OpenCloudOS-Kernel/drivers/vfio
Anthony DeRossi e806e22362 vfio/pci: Check the device set open count on reset
vfio_pci_dev_set_needs_reset() inspects the open_count of every device
in the set to determine whether a reset is allowed. The current device
always has open_count == 1 within vfio_pci_core_disable(), effectively
disabling the reset logic. This field is also documented as private in
vfio_device, so it should not be used to determine whether other devices
in the set are open.

Checking for vfio_device_set_open_count() > 1 on the device set fixes
both issues.
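
The difference between the two checks can be sketched with a small user-space model. This is an illustration only, not the kernel code: the model_* structures and functions are hypothetical stand-ins for vfio_device, vfio_device_set, and vfio_device_set_open_count(), and the model assumes the set-wide count is the sum of the per-device open counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device state the check looks at. */
struct model_device {
	unsigned int open_count;
	bool needs_reset;
};

/* Hypothetical stand-in for a device set: a plain array of devices. */
struct model_dev_set {
	struct model_device *devices;
	size_t device_count;
};

/* Assumed behaviour of the set-wide counter: sum of per-device counts. */
static unsigned int model_set_open_count(const struct model_dev_set *set)
{
	unsigned int total = 0;

	assert(set != NULL);
	for (size_t i = 0; i < set->device_count; i++)
		total += set->devices[i].open_count;
	return total;
}

/*
 * Old-style check: refuse the reset if any device in the set is open.
 * The device being closed still has open_count == 1 at this point, so
 * this returns false whenever it is called from the disable path.
 */
static bool model_needs_reset_old(const struct model_dev_set *set)
{
	bool needs_reset = false;

	for (size_t i = 0; i < set->device_count; i++) {
		if (set->devices[i].open_count)
			return false;
		needs_reset |= set->devices[i].needs_reset;
	}
	return needs_reset;
}

/*
 * New-style check: refuse the reset only if the set-wide count exceeds 1,
 * that is, if something other than the device being closed is still open.
 */
static bool model_needs_reset_new(const struct model_dev_set *set)
{
	bool needs_reset = false;

	if (model_set_open_count(set) > 1)
		return false;

	for (size_t i = 0; i < set->device_count; i++)
		needs_reset |= set->devices[i].needs_reset;
	return needs_reset;
}
```

In this model, a set in which only the device being closed is open (open_count == 1, needs_reset set) is never reset by the old check, while the new check allows it. Both checks still refuse the reset when a second device in the set is open.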

After commit 2cd8b14aaa ("vfio/pci: Move to the device set
infrastructure"), failure to create a new file for a device would cause
the reset to be skipped due to open_count being decremented after
calling close_device() in the error path.

After commit eadd86f835 ("vfio: Remove calls to
vfio_group_add_container_user()"), releasing a device would always skip
the reset due to an ordering change in vfio_device_fops_release().

Failing to reset the device leaves it in an unknown state, potentially
causing errors when it is accessed later or bound to a different driver.

This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
assigned to a Windows guest. After shutting down the guest, unbinding
the device from vfio-pci, and binding the device to amdgpu:

[  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init

Fixes: 2cd8b14aaa ("vfio/pci: Move to the device set infrastructure")
Fixes: eadd86f835 ("vfio: Remove calls to vfio_group_add_container_user()")
Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221110014027.28780-4-ajderossi@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-11-10 12:03:36 -07:00
fsl-mc vfio/fsl-mc: Use the new device life cycle helpers 2022-09-21 14:15:11 -06:00
mdev vfio/mdev: add mdev available instance checking to the core 2022-10-04 12:06:58 -06:00
pci vfio/pci: Check the device set open count on reset 2022-11-10 12:03:36 -07:00
platform vfio/amba: Use the new device life cycle helpers 2022-09-21 14:15:11 -06:00
Kconfig vfio: Introduce the DMA logging feature support 2022-09-08 12:59:00 -06:00
Makefile vfio: Move container code into drivers/vfio/container.c 2022-09-22 15:46:06 -06:00
container.c vfio: Change vfio_group->group_rwsem to a mutex 2022-10-04 12:06:58 -06:00
iova_bitmap.c vfio: Add an IOVA bitmap support 2022-09-08 12:59:00 -06:00
vfio.h vfio: Make the group FD disassociate from the iommu_group 2022-10-07 08:10:52 -06:00
vfio_iommu_spapr_tce.c vfio/spapr_tce: Fix the comment 2022-07-22 16:24:47 -06:00
vfio_iommu_type1.c Merge branches 'apple/dart', 'arm/mediatek', 'arm/omap', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd' and 'core' into next 2022-09-26 15:52:31 +02:00
vfio_main.c vfio: Export the device set open count 2022-11-10 12:03:36 -07:00
vfio_spapr_eeh.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
virqfd.c vfio/virqfd: Drain events from eventfd in virqfd_wakeup() 2020-11-15 09:49:10 -05:00