amba_id are not supposed to change at runtime. All functions
working with const amba_id. So mark the non-const structs as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When the user unbinds the last device of a group from a vfio bus
driver, the devices within that group should be available for other
purposes. We currently have a race that makes this generally, but
not always true. The device can be unbound from the vfio bus driver,
but remaining IOMMU context of the group attached to the container
can result in errors as the next driver configures DMA for the device.
Wait for the group to be detached from the IOMMU backend before
allowing the bus driver remove callback to complete.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In vfio_iommu_group_get() we want to increase the reference
count of the iommu group.
In noiommu case, the group does not exist and is allocated.
iommu_group_add_device() increases the group ref count. However we
then call iommu_group_put() which decrements it.
This leads to a "refcount_t: underflow WARN_ON".
Only decrement the ref count in case of iommu_group_add_device
failure.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.
Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.
Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map. Add a case for these to avoid reporting an
erroneous overlap.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Device lock bites again; if a device .remove() callback races a user
calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
the device lock, but the user ioctl may have already taken a vfio_device
reference. In the case of a PCI device, the initial open will attempt
to reset the device, which again attempts to get the device lock,
resulting in deadlock. Use the trylock PCI reset interface and return
error on the open path if reset fails due to lock contention.
Link: https://lkml.org/lkml/2017/7/25/381
Reported-by: Wen Congyang <wencongyang2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJZZ71TAAoJECObm247sIsi1mUP/1/5kubhHRE/Y11SnX4Bpie0
X6YVbV3WLGuV9jaFMI7EgYJLZ6pBvCRX0CLUzEdrbS4LAlTLQu2GSY5tqhcLJumE
0mV1Wj1jOO9BAur13/sJ+IohbZeK10dtbTkWv+YhrVSpRaLP+ituaKsajEV12WRH
6dxho2bqfvNcTC8yjE+vqe9mU3jXYPGuCx8oYGaDXEsCjzrdbOFnAG0/s/2cWpb+
D1ADHd3020VAHZRnHRBLFfMczza1jqllhSAUfdMw1gRGCQDq3k1XenzVNLLTbtYy
VmEWHa+R/OCfbKVxaDqPsgOTK7x7DKn+Pzb3lWCdQ8v5X+2ubHpVIYjxTDSSTbt3
YJ7a+hNk8AHkFgwS7x8BdOT8mmNGb1NZldjS4dv2VWkfcTnMQnubnYSCzGztto9h
P2THKBil6djPb9S3pCvtKUiHSIZedYZlKofUldrOdGDAZmzLLlf8lTzijGjDYKFM
pQeZC+xeEhZXURipgkH+a+paYgDtKEfwSlABODghjCcJf7S/GbyVPLOKLXzoVb2y
Ml8eGlo4O/cNniQK5faH447ilM7hzS1aG83uGnHTfe8VgKjI7Z5ZSxKOtoEq5bDz
bb91E6GVLKHqT0LVS1YZfrnqK0hX/QAd/sK1REM5nN8JNmPLyLpjv8FaJEhpk1vC
z4At5+pfKM8DYrW3EGmc
=3A4K
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
* tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio:
vfio: Remove unnecessary uses of vfio_container.group_lock
vfio: New external user group/file match
kvm-vfio: Decouple only when we match a group
vfio: Fix group release deadlock
vfio: Use ERR_CAST() instead of open coding it
vfio/pci: Add Intel XXV710 to hidden INTx devices
The original intent of vfio_container.group_lock is to protect
vfio_container.group_list, however over time it's become a crutch to
prevent changes in container composition any time we call into the
iommu driver backend. This introduces problems when we start to have
more complex interactions, for example when a user's DMA unmap request
triggers a notification to an mdev vendor driver, who responds by
attempting to unpin mappings within that request, re-entering the
iommu backend. We incorrectly assume that the use of read-locks here
allow for this nested locking behavior, but a poorly timed write-lock
could in fact trigger a deadlock.
The current use of group_lock seems to fall into the trap of locking
code, not data. Correct that by removing uses of group_lock that are
not directly related to group_list. Note that the vfio type1 iommu
backend has its own mutex, vfio_iommu.lock, which it uses to protect
itself for each of these interfaces anyway. The group_lock appears to
be a redundancy for these interfaces and type1 even goes so far as to
release its mutex to allow for exactly the re-entrant code path above.
Reported-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: stable@vger.kernel.org # v4.10+
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen. The group can be in a state where we wouldn't allow a
new reference to be added. This new helper function allows a caller
to match a file to a group to facilitate this. Given a file and
group, report if they match. Thus the caller needs to already have a
group reference to match to the file. This allows the deletion of a
group without acquiring a new reference.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
If vfio_iommu_group_notifier() acquires a group reference and that
reference becomes the last reference to the group, then vfio_group_put
introduces a deadlock code path where we're trying to unregister from
the iommu notifier chain from within a callout of that chain. Use a
work_struct to release this reference asynchronously.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a small cleanup to use ERR_CAST() here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
XXV710 has the same broken INTx behavior as the rest of the X/XL710
series, the interrupt status register is not wired to report pending
INTx interrupts, thus we never associate the interrupt to the device.
Extend the device IDs to include these so that we hide that the
device supports INTx at all to the user.
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
virtual address space, but a process can request access to the full 512TB by
passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and runtime.
- Several small fixes and cleanups to the kprobes code, as well as support for
KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts, correctly treating
them as NMIs, giving them a dedicated stack and using a new hypervisor call
to trigger them, all of which should aid debugging and robustness.
Many fixes and other minor enhancements.
Thanks to:
Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
ediqGEj0/N1HMB31V5tS
=vSF3
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we
use a 128TB virtual address space, but a process can request access
to the full 512TB by passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator
Interface Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and
runtime.
- Several small fixes and cleanups to the kprobes code, as well as
support for KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts,
correctly treating them as NMIs, giving them a dedicated stack and
using a new hypervisor call to trigger them, all of which should
aid debugging and robustness.
- Many fixes and other minor enhancements.
Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
Yang Shi"
* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
powerpc/powernv: Fix TCE kill on NVLink2
powerpc/mm/radix: Drop support for CPUs without lockless tlbie
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
powerpc/smp: Document irq enable/disable after migrating IRQs
powerpc/mpc52xx: Don't select user-visible RTAS_PROC
powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
powerpc/eeh: Clean up and document event handling functions
powerpc/eeh: Avoid use after free in eeh_handle_special_event()
cxl: Mask slice error interrupts after first occurrence
cxl: Route eeh events to all drivers in cxl_pci_error_detected()
cxl: Force context lock during EEH flow
powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
powerpc/xmon: Teach xmon oops about radix vectors
powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
powerpc/pseries: Enable VFIO
powerpc/powernv: Fix iommu table size calculation hook for small tables
powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
powerpc: Add arch/powerpc/tools directory
...
vfio_pin_pages_remote() is typically called to iterate over a range
of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value. This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page. We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error. Since the function
only pins a single page, there's no need to return the number of pages
pinned.
N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct(). If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.
Suggested-by: Kirti Wankhede <kwankhede@nvidia.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.
vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.
Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds missing checking for kzalloc() return value.
Fixes: 4b6fad7097 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
picks the v2.
Normally the userspace would create a container, attach an IOMMU group
to it and only then set the IOMMU type (which would normally be v2).
However a specific IOMMU group may not support v2, in other words
it may not implement set_window/unset_window/take_ownership/
release_ownership and such a group should not be attached to
a v2 container.
This adds extra checks that a new group can do what the selected IOMMU
type suggests. The userspace can then test the return value from
ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
VFIO_SPAPR_TCE_IOMMU.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.
This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.
Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.
As we are going to have reference counting on tables, we need an unified
way of disposing tables.
This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.
As from now on the iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table() so this
moves this initialization in pnv_pci_ioda2_create_table().
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.
For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.
Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.
Fixes: d30ddcaa7b ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The intent of the original warning is make sure that the mdev vendor
driver has removed any group notifiers at the point where the group
is closed by the user. Theoretically this would be through an
orderly shutdown where any devices are release prior to the group
release. We can't always count on an orderly shutdown, the user can
close the group before the notifier can be removed or the user task
might be killed. We'd like to add this sanity test when the group is
idle and the only references are from the devices within the group
themselves, but we don't have a good way to do that. Instead check
both when the group itself is removed and when the group is opened.
A bit later than we'd prefer, but better than the current over
aggressive approach.
Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
Cc: Jike Song <jike.song@intel.com>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJYry1kAAoJECObm247sIsieQ8P/0iydvBdW3AACPADQIvfV3+K
PfAj3RwO1AvKB2Uiiql8t+S+qr5WO2Z4oeyOsaapmaOq8fxN76nTtIPUPg9SzkUx
DDFNT+Zl/wFGA+vDfPcDFaFAf35mNTLN2hmqcrkTwLL7flA0ICYZzpT0U3nP6aEt
go10DkQXQ1aGJkTfBluIqe0FTBd2EZ3+/mJbjiaHRlV5CcXGEtAAe8IC+XJoJJds
AZ1RQiLauE6Rj4t2JQegZJH+ntX7zPk329jD0DjxZq9y08uEqdyeXWj31O8SRucB
hD4QGN6NS8Qv+TlpdMVnsv6E3oZhUZIObH2p0nhUJ8ZpdEkTZoR/73rJKipz6svX
9TuuiIJL2ibZ40rEpABIx1N0ykHiUmXgtCuVfjiKSXmsdZYSu7bK/CbOVBGvS+WE
dKbgob44QozgD34wt5mv9by1jnxxeDodbIfkXE3p4b2TKaaFFnpCG2Yex6XOjjCU
T4peLwxTyszqVDdghJ5l8S2661+UqS+VZQXSXMQRg4fLLx+HjB0JPo/NcfjNFhUE
/RwXvOoVRRn0EKTX5v5Ty5YVx+/KWGm3wAsa1Ar+YGATKn3tS91Vz4cKYZA/bdUc
GKkGI5FrARjtWF860cBtq/KWk2qYy7EzCdVfkk56e5s4clrZwjbet6W/RmZGSvVn
R+XAN7GucRg/aMJyrwJM
=6Gze
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
* tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio:
vfio: fix a typo in comment of function vfio_pin_pages
vfio: Replace module request with softdep
vfio/mdev: Use a module softdep for vfio_mdev
vfio: Fix build break when SPAPR_TCE_IOMMU=n
Correct the description that 'unpinned' -> 'pinned'.
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The changes include:
* KVM PCIe/MSI passthrough support on ARM/ARM64
* Introduction of a core representation for individual hardware
iommus
* Support for IOMMU privileged mappings as supported by some
ARM IOMMUS
* 16-bit SID support for ARM-SMMUv2
* Stream table optimization for ARM-SMMUv3
* Various fixes and other small improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYqw3hAAoJECvwRC2XARrjPy0P/35ykfHIAESJuF+72ziaoAYA
ZvMrli8rGq7n+ntaIGPx9rV+hZTUSF8V2bfsYV7SAn5iYuViXZqvOtC3BAEp6GNC
cdMeQfqXoHiWVMdXdOihzk+6YCQvBxqPOvUtYFqVhOo3Yrz8Dc71KsKvrTndEUVY
f7bXHKssVONkWMga9lIVDgEefG5VyJPEQaxJXB9ymLHXbwWOcISe1lgtkrzFSxSH
H9YNI07Tfcxfn6rN8jGmcYFYM58xwBicpB4HBw5uytMBYAsxqTEsx4X5dGpOF6RH
cFW9nby+9ZlcTMyuWXKAck3o8df2ZC1xiSjnz+DHQdBPFiFNqIL3PVUcaz9PnF2e
e6Y+DA3s+jykeiCvi2K0Z9RwTg7t8S5spel+UCeNVSnIjE9pqZNLF8vsDjF17zuR
+zcFm7RVI397QVQGp0dbqhtxnwqt/3CX/wlzpvuNdEZa4vwujpcnM9tfl6gyFrF8
awK9Fj5ryAn4DEiM+8yiRHwLrU5ij1cfc8jQdqleEB2ca7Wv3g1uhhS0QTXOFY9u
A7ygOna25U1EcOwjC6ebjiEL115ZEOrXo+eChhzCHoUEHCVxL+L/NAMEsUcMqPIw
3XsHhru0HbXgd5O5wHX39s2je8G3+ElqQwy8Ja3DimV6tvon7yaKCXy9QU+2aa1u
3r53R/0mW1ijtOfK+I0b
=5b3I
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU UPDATES from Joerg Roedel:
- KVM PCIe/MSI passthrough support on ARM/ARM64
- introduction of a core representation for individual hardware iommus
- support for IOMMU privileged mappings as supported by some ARM IOMMUS
- 16-bit SID support for ARM-SMMUv2
- stream table optimization for ARM-SMMUv3
- various fixes and other small improvements
* tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits)
vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
iommu: Remove iommu_register_instance interface
iommu/exynos: Make use of iommu_device_register interface
iommu/mediatek: Make use of iommu_device_register interface
iommu/msm: Make use of iommu_device_register interface
iommu/arm-smmu: Make use of the iommu_register interface
iommu: Add iommu_device_set_fwnode() interface
iommu: Make iommu_device_link/unlink take a struct iommu_device
iommu: Add sysfs bindings for struct iommu_device
iommu: Introduce new 'struct iommu_device'
iommu: Rename struct iommu_device
iommu: Rename iommu_get_instance()
iommu: Fix static checker warning in iommu_insert_device_resv_regions
iommu: Avoid unnecessary assignment of dev->iommu_fwspec
iommu/mediatek: Remove bogus 'select' statements
iommu/dma: Remove bogus dma_supported() implementation
iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space
iommu/vt-d: Don't over-free page table directories
iommu/vt-d: Tylersburg isoch identity map check is done too late.
iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu
...
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 5d70499218 ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Rather than doing a module request from within the init function, add
a soft dependency on the available IOMMU backend drivers. This makes
the dependency visible to userspace when picking modules for the
ram disk.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use an explicit module softdep rather than a request module call such
that the dependency is exposed to userspace. This allows us to more
easily support modules loaded at initrd time.
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the kconfig logic for VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH
is broken when SPAPR_TCE_IOMMU=n. Leading to:
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
drivers/vfio/vfio_iommu_spapr_tce.c:113:8: error: implicit declaration of function 'mm_iommu_find'
This stems from the fact that VFIO selects VFIO_IOMMU_SPAPR_TCE, and
although it has an if clause, the condition is not correct.
We could fix it by doing select VFIO_IOMMU_SPAPR_TCE if SPAPR_TCE_IOMMU,
but the cleaner fix is to drop the selects and tie VFIO_IOMMU_SPAPR_TCE
to the value of VFIO, and express the dependencies in only once place.
Do the same for VFIO_SPAPR_EEH.
The end result is that the values of VFIO_IOMMU_SPAPR_TCE and
VFIO_SPAPR_EEH follow the value of VFIO, except when SPAPR_TCE_IOMMU=n
and/or EEH=n. Which is exactly what we want to happen.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a container already has a group attached, attaching a new group
should just program already created IOMMU tables to the hardware via
the iommu_table_group_ops::set_window() callback.
However commit 6f01cc692a ("vfio/spapr: Add a helper to create
default DMA window") did not just simplify the code but also removed
the set_window() calls in the case of attaching groups to a container
which already has tables so it broke VFIO PCI hotplug.
This reverts set_window() bits in tce_iommu_take_ownership_ddw().
Fixes: 6f01cc692a ("vfio/spapr: Add a helper to create default DMA window")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit d9c728949d ("vfio/spapr: Postpone default window creation")
added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and
made it possible to return from tce_iommu_ioctl() without unlocking
container->lock; this fixes the issue.
Fixes: d9c728949d ("vfio/spapr: Postpone default window creation")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case the IOMMU translates MSI transactions (typical case
on ARM), we check MSI remapping capability at IRQ domain
level. Otherwise it is checked at IOMMU level.
At this stage the arm-smmu-(v3) still advertise the
IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
removed in subsequent patches.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When attaching a group to the container, check the group's
reserved regions and test whether the IOMMU translates MSI
transactions. If yes, we initialize an IOVA allocator through
the iommu_get_msi_cookie API. This will allow the MSI IOVAs
to be transparently allocated on MSI controller's compose().
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Using has_capability() rather than ns_capable(), we're no longer using
this header.
Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised. ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.
Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:
ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!
The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Abstract access to mdev_device so that we can define which interfaces
are public rather than relying on comments in the structure.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Rather than hoping for good behavior by marking some elements
internal, enforce it by making the entire structure private and
creating an accessor function for the one useful external field.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Add an mdev_ prefix so we're not poluting the namespace so much.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Using the mtty mdev sample driver we can generate a remove race by
starting one shell that continuously creates mtty devices and several
other shells all attempting to remove devices, in my case four remove
shells. The fault occurs in mdev_remove_sysfs_files() where the
passed type arg is NULL, which suggests we've received a struct device
in mdev_device_remove() but it's in some sort of teardown state. The
solution here is to make use of the accidentally unused list_head on
the mdev_device such that the mdev core keeps a list of all the mdev
devices. This allows us to validate that we have a valid mdev before
we start removal, remove it from the list to prevent others from
working on it, and if the vendor driver refuses to remove, we can
re-add it to the list.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As part of the mdev support, type1 now gets a task reference per
vfio_dma and uses that to get an mm reference for the task while
working on accounting. That's correct, but it's not fast. For some
paths, like vfio_pin_pages_remote(), we know we're only called from
user context, so we can restore the lighter weight calls. In other
cases, we're effectively already testing whether we're in the stored
task context elsewhere, extend this vfio_lock_acct() as well.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>