Commit Graph

225 Commits

Author SHA1 Message Date
Feng Wu 6d7425f109 vfio: Register/unregister irq_bypass_producer
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:50 +02:00
Alex Williamson 4bc94d5dc9 vfio: Fix lockdep issue
When we open a device file descriptor, we currently have the
following:

vfio_group_get_device_fd()
  mutex_lock(&group->device_lock);
    open()
    ...
    if (ret)
      release()

If we hit that error case, we call the backend driver release path,
which for vfio-pci looks like this:

vfio_pci_release()
  vfio_pci_disable()
    vfio_pci_try_bus_reset()
      vfio_pci_get_devs()
        vfio_device_get_from_dev()
          vfio_group_get_device()
            mutex_lock(&group->device_lock);

Whoops, we've stumbled back onto group.device_lock and created a
deadlock.  There's a low likelihood of ever seeing this play out, but
obviously it needs to be fixed.  To do that we can use a reference to
the vfio_device for vfio_group_get_device_fd() rather than holding the
lock.  There was a loop in this function, theoretically allowing
multiple devices with the same name, but in practice we don't expect
such a thing to happen and the code is already aborting from the loop
with break on any sort of error rather than continuing and only
parsing the first match anyway, so the loop was effectively unused
already.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 20f300175a ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
Reported-by: Joerg Roedel <joro@8bytes.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
2015-07-24 15:14:04 -06:00
Linus Torvalds b779157dd3 VFIO updates for v4.2
- Fix race with device reference versus driver release (Alex Williamson)
 - Add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)
 - Enable vfio-platform for ARM64 (Eric Auger)
 - Tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVjaiUAAoJECObm247sIsiRJQP/RiqcYg9uDGbBt8Gi+Qi2BmG
 KrVqG/ty18CZgyis01GZa5O8bL2ViQFAwfpnIJ6l5v4IQAYUUPlfy6Gy9ZgotQir
 U5ot4ABT9KGqJyjkDixvfUmg5M1gj1euO/xC+KVMpnxN0tC1wbymtAahzd3/RXyq
 5cef/cran5bwnh2RtAN/rFkbjZBuqppVZJJV7uyOd69HXhktrZw5skQssL/61+vT
 znww9H6WT/Qk60haQrcD+SVqAaPv6Tx8bu1LVZ1Zc/rWVxJ2HisOp5dTxGeRnaSu
 PtMfOndgtYwe5fBaYyNjWluJLpBzY7b5UvHtsZ52hPXvlgkKSMrfCJxM4BQQSekW
 txmeV6LpQsm+Nq0+sIOFph8h/LzZaebvMsc5YnLamixGb/mAxbbr31yGjokTuS4F
 ndKUR+9/mFrdcqeBT4UD1Yt45YHl7bRZ9AWKJIC1TstrTW+1rEZ49r71PLea6y4f
 SJJ+eCvhYr/eebvrsfePqzeo1X+S8HNTgCXrSlqZdohS21qtnsh4g/CaOxI2UVt8
 zaZosxNQLzpNK/Z8s7MZpCMenWZsq2frcckROeEgNKdPntwkbVT4MG3LBrWvtuBU
 HGAKfWCGAjfIIN8ZBUSFtIqua/kXCcec99CcYYQOtv16UbmhadAPoUlMur3HI9T9
 S0vWtIKTsRa8ymyrGBDp
 =/gWA
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - fix race with device reference versus driver release (Alex Williamson)

 - add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)

 - enable vfio-platform for ARM64 (Eric Auger)

 - tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)

* tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio:
  MAINTAINERS: Add vfio-platform sub-maintainer
  VFIO: platform: enable ARM64 build
  VFIO: platform: Calxeda xgmac reset module
  VFIO: platform: populate the reset function on probe
  VFIO: platform: add reset callback
  VFIO: platform: add reset struct and lookup table
  vfio/pci: Fix racy vfio_device_get_from_dev() call
2015-06-28 12:32:13 -07:00
Linus Torvalds 08d183e3c1 powerpc updates for 4.2
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
    toolchain.
  - EEH fixes from Gavin & Richard.
  - Enable the sys_kcmp syscall from Laurent.
  - Sysfs control for fastsleep workaround from Shreyas.
  - Expose OPAL events as an irq chip by Alistair.
  - MSI ops moved to pci_controller_ops by Daniel.
  - Fix for kernel to userspace backtraces for perf from Anton.
  - Merge pseries and pseries_le defconfigs from Cyril.
  - CXL in-kernel API from Mikey.
  - OPAL prd driver from Jeremy.
  - Fix for DSCR handling & tests from Anshuman.
  - Powernv flash mtd driver from Cyril.
  - Dynamic DMA Window support on powernv from Alexey.
  - LLVM clang fixes & workarounds from Anton.
  - Reworked version of the patch to abort syscalls when transactional.
  - Fix the swap encoding to support 4TB, from Aneesh.
  - Various fixes as usual.
  - Freescale updates from Scott: Highlights include more 8xx optimizations, an
    e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
    various fixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
 vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
 uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
 Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
 RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
 r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
 lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
 9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
 oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
 PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
 Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
 WID2bc5yVtIL6p6x5ICH
 =d9Wh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - disable the 32-bit vdso when building LE, so we can build with a
   64-bit only toolchain.

 - EEH fixes from Gavin & Richard.

 - enable the sys_kcmp syscall from Laurent.

 - sysfs control for fastsleep workaround from Shreyas.

 - expose OPAL events as an irq chip by Alistair.

 - MSI ops moved to pci_controller_ops by Daniel.

 - fix for kernel to userspace backtraces for perf from Anton.

 - merge pseries and pseries_le defconfigs from Cyril.

 - CXL in-kernel API from Mikey.

 - OPAL prd driver from Jeremy.

 - fix for DSCR handling & tests from Anshuman.

 - Powernv flash mtd driver from Cyril.

 - dynamic DMA Window support on powernv from Alexey.

 - LLVM clang fixes & workarounds from Anton.

 - reworked version of the patch to abort syscalls when transactional.

 - fix the swap encoding to support 4TB, from Aneesh.

 - various fixes as usual.

 - Freescale updates from Scott: Highlights include more 8xx
   optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
   t1024/t1023 support, and various fixes and cleanup.

* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
  cxl: Fix typo in debug print
  cxl: Add CXL_KERNEL_API config option
  powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
  powerpc/mm: Change the swap encoding in pte.
  powerpc/mm: PTE_RPN_MAX is not used, remove the same
  powerpc/tm: Abort syscalls in active transactions
  powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
  powerpc/include: Add opal-prd to installed uapi headers
  powerpc/powernv: fix construction of opal PRD messages
  powerpc/powernv: Increase opal-irqchip initcall priority
  powerpc: Make doorbell check preemption safe
  powerpc/powernv: pnv_init_idle_states() should only run on powernv
  macintosh/nvram: Remove as unused
  powerpc: Don't use gcc specific options on clang
  powerpc: Don't use -mno-strict-align on clang
  powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
  powerpc: Only use -mabi=altivec if toolchain supports it
  powerpc: Fix duplicate const clang warning in user access code
  vfio: powerpc/spapr: Support Dynamic DMA windows
  vfio: powerpc/spapr: Register memory and define IOMMU v2
  ...
2015-06-24 08:46:32 -07:00
Eric Auger e6bcd47f8f VFIO: platform: enable ARM64 build
This patch enables building VFIO platform and derivatives on ARM64.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:47 -06:00
Eric Auger 713cc334a6 VFIO: platform: Calxeda xgmac reset module
This patch introduces a module that registers and implements a basic
reset function for the Calxeda xgmac device. This latter basically disables
interrupts and stops DMA transfers.

The reset function code is inherited from the native calxeda xgmac driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:38 -06:00
Eric Auger 3eeb0d510c VFIO: platform: populate the reset function on probe
The reset function lookup happens on vfio-platform probe. The reset
module load is requested  and a reference to the function symbol is
hold. The reference is released on vfio-platform remove.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:30 -06:00
Eric Auger 813ae66008 VFIO: platform: add reset callback
A new reset callback is introduced. If this callback is populated,
the reset is invoked on device first open/last close or upon userspace
ioctl.  The modality is exposed on VFIO_DEVICE_GET_INFO.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:22 -06:00
Eric Auger 9f85d8f9fa VFIO: platform: add reset struct and lookup table
This patch introduces the vfio_platform_reset_combo struct that
stores all the information useful to handle the reset modality:
compat string, name of the reset function, name of the module that
implements the reset function. A lookup table of such structures
is added, currently void.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-17 15:39:09 -06:00
Alexey Kardashevskiy e633bc86a9 vfio: powerpc/spapr: Support Dynamic DMA windows
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.

This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.

This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.

DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 2157e7b82f vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 46d3e1e162 vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.

This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:54 +10:00
Alexey Kardashevskiy 4793d65d1a vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:52 +10:00
Alexey Kardashevskiy 05c6cfb9dc powerpc/iommu/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy f87a88642e vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.

At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.

The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.

As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:47 +10:00
Alexey Kardashevskiy 0eaf4defc7 powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.

This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.

This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.

This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:15 +10:00
Alexey Kardashevskiy b348aa6529 powerpc/spapr: vfio: Replace iommu_table with iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.

This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;

This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.

For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.

For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.

For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.

This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:57 +10:00
Alexey Kardashevskiy 22af48596e vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 649354b75d vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make next patches simpler.

New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.

As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.

This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.

This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 3c56e822f8 vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 2d270df8f7 vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

This reworks debug messages to show the current value and the limit.

This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does not have an effect
now but it will with the multiple tables per container as then we will
allow attaching/detaching groups on fly and we may end up having
a container with no group attached but with the counter incremented.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
pid of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy 00663d4ee0 vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct
as page size can differ.

This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy e432bc7e15 vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page
This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy 9b14a1ff86 vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
to as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alex Williamson 20f300175a vfio/pci: Fix racy vfio_device_get_from_dev() call
Testing the driver for a PCI device is racy, it can be all but
complete in the release path and still report the driver as ours.
Therefore we can't trust drvdata to be valid.  This race can sometimes
be seen when one port of a multifunction device is being unbound from
the vfio-pci driver while another function is being released by the
user and attempting a bus reset.  The device in the remove path is
found as a dependent device for the bus reset of the release path
device, the driver is still set to vfio-pci, but the drvdata has
already been cleared, resulting in a null pointer dereference.

To resolve this, fix vfio_device_get_from_dev() to not take the
dev_get_drvdata() shortcut and instead traverse through the
iommu_group, vfio_group, vfio_device path to get a reference we
can trust.  Once we have that reference, we know the device isn't
in transition and we can test to make sure the driver is still what
we expect, so that we don't interfere with devices we don't own.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-09 10:08:57 -06:00
Will Deacon 8a0a01bff8 drivers/vfio: Allow type-1 IOMMU instantiation on top of an ARM SMMUv3
The ARM SMMUv3 driver is compatible with the notion of a type-1 IOMMU in
VFIO.

This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU_V3=y.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2015-05-29 11:12:40 +02:00
Gavin Shan 68cbbc3a9d drivers/vfio: Support EEH error injection
The patch adds one more EEH sub-command (VFIO_EEH_PE_INJECT_ERR)
to inject the specified EEH error, which is represented by
(struct vfio_eeh_pe_err), to the indicated PE for testing purpose.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-12 20:33:35 +10:00
Alex Williamson db7d4d7f40 vfio: Fix runaway interruptible timeout
Commit 13060b64b8 ("vfio: Add and use device request op for vfio
bus drivers") incorrectly makes use of an interruptible timeout.
When interrupted, the signal remains pending resulting in subsequent
timeouts occurring instantly.  This makes the loop spin at a much
higher rate than intended.

Instead of making this completely non-interruptible, we can change
this into a sort of interruptible-once behavior and use the "once"
to log debug information.  The driver API doesn't allow us to abort
and return an error code.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 13060b64b8
Cc: stable@vger.kernel.org # v4.0
2015-05-01 16:31:41 -06:00
Alex Williamson 5f55d2ae69 vfio-pci: Log device requests more verbosely
Log some clues indicating whether the user is receiving device
request interfaces or not listening.  This can help indicate why a
driver unbind is blocked or explain why QEMU automatically unplugged
a device from the VM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-05-01 14:00:53 -06:00
Alex Williamson 5a0ff17741 vfio-pci: Fix use after free
Reported by 0-day test infrastructure.

Fixes: ecaa1f6a01 ("vfio-pci: Add VGA arbiter client")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-08 08:11:51 -06:00
Alex Williamson 6eb7018705 vfio-pci: Move idle devices to D3hot power state
We can save some power by putting devices that are bound to vfio-pci
but not in use by the user in the D3hot power state.  Devices get
woken into D0 when opened by the user.  Resets return the device to
D0, so we need to re-apply the low power state after a bus reset.
It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it.

A new module parameter allows this feature to be disabled if there are
devices that misbehave as a result of this change.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:46 -06:00
Alex Williamson 561d72ddbb vfio-pci: Remove warning if try-reset fails
As indicated in the comment, this is not entirely uncommon and
causes user concern for no reason.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:44 -06:00
Alex Williamson 80c7e8cc2a vfio-pci: Allow PCI IDs to be specified as module options
This copies the same support from pci-stub for exactly the same
purpose, enabling a set of PCI IDs to be automatically added to the
driver's dynamic ID table at module load time.  The code here is
pretty simple and both vfio-pci and pci-stub are fairly unique in
being meta drivers, capable of attaching to any device, so there's no
attempt made to generalize the code into pci-core.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:43 -06:00
Alex Williamson ecaa1f6a01 vfio-pci: Add VGA arbiter client
If VFIO VGA access is disabled for the user, either by CONFIG option
or module parameter, we can often opt-out of VGA arbitration.  We can
do this when PCI bridge control of VGA routing is possible.  This
means that we must have a parent bridge and there must only be a
single VGA device below that bridge.  Fortunately this is the typical
case for discrete GPUs.

Doing this allows us to minimize the impact of additional GPUs, in
terms of VGA arbitration, when they are only used via vfio-pci for
non-VGA applications.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:41 -06:00
Alex Williamson 88c0dead9f vfio-pci: Add module option to disable VGA region access
Add a module option so that we don't require a CONFIG change and
kernel rebuild to disable VGA support.  Not only can VGA support be
troublesome in itself, but by disabling it we can reduce the impact
to host devices by doing a VGA arbitration opt-out.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:40 -06:00
Alex Williamson 71be3423a6 vfio: Split virqfd into a separate module for vfio bus drivers
An unintended consequence of commit 42ac9bd18d ("vfio: initialize
the virqfd workqueue in VFIO generic code") is that the vfio module
is renamed to vfio_core so that it can include both vfio and virqfd.
That's a user visible change that may break module loading scritps
and it imposes eventfd support as a dependency on the core vfio code,
which it's really not.  virqfd is intended to be provided as a service
to vfio bus drivers, so instead of wrapping it into vfio.ko, we can
make it a stand-alone module toggled by vfio bus drivers.  This has
the additional benefit of removing initialization and exit from the
core vfio code.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-17 08:33:38 -06:00
kbuild test robot 66fdc052d7 vfio: virqfd_lock can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-17 08:20:31 -06:00
Zhen Lei 2f51bf4be9 vfio: put off the allocation of "minor" in vfio_create_group
The next code fragment "list_for_each_entry" is not depend on "minor". With this
patch, the free of "minor" in "list_for_each_entry" can be reduced, and there is
no functional change.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:56 -06:00
Antonios Motakis a7fa7c77cf vfio/platform: implement IRQ masking/unmasking via an eventfd
With this patch the VFIO user will be able to set an eventfd that can be
used in order to mask and unmask IRQs of platform devices.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:55 -06:00
Antonios Motakis 42ac9bd18d vfio: initialize the virqfd workqueue in VFIO generic code
Now we have finally completely decoupled virqfd from VFIO_PCI. We can
initialize it from the VFIO generic code, in order to safely use it from
multiple independent VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:54 -06:00
Antonios Motakis 7e992d6927 vfio: move eventfd support code for VFIO_PCI to a separate file
The virqfd functionality that is used by VFIO_PCI to implement interrupt
masking and unmasking via an eventfd, is generic enough and can be reused
by another driver. Move it to a separate file in order to allow the code
to be shared.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:54 -06:00
Antonios Motakis 09bbcb8810 vfio: pass an opaque pointer on virqfd initialization
VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler
that implements masking/unmasking of IRQs via an eventfd. We can replace
it in the virqfd infrastructure with an opaque type so we can make use
of the mechanism from other VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:53 -06:00
Antonios Motakis 9269c393e7 vfio: add local lock for virqfd instead of depending on VFIO PCI
The Virqfd code needs to keep accesses to any struct *virqfd safe, but
this comes into play only when creating or destroying eventfds, so sharing
the same spinlock with the VFIO bus driver is not necessary.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:52 -06:00
Antonios Motakis bb78e9eaab vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not really
PCI specific, since we plan to reuse the virqfd code with more VFIO drivers
in addition to VFIO_PCI.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
from "vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:52 -06:00
Antonios Motakis bdc5e1021b vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:51 -06:00
Antonios Motakis 06211b40ce vfio/platform: support for level sensitive interrupts
Level sensitive interrupts are exposed as maskable and automasked
interrupts and are masked and disabled automatically when they fire.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move masked interrupt initialization from "vfio/platform:
trigger an interrupt via eventfd"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:50 -06:00
Antonios Motakis 57f972e2b3 vfio/platform: trigger an interrupt via eventfd
This patch allows to set an eventfd for a platform device's interrupt,
and also to trigger the interrupt eventfd from userspace for testing.
Level sensitive interrupts are marked as maskable and are handled in
a later patch. Edge triggered interrupts are not advertised as maskable
and are implemented here using a simple and efficient IRQ handler.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: fix masked interrupt initialization]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:50 -06:00
Antonios Motakis 9a36321c8d vfio/platform: initial interrupts support code
This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which
most IRQ functionality is implemented in VFIO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:49 -06:00
Antonios Motakis 682704c41e vfio/platform: return IRQ info
Return information for the interrupts exposed by the device.
This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
and enables VFIO_DEVICE_GET_IRQ_INFO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:48 -06:00
Antonios Motakis fad4d5b1f0 vfio/platform: support MMAP of MMIO regions
Allow to memory map the MMIO regions of the device so userspace can
directly access them. PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:48 -06:00
Antonios Motakis 6e3f264560 vfio/platform: read and write support for the device fd
VFIO returns a file descriptor which we can use to manipulate the memory
regions of the device. Usually, the user will mmap memory regions that are
addressable on page boundaries, however for memory regions where this is
not the case we cannot provide mmap functionality due to security concerns.
For this reason we also allow to use read and write functions to the file
descriptor pointing to the memory regions.

We implement this functionality only for MMIO regions of platform devices;
PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:47 -06:00
Antonios Motakis e8909e67ca vfio/platform: return info for device memory mapped IO regions
This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call,
which allows the user to learn about the available MMIO resources of
a device.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:46 -06:00
Antonios Motakis 2e8567bbb5 vfio/platform: return info for bound device
A VFIO userspace driver will start by opening the VFIO device
that corresponds to an IOMMU group, and will use the ioctl interface
to get the basic device info, such as number of memory regions and
interrupts, and their properties. This patch enables the
VFIO_DEVICE_GET_INFO ioctl call.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added include in vfio_platform_common.c]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:46 -06:00
Antonios Motakis b13329adc2 vfio: amba: add the VFIO for AMBA devices module to Kconfig
Enable building the VFIO AMBA driver. VFIO_AMBA depends on VFIO_PLATFORM,
since it is sharing a portion of the code, and it is essentially implemented
as a platform device whose resources are discovered via AMBA specific APIs
in the kernel.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:45 -06:00
Antonios Motakis 36fe431f28 vfio: amba: VFIO support for AMBA devices
Add support for discovering AMBA devices with VFIO and handle them
similarly to Linux platform devices.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:44 -06:00
Antonios Motakis 5316153239 vfio: platform: add the VFIO PLATFORM module to Kconfig
Enable building the VFIO PLATFORM driver that allows to use Linux platform
devices with VFIO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:44 -06:00
Antonios Motakis 9df85aaa43 vfio: platform: probe to devices on the platform bus
Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:43 -06:00
Antonios Motakis de49fc0d99 vfio/platform: initial skeleton of VFIO support for platform devices
This patch forms the common skeleton code for platform devices support
with VFIO. This will include the core functionality of VFIO_PLATFORM,
however binding to the device and discovering the device resources will
be done with the help of a separate file where any Linux platform bus
specific code will reside.

This will allow us to implement support for also discovering AMBA devices
and their resources, but still reuse a large part of the VFIO_PLATFORM
implementation.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added includes in vfio_platform_private.h]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:42 -06:00
Alexey Kardashevskiy ec76f40070 vfio-pci: Add missing break to enable VFIO_PCI_ERR_IRQ_INDEX
This adds a missing break statement to VFIO_DEVICE_SET_IRQS handler
without which vfio_pci_set_err_trigger() would never be called.

While we are here, add another "break" to VFIO_PCI_REQ_IRQ_INDEX case
so if we add more indexes later, we won't miss it.

Fixes: 6140a8f562 ("vfio-pci: Add device request interface")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-12 09:51:38 -06:00
Alex Williamson 6140a8f562 vfio-pci: Add device request interface
Userspace can opt to receive a device request notification,
indicating that the device should be released.  This is setup
the same way as the error IRQ and also supports eventfd signaling.
Future support may forcefully remove the device from the user if
the request is ignored.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:38:14 -07:00
Alex Williamson cac80d6e38 vfio-pci: Generalize setup of simple eventfds
We want another single vector IRQ index to support signaling of
the device request to userspace.  Generalize the error reporting
IRQ index to avoid code duplication.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:37:57 -07:00
Alex Williamson 13060b64b8 vfio: Add and use device request op for vfio bus drivers
When a request is made to unbind a device from a vfio bus driver,
we need to wait for the device to become unused, ie. for userspace
to release the device.  However, we have a long standing TODO in
the code to do something proactive to make that happen.  To enable
this, we add a request callback on the vfio bus driver struct,
which is intended to signal the user through the vfio device
interface to release the device.  Instead of passively waiting for
the device to become unused, we can now pester the user to give
it up.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:37:47 -07:00
Alex Williamson 4a68810dbb vfio: Tie IOMMU group reference to vfio group
Move the iommu_group reference from the device to the vfio_group.
This ensures that the iommu_group persists as long as the vfio_group
remains.  This can be important if all of the device from an
iommu_group are removed, but we still have an outstanding vfio_group
reference; we can still walk the empty list of devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 15:05:06 -07:00
Alex Williamson 60720a0fc6 vfio: Add device tracking during unbind
There's a small window between the vfio bus driver calling
vfio_del_group_dev() and the device being completely unbound where
the vfio group appears to be non-viable.  This creates a race for
users like QEMU/KVM where the kvm-vfio module tries to get an
external reference to the group in order to match and release an
existing reference, while the device is potentially being removed
from the vfio bus driver.  If the group is momentarily non-viable,
kvm-vfio may not be able to release the group reference until VM
shutdown, making the group unusable until that point.

Bridge the gap between device removal from the group and completion
of the driver unbind by tracking it in a list.  The device is added
to the list before the bus driver reference is released and removed
using the existing unbind notifier.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 15:05:06 -07:00
Alex Williamson c5e6688752 vfio/type1: Add conditional rescheduling
IOMMU operations can be expensive and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM will vfio assigned devices can result in soft
lockups and IOMMU tracing shows that we can easily spend 80% of our
time with need-resched set.  A sprinkling of conf_resched() calls
after map and unmap calls has a very tiny affect on performance
while resulting in traces with <1% of calls overflowing into needs-
resched.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 14:19:12 -07:00
Alex Williamson babbf17609 vfio/type1: Chunk contiguous reserved/invalid page mappings
We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping.  The only difference is that we don't
do locked memory accounting for these since they're not back by RAM.

In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 10:59:16 -07:00
Alex Williamson 6fe1010d6d vfio/type1: DMA unmap chunking
When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping.  This works great when the IOMMU
supports superpages *and* they're in use.  Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.

Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region.  For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available.  Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.

Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it.  We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 10:58:56 -07:00
Wei Yang 7c2e211f3c vfio-pci: Fix the check on pci device type in vfio_pci_probe()
Current vfio-pci just supports normal pci device, so vfio_pci_probe() will
return if the pci device is not a normal device. While current code makes a
mistake. PCI_HEADER_TYPE is the offset in configuration space of the device
type, but we use this value to mask the type value.

This patch fixs this by do the check directly on the pci_dev->hdr_type.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org # v3.6+
2015-01-07 10:29:11 -07:00
Linus Torvalds cc669743a3 VFIO updates for v3.19-rc1
- s390 support (Frank Blaschka)
  - Enable iommu-type1 for ARM SMMU (Will Deacon)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUka+sAAoJECObm247sIsiiy0P/iqrQpv94Z7rUKRlV3K8YtJj
 8Oi5fLLnT2by9v5mS+KMElnQ5gLU/C5B/QGLMNrF2uQl8lguWSXJw37r7MkbIkpN
 RDLx1NhetLqbJ4CYLjyv/Jx3vl+Wr/2nNWRVIS5ajBmMjEgKVLvjYs4SaXELc3a8
 a3YzcGW10BrVFlCJgUYqYIFGS1BmKjf7fbD5YBocj8tPv6NAlCiNNYYr+0pzW8Lf
 GTi39JlZ2t06hDq33eiUkrySWNjrIBn4g4PfAl7HBAscsZKS1w18MD1qVw4UXXa2
 15+CBbsHU7ZLVo6G7vuZeJNCX9tdQ0WIZWQzHstQa914l86WYImTJ2tyH7Rn0ZcQ
 3Mu9fzef9JgjkI56ol2zDwuOs+qttOYaLWjhhHiW4jkIxdnljnesIFlvmM3XeDGz
 3Zowg09HzE3K+dt8265jVKkcNJbPLzspLvF27nPMudZNHBozcoPE0jKxU7QC3eMT
 Ij36+puQq+jccUic3Np6rxk5tzTHEat1a7w3IUwXCCUP5P5QW+kuuIjbd4hqQkHn
 VDRjnT6MWC3GguUCXR5VyO0zezpI20pTbWwE8u2qwnE349m0Eq/vxytj2lCLYLPR
 Jjtdduf1/Ppam7tATd3PwTu6KljY3dJiUUikyOc1J0KmkgkSMw+BtR6G7qytyW4q
 /fhClcsaNtxSheVh0N+b
 =1L2V
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - s390 support (Frank Blaschka)
 - Enable iommu-type1 for ARM SMMU (Will Deacon)

* tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio:
  drivers/vfio: allow type-1 IOMMU instantiation on top of an ARM SMMU
  vfio: make vfio run on s390
2014-12-17 10:44:22 -08:00
Jiang Liu 83a18912b0 PCI/MSI: Rename write_msi_msg() to pci_write_msi_msg()
Rename write_msi_msg() to pci_write_msi_msg() to mark it as PCI
specific.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:45 +01:00
Will Deacon 5e9f36c59a drivers/vfio: allow type-1 IOMMU instantiation on top of an ARM SMMU
The ARM SMMU driver is compatible with the notion of a type-1 IOMMU in
VFIO.

This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU=y.

Signed-off-by: Will Deacon <will.deacon@arm.com>
[aw: update for existing S390 patch]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-11-14 09:10:59 -07:00
Frank Blaschka 1d53a3a7d3 vfio: make vfio run on s390
add Kconfig switch to hide INTx
add Kconfig switch to let vfio announce PCI BARs are not mapable

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-11-07 09:52:22 -07:00
Linus Torvalds 23971bdfff IOMMU Updates for Linux v3.18
This pull-request includes:
 
 	* Change in the IOMMU-API to convert the former iommu_domain_capable
 	  function to just iommu_capable
 
 	* Various fixes in handling RMRR ranges for the VT-d driver (one fix
 	  requires a device driver core change which was acked
 	  by Greg KH)
 
 	* The AMD IOMMU driver now assigns and deassigns complete alias groups
 	  to fix issues with devices using the wrong PCI request-id
 
 	* MMU-401 support for the ARM SMMU driver
 
 	* Multi-master IOMMU group support for the ARM SMMU driver
 
 	* Various other small fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJUPNxYAAoJECvwRC2XARrjMwMP/RLSr+oA31rGVjLXcmcCHl7Q
 Uj7xpcnG19qB0aqNR1JeJuZNkK/tw44pE353MQPbz4N9UVUiogklGIVD1iJvFV53
 0qm84bvpDJIof4aP35B3H3Umft2USTn/lmsQg/RklQcNTW8DzNj63b8BTNR7k/GL
 G7bLg7F1BUCl0shZCCsFspOIulQPAJYN2OvHlfYBav/bfDvfouQ3lrV+loGrK44r
 F2Hmp+imXlIhUCjfbiWz6wKFxvPrxZx482vm2pXBCSnXEdW4/fz6nf9VHUK/Cfsq
 JAimY1CfiDo1aqH9/yVHUOw5SD/NYOXq6E5bFPg/WENbipbbae5cK2u6PX5MMBAn
 CG4BM8l9xicfGPqgn5YFSRY/6qC6K7NlxMnt9U8l18QIkDVDqEtUgJQISJuce7wx
 FWx6eSWaxpIe5yhq19/h2ELalUUyR/fPq+UXXjYDL1kLV/vcvC/lC3mbNAQU93zU
 WK0bG2tDg88JHavc25Ewa2aOn4BVM2BpwuLbYlgQReaEmsQRnEPgtmRNyLJHqbFE
 wwpCj8pBWdufsJWRyvpnXQ+CfA7oSz4e7hz1G+0/5uiDmagfvg16Ql5JtPmmuLUm
 Kc3dVIiG0s1ewohZIIJETGCqprQbCSqs8CCQqB6p2zDBWFKpNT7F38lm/KlehkCz
 JpAiI7Y2K9Jejp0VIPrt
 =OMOt
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU updates from Joerg Roedel:
 "This pull-request includes:

   - change in the IOMMU-API to convert the former iommu_domain_capable
     function to just iommu_capable

   - various fixes in handling RMRR ranges for the VT-d driver (one fix
     requires a device driver core change which was acked by Greg KH)

   - the AMD IOMMU driver now assigns and deassigns complete alias
     groups to fix issues with devices using the wrong PCI request-id

   - MMU-401 support for the ARM SMMU driver

   - multi-master IOMMU group support for the ARM SMMU driver

   - various other small fixes all over the place"

* tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits)
  iommu/vt-d: Work around broken RMRR firmware entries
  iommu/vt-d: Store bus information in RMRR PCI device path
  iommu/vt-d: Only remove domain when device is removed
  driver core: Add BUS_NOTIFY_REMOVED_DEVICE event
  iommu/amd: Fix devid mapping for ivrs_ioapic override
  iommu/irq_remapping: Fix the regression of hpet irq remapping
  iommu: Fix bus notifier breakage
  iommu/amd: Split init_iommu_group() from iommu_init_device()
  iommu: Rework iommu_group_get_for_pci_dev()
  iommu: Make of_device_id array const
  amd_iommu: do not dereference a NULL pointer address.
  iommu/omap: Remove omap_iommu unused owner field
  iommu: Remove iommu_domain_has_cap() API function
  IB/usnic: Convert to use new iommu_capable() API function
  vfio: Convert to use new iommu_capable() API function
  kvm: iommu: Convert to use new iommu_capable() API function
  iommu/tegra: Convert to iommu_capable() API function
  iommu/msm: Convert to iommu_capable() API function
  iommu/vt-d: Convert to iommu_capable() API function
  iommu/fsl: Convert to iommu_capable() API function
  ...
2014-10-15 07:23:49 +02:00
Linus Torvalds 27a9716bc8 VFIO updates for v3.18-rc1
- Nested IOMMU extension to type1 (Will Deacon)
  - Restore MSIx message before enabling (Gavin Shan)
  - Fix remove path locking (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUOOETAAoJECObm247sIsihDQP/jADEe9KFu4ymWu7rqi24w1L
 81hGNLXlfx2PPomluN3jENpyueo7vWdP5yZ8q/bi6oF6UbShL8Po01UKHOJzJJwW
 8GW86YcNsmPz/jl8Jcdbkex3dKvT1OzrDjFjCiKTJBHxE9nEdtWlRV8mO1pwd00t
 YFiXF8xFbkpHExMiQNU36rq/fzZCTOu4ZpCK9kDT7Sy+lsKAnGoXuM1IZK+7DGJo
 jcsMF32DVDmji6riy3uHHPc0qprP24QNVy6FfOmLEUvuOEIUOxMAYM9je9mmsHeS
 CeR/NHexr4RgYQE33jL1w8A1saT0rbu7DSKSa7OQebnY2Zte+oncLtqFZR2/Wylh
 jBU5r7P3PdxM6ykqEeC/3ytx7iFX6c7jc0SU4I5m8bFexmUQXqOko28gGIt0OL3n
 R8CmNF/MDs3gqYprhW6MvSJI1diY1+pX7pX0e7k7lDAoZ1QOjPNSGv+YOfF3H1YB
 AggIVxIKXW0T0bQ/hKcQiDKkxQ88vi1hld2LknbiBW9nMNLjNkxl2RZSGunFvWWN
 LzOYkBgR6rrTbhTvsWApsfYguYtGkgAGGJZSR1oev0BJnx4UHOfL1bykJRyUHdUd
 KDSBEni5TY65087IKD93nkyRhassszOa9XHmRDwQLxQeJCKRZi6bQRSzFZVheXIO
 O3XINOo2wNF1bIrfD/vR
 =s2+/
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - Nested IOMMU extension to type1 (Will Deacon)
 - Restore MSIx message before enabling (Gavin Shan)
 - Fix remove path locking (Alex Williamson)

* tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio:
  vfio-pci: Fix remove path locking
  drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL
  vfio/pci: Restore MSIx message prior to enabling
  PCI: Export MSI message relevant functions
  vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type
  iommu: introduce domain attribute for nesting IOMMUs
2014-10-11 06:49:24 -04:00
Alex Williamson 93899a679f vfio-pci: Fix remove path locking
Locking both the remove() and release() path results in a deadlock
that should have been obvious.  To fix this we can get and hold the
vfio_device reference as we evaluate whether to do a bus/slot reset.
This will automatically block any remove() calls, allowing us to
remove the explict lock.  Fixes 61d792562b.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org	[3.17]
2014-09-29 17:18:39 -06:00
Gavin Shan 0f905ce2b5 drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL
The function should have been exported with EXPORT_SYMBOL_GPL()
as part of commit 92d18a6851 ("drivers/vfio: Fix EEH build error").

Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:31:51 -06:00
Gavin Shan b8f02af096 vfio/pci: Restore MSIx message prior to enabling
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.

Fix the problem by restoring the host cached MSI message prior to
enabling each vector.

Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:16:24 -06:00
Will Deacon f5c9ecebaf vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type
VFIO allows devices to be safely handed off to userspace by putting
them behind an IOMMU configured to ensure DMA and interrupt isolation.
This enables userspace KVM clients, such as kvmtool and qemu, to further
map the device into a virtual machine.

With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU
translation services to the guest operating system, which are nested
with the existing translation installed by VFIO. However, enabling this
feature means that the IOMMU driver must be informed that the VFIO domain
is being created for the purposes of nested translation.

This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO
type-1 driver. The new IOMMU type acts identically to the
VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING
attribute on its IOMMU domains.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:06:19 -06:00
Chen, Gong 846fc70986 PCI/AER: Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND
In PCIe r1.0, sec 5.10.2, bit 0 of the Uncorrectable Error Status, Mask,
and Severity Registers was for "Training Error." In PCIe r1.1, sec 7.10.2,
bit 0 was redefined to be "Undefined."

Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND to reflect this change.

No functional change.

[bhelgaas: changelog]
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-09-25 09:42:40 -06:00
Joerg Roedel eb165f0584 vfio: Convert to use new iommu_capable() API function
Cc: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2014-09-25 15:47:45 +02:00
Alexey Kardashevskiy 9b936c960f drivers/vfio: Enable VFIO if EEH is not supported
The existing vfio_pci_open() fails upon error returned from
vfio_spapr_pci_eeh_open(), which breaks POWER7's P5IOC2 PHB
support which this patch brings back.

The patch fixes the issue by dropping the return value of
vfio_spapr_pci_eeh_open().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:39:16 -06:00
Alexey Kardashevskiy 89a2edd62f drivers/vfio: Allow EEH to be built as module
This adds necessary declarations to the SPAPR VFIO EEH module,
otherwise multiple dynamic linker errors reported:

vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0)
vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0)
vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0)
vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0)
vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0)

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:37:51 -06:00
Gavin Shan 92d18a6851 drivers/vfio: Fix EEH build error
The VFIO related components could be built as dynamic modules.
Unfortunately, CONFIG_EEH can't be configured to "m". The patch
fixes the build errors when configuring VFIO related components
as dynamic modules as follows:

  CC [M]  drivers/vfio/vfio_iommu_spapr_tce.o
In file included from drivers/vfio/vfio.c:33:0:
include/linux/vfio.h:101:43: warning: ‘struct pci_dev’ declared \
inside parameter list [enabled by default]
   :
  WRAP    arch/powerpc/boot/zImage.pseries
  WRAP    arch/powerpc/boot/zImage.maple
  WRAP    arch/powerpc/boot/zImage.pmac
  WRAP    arch/powerpc/boot/zImage.epapr
  MODPOST 1818 modules
ERROR: ".vfio_spapr_iommu_eeh_ioctl" [drivers/vfio/vfio_iommu_spapr_tce.ko]\
undefined!
ERROR: ".vfio_spapr_pci_eeh_open" [drivers/vfio/pci/vfio-pci.ko] undefined!
ERROR: ".vfio_spapr_pci_eeh_release" [drivers/vfio/pci/vfio-pci.ko] undefined!

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:36:20 -06:00
Alex Williamson bc4fba7712 vfio-pci: Attempt bus/slot reset on release
Each time a device is released, mark whether a local reset was
successful or whether a bus/slot reset is needed.  If a reset is
needed and all of the affected devices are bound to vfio-pci and
unused, allow the reset.  This is most useful when the userspace
driver is killed and releases all the devices in an unclean state,
such as when a QEMU VM quits.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:07 -06:00
Alex Williamson 61d792562b vfio-pci: Use mutex around open, release, and remove
Serializing open/release allows us to fix a refcnt error if we fail
to enable the device and lets us prevent devices from being unbound
or opened, giving us an opportunity to do bus resets on release.  No
restriction added to serialize binding devices to vfio-pci while the
mutex is held though.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:04 -06:00
Alex Williamson 9c22e660ce vfio-pci: Release devices with BusMaster disabled
Our current open/release path looks like this:

vfio_pci_open
  vfio_pci_enable
    pci_enable_device
    pci_save_state
    pci_store_saved_state

vfio_pci_release
  vfio_pci_disable
    pci_disable_device
    pci_restore_state

pci_enable_device() doesn't modify PCI_COMMAND_MASTER, so if a device
comes to us with it enabled, it persists through the open and gets
stored as part of the device saved state.  We then restore that saved
state when released, which can allow the device to attempt to continue
to do DMA.  When the group is disconnected from the domain, this will
get caught by the IOMMU, but if there are other devices in the group,
the device may continue running and interfere with the user.  Even in
the former case, IOMMUs don't necessarily behave well and a stream of
blocked DMA can result in unpleasant behavior on the host.

Explicitly disable Bus Master as we're enabling the device and
slightly re-work release to make sure that pci_disable_device() is
the last thing that touches the device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:02 -06:00
Gavin Shan 1b69be5e8a drivers/vfio: EEH support for VFIO PCI device
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-05 15:28:48 +10:00
Linus Torvalds a68a7509d3 A handful of VFIO bug fixes for v3.16
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTkiFBAAoJECObm247sIsiymsQALDs03j1lm1ZUx3peI8JuEGp
 +4WrflS4eKBX2r+XOSqHQNlg5ceKpj1zsEMKXkLFBCwTEZXRz9tK7bu6hsRuE8XR
 j67JkBm4alsJQrJIQl0+dYJMFPVGWBNyoBc4kV2YbernBnFPc/5oOmP9wAJD80OI
 HHNQxOdfce+YZnkgltc/GkOWdG/YvuQbkyt2YmGtTePapzboj6Tf0yvkqRDaL0Bf
 07+vYi6ZOSzx/Niw/ak09hll4OhYfgiOQ/t6LETmLi7PEGR88TYO/kqdzGmlnx6V
 XsDZ+WORWjtORGnyJI4p7iVFiP7mz4mmwUptfJcA0AX98csEdYHt6H4mlVkN7Gmd
 rSvarKQ7On8MRbymQIVw5w3V78X/HiWUsUi3Fef5xytQyuD4VxylS0HDtPM/AyeN
 3zEJKOI+CIkFLZ6fReL8LKZuaRdQTbJvjxMVvmTr2XrBxZTlk91CsqT3JFXjJ6CV
 2QEv2eVmsCkPtScWNZzaeQsgH75Hc8vznZxA+/BFDoeR2iq31BPE9iuYDpaVZS58
 mnAEll4tPMIqcqQvxwoB8B8kyzX636trwW+585jr3gKRS8V6qHaWkXz06N5kzz7s
 dhRUy3rKe1SeutVo0Vs067Ym60yUlvuTeWTppuOnfRpgIpbCLeMwuiA22Wq3hBIF
 loXH9eAZOuwzBg1hqVWJ
 =aR4I
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio into next

Pull VFIO updates from Alex Williamson:
 "A handful of VFIO bug fixes for v3.16"

* tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio:
  drivers/vfio/pci: Fix wrong MSI interrupt count
  drivers/vfio: Rework offsetofend()
  vfio/iommu_type1: Avoid overflow
  vfio/pci: Fix unchecked return value
  vfio/pci: Fix sizing of DPA and THP express capabilities
2014-06-07 20:12:15 -07:00
Gavin Shan fd49c81f08 drivers/vfio/pci: Fix wrong MSI interrupt count
According PCI local bus specification, the register of Message
Control for MSI (offset: 2, length: 2) has bit#0 to enable or
disable MSI logic and it shouldn't be part contributing to the
calculation of MSI interrupt count. The patch fixes the issue.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:54 -06:00
Alex Williamson c8dbca165b vfio/iommu_type1: Avoid overflow
Coverity reports use of a tained scalar used as a loop boundary.
For the most part, any values passed from userspace for a DMA mapping
size, IOVA, or virtual address are valid, with some alignment
constraints.  The size is ultimately bound by how many pages the user
is able to lock, IOVA is tested by the IOMMU driver when doing a map,
and the virtual address needs to pass get_user_pages.  The only
problem I can find is that we do expect the __u64 user values to fit
within our variables, which might not happen on 32bit platforms.  Add
a test for this and return error on overflow.  Also propagate use of
the type-correct local variables throughout the function.

The above also points to the 'end' variable, which can be zero if
we're operating at the very top of the address space.  We try to
account for this, but our loop botches it.  Rework the loop to use
the remaining size as our loop condition rather than the IOVA vs end.

Detected by Coverity: CID 714659

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:54 -06:00
Alex Williamson eb5685f0de vfio/pci: Fix unchecked return value
There's nothing we can do different if pci_load_and_free_saved_state()
fails, other than maybe print some log message, but the actual re-load
of the state is an unnecessary step here since we've only just saved
it.  We can cleanup a coverity warning and eliminate the unnecessary
step by freeing the state ourselves.

Detected by Coverity: CID 753101

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:53 -06:00
Alex Williamson afa63252b2 vfio/pci: Fix sizing of DPA and THP express capabilities
When sizing the TPH capability we store the register containing the
table size into the 'dword' variable, but then use the uninitialized
'byte' variable to analyze the size.  The table size is also actually
reported as an N-1 value, so correct sizing to account for this.

The round_up() for both TPH and DPA is unnecessary, remove it.

Detected by Coverity: CID 714665 & 715156

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 10:50:31 -06:00
Jean Delvare 8283b4919e driver core: dev_set_drvdata can no longer fail
So there is no point in checking its return value, which will soon
disappear.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-27 13:40:51 -07:00
Linus Torvalds d0cb5f71c5 VFIO updates for v3.15 include:
- Allow the vfio-type1 IOMMU to support multiple domains within a container
 - Plumb path to query whether all domains are cache-coherent
 - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
 - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
 
 The first patch also makes the vfio-type1 IOMMU driver completely independent
 of the bus_type of the devices it's handling, which enables it to be used for
 both vfio-pci and a future vfio-platform (and hopefully combinations involving
 both simultaneously).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTPbikAAoJECObm247sIsiG9cP/jnurW84YuAHmybzy3R4nMaa
 8BWcQST+TyQ78GWvubFDcRt+vHmJUI4iFWPBIm1twBSVrMy6F1GcT/spmSne1Dfb
 OvcfGEy59tGO+BeklHg5Qq2Hj1UzfeV9uQoy+PrpTF/sYzsyL6g5+O3Zv39SNvr0
 zCwnO2JAKXIpQlkK3wVha6V13X07Z/+d2b4JSsvmON2cnlPzOUYNrgBvoSeV2IQe
 3kMAU9qfAfQlpNDhRRZL/HshgWazCg4XZUp9UdNjUkOxrUp4vXHrVTUJnUGBDytk
 8V9UGRKF3mik9PqpZJk4jLV5urgUVpnUR5747uqs0KF+9GWxClXvh3gp1XX8Zn7f
 NCfDSMn/wrDr6fQBglKUadITaDhF49KV+J0cX083q46BbYxhAfgxv1I9UOvLreG6
 SGAezlTB0mefa1BmSwJdfKwDBsuDOhbF0TsHC6zT7yFc8pqe3hl/vQzUyu3HtwuM
 yPycQiQQmvOaymaiiBRwyLocwJbDNKPigDomv8NZ8Mcd97Fu9YsF4ydaxMJmmeKf
 TffYS4z4tUElYiU2vv1ipB8T+ReCmdtj2cIvyfwTL9jycY9ouO4bJ56dOGWd3WNl
 m7AVokbrrJQ03itUpFMuYjHAzTNUlXVvXlGr9n6L4VTR0RTL1zwJVrMVDsXdhxm4
 XgDbVB5PrxinDAC1vJvI
 =mnzO
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 "VFIO updates for v3.15 include:

   - Allow the vfio-type1 IOMMU to support multiple domains within a
     container
   - Plumb path to query whether all domains are cache-coherent
   - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
   - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)

  The first patch also makes the vfio-type1 IOMMU driver completely
  independent of the bus_type of the devices it's handling, which
  enables it to be used for both vfio-pci and a future vfio-platform
  (and hopefully combinations involving both simultaneously)"

* tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: always select ANON_INODES
  kvm/vfio: Support for DMA coherent IOMMUs
  vfio: Add external user check extension interface
  vfio/type1: Add extension to test DMA cache coherence of IOMMU
  vfio/iommu_type1: Multi-IOMMU domain support
2014-04-03 14:05:02 -07:00
Linus Torvalds 4b1779c2cf PCI changes for the v3.15 merge window:
Enumeration
     - Increment max correctly in pci_scan_bridge() (Andreas Noever)
     - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
     - Assign CardBus bus number only during the second pass (Andreas Noever)
     - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
     - Make sure bus number resources stay within their parents bounds (Andreas Noever)
     - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
     - Check for child busses which use more bus numbers than allocated (Andreas Noever)
     - Don't scan random busses in pci_scan_bridge() (Andreas Noever)
     - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
     - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
     - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
     - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
     - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
 
   NUMA
     - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
     - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
     - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
     - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
     - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
     - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
     - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
     - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
 
   Resource management
     - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
     - Add resource_contains() (Bjorn Helgaas)
     - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
     - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
     - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
     - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
     - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
     - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
     - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
     - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
     - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
     - s390: Use generic pci_enable_resources() (Bjorn Helgaas)
     - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
     - Set type in __request_region() (Bjorn Helgaas)
     - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
     - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
     - Log IDE resource quirk in dmesg (Bjorn Helgaas)
     - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
 
   PCI device hotplug
     - Make check_link_active() non-static (Rajat Jain)
     - Use link change notifications for hot-plug and removal (Rajat Jain)
     - Enable link state change notifications (Rajat Jain)
     - Don't disable the link permanently during removal (Rajat Jain)
     - Don't check adapter or latch status while disabling (Rajat Jain)
     - Disable link notification across slot reset (Rajat Jain)
     - Ensure very fast hotplug events are also processed (Rajat Jain)
     - Add hotplug_lock to serialize hotplug events (Rajat Jain)
     - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
     - Don't turn slot off when hot-added device already exists (Yijing Wang)
 
   MSI
     - Keep pci_enable_msi() documentation (Alexander Gordeev)
     - ahci: Fix broken single MSI fallback (Alexander Gordeev)
     - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
     - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
     - Fix leak of msi_attrs (Greg Kroah-Hartman)
     - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
 
   Virtualization
     - Device-specific ACS support (Alex Williamson)
 
   Freescale i.MX6
     - Wait for retraining (Marek Vasut)
 
   Marvell MVEBU
     - Use Device ID and revision from underlying endpoint (Andrew Lunn)
     - Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
     - Call request_resource() on the apertures (Jason Gunthorpe)
     - Fix potential issue in range parsing (Jean-Jacques Hiblot)
 
   Renesas R-Car
     - Check platform_get_irq() return code (Ben Dooks)
     - Add error interrupt handling (Ben Dooks)
     - Fix bridge logic configuration accesses (Ben Dooks)
     - Register each instance independently (Magnus Damm)
     - Break out window size handling (Magnus Damm)
     - Make the Kconfig dependencies more generic (Magnus Damm)
 
   Synopsys DesignWare
     - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
 
   Miscellaneous
     - Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
     - Enable INTx if BIOS left them disabled (Bjorn Helgaas)
     - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
     - Clean up par-arch object file list (Liviu Dudau)
     - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
     - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
     - Fix pci_bus_b() build failure (Paul Gortmaker)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOdAZAAoJEFmIoMA60/r8VYUQALRrReyMBk3pjRt/fKIX4Kwi
 ydSo/YJeeKTN8K93fLw8bb8bdPItJScJFTfEa4Q2SpZezR/ecGXLowisy0BBaPHK
 qtOyB8EqjkLS17GfyecIe9Nd2SIAI2De/0bchK3kDtIX1YlZB/k/tD3eCPMHDnnl
 m8c5kAHKPQYd8g01I+S8nrtGHk/A33grfYpJXPZbcqyhE0lWU3SI8KDAGbcKzNHE
 23Do0yNyd4nHIdixWlhETcNvzHn35Q/O38JJwW9Mf1aI9gusYuml6GFefCgu/iov
 lxqp3CEW7iPZgQEgNbrQ0HzWn/durL2Trd6S/Yh6f2xbm1LGYKWh3LZUFLd3AQDd
 INEpUgKsyb//nF3dtiyGnZlp0QykoqFyLo2AEDrb+ILTd4up5DeRY/m1UpjAXR5p
 QicBmrDksHrSivPmMZwLx1DFQYKjQbdx5lOqy9hQM/Jmsr+N3/l7QBrbQWXks3JZ
 NNAyn4RZHQB7UDQS/MmVPArs+JK5qaEDQD57QuOTlqgP19VY9C9E/l/aEqefjdFo
 XOAm7CwGpB/iBAkIbE6ROEDiJArigRVHEfxLYeE/jtGOdRDCD1deWk+g3S8DWD7m
 ZxWSgIVB00PMAmomczdg59YVFBhocgwPUa8/cw6yqzx2QKP4mWXIFZ/Sjau5I3tn
 WWoxXlUirZfTJc29XnVy
 =3mNS
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Enumeration
   - Increment max correctly in pci_scan_bridge() (Andreas Noever)
   - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
   - Assign CardBus bus number only during the second pass (Andreas Noever)
   - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
   - Make sure bus number resources stay within their parents bounds (Andreas Noever)
   - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
   - Check for child busses which use more bus numbers than allocated (Andreas Noever)
   - Don't scan random busses in pci_scan_bridge() (Andreas Noever)
   - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
   - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
   - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
   - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
   - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)

  NUMA
   - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
   - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
   - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
   - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
   - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
   - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
   - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
   - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)

  Resource management
   - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
   - Add resource_contains() (Bjorn Helgaas)
   - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
   - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
   - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
   - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
   - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
   - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
   - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
   - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
   - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
   - s390: Use generic pci_enable_resources() (Bjorn Helgaas)
   - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
   - Set type in __request_region() (Bjorn Helgaas)
   - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
   - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
   - Log IDE resource quirk in dmesg (Bjorn Helgaas)
   - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)

  PCI device hotplug
   - Make check_link_active() non-static (Rajat Jain)
   - Use link change notifications for hot-plug and removal (Rajat Jain)
   - Enable link state change notifications (Rajat Jain)
   - Don't disable the link permanently during removal (Rajat Jain)
   - Don't check adapter or latch status while disabling (Rajat Jain)
   - Disable link notification across slot reset (Rajat Jain)
   - Ensure very fast hotplug events are also processed (Rajat Jain)
   - Add hotplug_lock to serialize hotplug events (Rajat Jain)
   - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
   - Don't turn slot off when hot-added device already exists (Yijing Wang)

  MSI
   - Keep pci_enable_msi() documentation (Alexander Gordeev)
   - ahci: Fix broken single MSI fallback (Alexander Gordeev)
   - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
   - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
   - Fix leak of msi_attrs (Greg Kroah-Hartman)
   - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)

  Virtualization
   - Device-specific ACS support (Alex Williamson)

  Freescale i.MX6
   - Wait for retraining (Marek Vasut)

  Marvell MVEBU
   - Use Device ID and revision from underlying endpoint (Andrew Lunn)
   - Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
   - Call request_resource() on the apertures (Jason Gunthorpe)
   - Fix potential issue in range parsing (Jean-Jacques Hiblot)

  Renesas R-Car
   - Check platform_get_irq() return code (Ben Dooks)
   - Add error interrupt handling (Ben Dooks)
   - Fix bridge logic configuration accesses (Ben Dooks)
   - Register each instance independently (Magnus Damm)
   - Break out window size handling (Magnus Damm)
   - Make the Kconfig dependencies more generic (Magnus Damm)

  Synopsys DesignWare
   - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)

  Miscellaneous
   - Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
   - Enable INTx if BIOS left them disabled (Bjorn Helgaas)
   - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
   - Clean up par-arch object file list (Liviu Dudau)
   - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
   - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
   - Fix pci_bus_b() build failure (Paul Gortmaker)"

* tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits)
  Revert "[PATCH] Insert GART region into resource map"
  PCI: Log IDE resource quirk in dmesg
  PCI: Change pci_bus_alloc_resource() type_mask to unsigned long
  PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
  resources: Set type in __request_region()
  PCI: Don't check resource_size() in pci_bus_alloc_resource()
  s390/PCI: Use generic pci_enable_resources()
  tile PCI RC: Use default pcibios_enable_device()
  sparc/PCI: Use default pcibios_enable_device() (Leon only)
  sh/PCI: Use default pcibios_enable_device()
  microblaze/PCI: Use default pcibios_enable_device()
  alpha/PCI: Use default pcibios_enable_device()
  PCI: Add "weak" generic pcibios_enable_device() implementation
  PCI: Don't enable decoding if BAR hasn't been assigned an address
  PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled
  PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit
  PCI: Don't try to claim IORESOURCE_UNSET resources
  PCI: Check IORESOURCE_UNSET before updating BAR
  PCI: Don't clear IORESOURCE_UNSET when updating BAR
  PCI: Mark resources as IORESOURCE_UNSET if we can't assign them
  ...

Conflicts:
	arch/x86/include/asm/topology.h
	drivers/ata/ahci.c
2014-04-01 15:14:04 -07:00
Arnd Bergmann 4379d2ae15 vfio: always select ANON_INODES
The vfio code cannot be built when CONFIG_ANON_INODES is
disabled, so this enforces the symbol to be enabled through
Kconfig.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-03-27 11:58:58 -06:00
David Rientjes 668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Alex Williamson 88d7ab8949 vfio: Add external user check extension interface
This lets us check extensions, particularly VFIO_DMA_CC_IOMMU using
the external user interface, allowing KVM to probe IOMMU coherency.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 11:38:39 -07:00
Alex Williamson aa42931827 vfio/type1: Add extension to test DMA cache coherence of IOMMU
Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
be able to test whether coherency is currently enforced.  Add an
extension for this.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 11:38:37 -07:00
Alex Williamson 1ef3e2bc04 vfio/iommu_type1: Multi-IOMMU domain support
We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime.  For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control.  If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables.  To add to the complexity, we can't know
the properties of a domain until a device is attached.

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case.  How do we know that a page pinned in one
container is the same page as a different container and avoid double
billing the user for the page.

The solution is therefore to support multiple IOMMU domains per
container.  In the majority of cases, only one domain will be required
since hardware is typically consistent within a system.  However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.

To do this, our DMA tracking needs to change.  We currently try to
coalesce user mappings into as few tracking entries as possible.  The
problem then becomes that we lose granularity of user mappings.  We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping.  This coalescing code is therefore removed, allowing
only unmaps covering complete maps.  The change in accounting is
fairly small here, a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable if this coalescing was ever needed.

We also move IOMMU domain creation to the point where a group is
attached to the container.  An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user.  When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container.  And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
2014-02-26 11:38:36 -07:00