Add check for hv_is_hyperv_initialized() at the top of
init_hv_pci_drv(), so if the pci-hyperv driver is force-loaded on non
Hyper-V platforms, the init_hv_pci_drv() will exit immediately, without
any side effects, like assignments to hvpci_block_ops, etc.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reported-and-tested-by: Mohammad Alqayeem <mohammad.alqyeem@nutanix.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/1621984653-1210-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
With the new method of flushing/stopping the workqueue before doing bus
removal, the old mechanism of using refcount and wait for completion
is no longer needed. Remove those dead code.
Link: https://lore.kernel.org/r/1620806809-31055-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Long Li <longli@microsoft.com>
[lorenzo.pieralisi@arm.com: Reworded subject]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
On removing the device, any work item (hv_pci_devices_present() or
hv_pci_eject_device()) scheduled on workqueue hbus->wq may still be running
and race with hv_pci_remove().
This can happen because the host may send PCI_EJECT or PCI_BUS_RELATIONS(2)
and decide to rescind the channel immediately after that.
Fix this by flushing/destroying the workqueue of hbus before doing hbus remove.
Link: https://lore.kernel.org/r/1620806800-30983-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
There is not a consistent pattern for checking Hyper-V hypercall status.
Existing code uses a number of variants. The variants work, but a consistent
pattern would improve the readability of the code, and be more conformant
to what the Hyper-V TLFS says about hypercall status.
Implemented new helper functions hv_result(), hv_result_success(), and
hv_repcomp(). Changed the places where hv_do_hypercall() and related variants
are used to use the helper functions.
Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1618620183-9967-2-git-send-email-joseph.salisbury@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The Hyper-V PCI driver still makes use of a msi_controller structure,
but it looks more like a distant leftover than anything actually
useful, since it is initialised to 0 and never used for anything.
Just remove it.
Link: https://lore.kernel.org/r/20210330151145.997953-7-maz@kernel.org
Tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is
invoked via irq_chip_compose_msi_msg(), which itself is always invoked from
atomic contexts from the guts of the interrupt core code.
There is no way to change this w/o rewriting the whole driver, so use
tasklet_disable_in_atomic() which allows to make tasklet_disable()
sleepable once the remaining atomic users are addressed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210309084242.516519290@linutronix.de
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmA2xiQUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vzRDA/9GCyEskI9DMtyT9UeoTMzpHcUZpaU
eCbLa2BSPjOKlrHLnPY7IwE0nT7ihe4OOcm8uOYOWtulE46XJNCHfxlUYP3SbI0Y
JlG0FBCh4ldzCzzKsftwkSvVhk+gn+ms9ucJ8q2iBSOXVhG/41IbX7++8IfbQM4v
VHjdYUmTCCiOSRDtBVi82p4+GAHxH8IhaB0gDNb1Q7myj+qJKL5nKjK/nukgO0fO
UpCnSxyua48Ij+c59Y1QAIhGeORq5Gg5Q4ussY3FxS9ovhZODEGQwCFniTfilqRw
wEB9Fb8tiPY60ljEyDPnERMkiW69zutTJqOY4LfwmoRM9IEbxD6VPIqF5gin8sB7
pHhX4KUU+eB1hQdK9SGKjkwyehquNKzTdxsu2jccltOKwBm5jcXYeOvu2bJTzZn+
rrZPYJoA1dQig3bEuOzsBxvW4Jaj7IsVfVcao4OzXyh8Y7tLr9kVDXxr7JC/EkPM
zRK24yglERD2J1JXgNMvOuJQj6JmRHhEbV/faZci8x8ZEaz1FawRAUZqHf/gGmnW
2CllarHbRnchPyD8btv03Mp84WG6fCfKy7zG2D8HxOsiStDO/5ICehHtGcvYg7IL
RuE4Tj8OKdcbw/8cO4C3842FqiSj34+jooNIHSLyBqcpJam6VsN4XqNIZCL+DeG5
Q2JXruAaahTWOZg=
=GXL5
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Remove unnecessary locking around _OSC (Bjorn Helgaas)
- Clarify message about _OSC failure (Bjorn Helgaas)
- Remove notification of PCIe bandwidth changes (Bjorn Helgaas)
- Tidy checking of syscall user config accessors (Heiner Kallweit)
Resource management:
- Decline to resize resources if boot config must be preserved (Ard
Biesheuvel)
- Fix pci_register_io_range() memory leak (Geert Uytterhoeven)
Error handling (Keith Busch):
- Clear error status from the correct device
- Retain error recovery status so drivers can use it after reset
- Log the type of Port (Root or Switch Downstream) that we reset
- Always request a reset for Downstream Ports in frozen state
Endpoint framework and NTB (Kishon Vijay Abraham I):
- Make *_get_first_free_bar() take into account 64 bit BAR
- Add helper API to get the 'next' unreserved BAR
- Make *_free_bar() return error codes on failure
- Remove unused pci_epf_match_device()
- Add support to associate secondary EPC with EPF
- Add support in configfs to associate two EPCs with EPF
- Add pci_epc_ops to map MSI IRQ
- Add pci_epf_ops to expose function-specific attrs
- Allow user to create sub-directory of 'EPF Device' directory
- Implement ->msi_map_irq() ops for cadence
- Configure LM_EP_FUNC_CFG based on epc->function_num_map for cadence
- Add EP function driver to provide NTB functionality
- Add support for EPF PCI Non-Transparent Bridge
- Add specification for PCI NTB function device
- Add PCI endpoint NTB function user guide
- Add configfs binding documentation for pci-ntb endpoint function
Broadcom STB PCIe controller driver:
- Add support for BCM4908 and external PERST# signal controller
(Rafał Miłecki)
Cadence PCIe controller driver:
- Retrain Link to work around Gen2 training defect (Nadeem Athani)
- Fix merge botch in cdns_pcie_host_map_dma_ranges() (Krzysztof
Wilczyński)
Freescale Layerscape PCIe controller driver:
- Add LX2160A rev2 EP mode support (Hou Zhiqiang)
- Convert to builtin_platform_driver() (Michael Walle)
MediaTek PCIe controller driver:
- Fix OF node reference leak (Krzysztof Wilczyński)
Microchip PolarFlare PCIe controller driver:
- Add Microchip PolarFire PCIe controller driver (Daire McNamara)
Qualcomm PCIe controller driver:
- Use PHY_REFCLK_USE_PAD only for ipq8064 (Ansuel Smith)
- Add support for ddrss_sf_tbu clock for sm8250 (Dmitry Baryshkov)
Renesas R-Car PCIe controller driver:
- Drop PCIE_RCAR config option (Lad Prabhakar)
- Always allocate MSI addresses in 32bit space (Marek Vasut)
Rockchip PCIe controller driver:
- Add FriendlyARM NanoPi M4B DT binding (Chen-Yu Tsai)
- Make 'ep-gpios' DT property optional (Chen-Yu Tsai)
Synopsys DesignWare PCIe controller driver:
- Work around ECRC configuration hardware defect (Vidya Sagar)
- Drop support for config space in DT 'ranges' (Rob Herring)
- Change size to u64 for EP outbound iATU (Shradha Todi)
- Add upper limit address for outbound iATU (Shradha Todi)
- Make dw_pcie ops optional (Jisheng Zhang)
- Remove unnecessary dw_pcie_ops from al driver (Jisheng Zhang)
Xilinx Versal CPM PCIe controller driver:
- Fix OF node reference leak (Pan Bian)
Miscellaneous:
- Remove tango host controller driver (Arnd Bergmann)
- Remove IRQ handler & data together (altera-msi, brcmstb, dwc)
(Martin Kaiser)
- Fix xgene-msi race in installing chained IRQ handler (Martin
Kaiser)
- Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy (Junhao He)
- Fix pci-bridge-emul array overruns (Russell King)
- Remove obsolete uses of WARN_ON(in_interrupt()) (Sebastian Andrzej
Siewior)"
* tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (69 commits)
PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
PCI: qcom: Add support for ddrss_sf_tbu clock
dt-bindings: PCI: qcom: Document ddrss_sf_tbu clock for sm8250
PCI: al: Remove useless dw_pcie_ops
PCI: dwc: Don't assume the ops in dw_pcie always exist
PCI: dwc: Add upper limit address for outbound iATU
PCI: dwc: Change size to u64 for EP outbound iATU
PCI: dwc: Drop support for config space in 'ranges'
PCI: layerscape: Convert to builtin_platform_driver()
PCI: layerscape: Add LX2160A rev2 EP mode support
dt-bindings: PCI: layerscape: Add LX2160A rev2 compatible strings
PCI: dwc: Work around ECRC configuration issue
PCI/portdrv: Report reset for frozen channel
PCI/AER: Specify the type of Port that was reset
PCI/ERR: Retain status from error notification
PCI/AER: Clear AER status from Root Port when resetting Downstream Port
PCI/ERR: Clear status of the reporting device
dt-bindings: arm: rockchip: Add FriendlyARM NanoPi M4B
PCI: rockchip: Make 'ep-gpios' DT property optional
Documentation: PCI: Add PCI endpoint NTB function user guide
...
We will soon use the same structure to handle IO-APIC interrupts as
well. Introduce an enum to identify the source and a data structure for
IO-APIC RTE.
While at it, update pci-hyperv.c to use the enum.
No functional change.
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210203150435.27941-13-wei.liu@kernel.org
The enum ioapic_irq_destination_types and the enumerated constants starting
with 'dest_' are gross misnomers because they describe the delivery mode.
Rename then enum and the constants so they actually make sense.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-6-dwmw2@infradead.org
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl+FqrsTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXnN8B/4sRg7j9OTzVBlDiXF2vj6vbuplTIH6
JR6S5f4PNjUg4gV6ghzSnsx1zqNhPSOr78zDqYto8vv+wqqj3thmld8+gAnSbKtt
yoAa7mhbbN1ryJiwPlZzvX4ApzGZPC7byqEi3+zPIcag6TEl8eyYJOmvY3x1zv8x
CsAb57oCC4erD0n4xlTyfuc8TLpO+EiU53PXbR9AovKQHe4m2/8LWyEbmrm5cRUR
gx8RxoLkkrqK0unzcmanbm47QodiaOTUpycs3IvaBeWZQsqSgFZdI1RAdTZNg+U+
GT8eMRXAwpgDpilPm/0n1O0PKGAsVh9Lbw8Btb/ggqnjTUlA4Z3Df23E
=Wy5n
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull Hyper-V updates from Wei Liu:
- a series from Boqun Feng to support page size larger than 4K
- a few miscellaneous clean-ups
* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions
x86/hyperv: Remove aliases with X64 in their name
PCI: hv: Document missing hv_pci_protocol_negotiation() parameter
scsi: storvsc: Support PAGE_SIZE larger than 4K
Driver: hv: util: Use VMBUS_RING_SIZE() for ringbuffer sizes
HID: hyperv: Use VMBUS_RING_SIZE() for ringbuffer sizes
Input: hyperv-keyboard: Use VMBUS_RING_SIZE() for ringbuffer sizes
hv_netvsc: Use HV_HYP_PAGE_SIZE for Hyper-V communication
hv: hyperv.h: Introduce some hvpfn helper functions
Drivers: hv: vmbus: Move virt_to_hvpfn() to hyperv header
Drivers: hv: Use HV_HYP_PAGE in hv_synic_enable_regs()
Drivers: hv: vmbus: Introduce types of GPADL
Drivers: hv: vmbus: Move __vmbus_open()
Drivers: hv: vmbus: Always use HV_HYP_PAGE_SIZE for gpadl
drivers: hv: remove cast from hyperv_die_event
pci_restore_msi_state() directly writes the MSI/MSI-X related registers
via MMIO. On a physical machine, this works perfectly; for a Linux VM
running on a hypervisor, which typically enables IOMMU interrupt remapping,
the hypervisor usually should trap and emulate the MMIO accesses in order
to re-create the necessary interrupt remapping table entries in the IOMMU,
otherwise the interrupts can not work in the VM after hibernation.
Hyper-V is different from other hypervisors in that it does not trap and
emulate the MMIO accesses, and instead it uses a para-virtualized method,
which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
of the info that would be passed to the hypervisor in the case of the
trap-and-emulate method. This is not an issue to a lot of PCI device
drivers, which destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some PCI device
drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
otherwise the PCI device drivers can no longer receive interrupts after
the VM resumes from hibernation.
Hyper-V is also different in that chip->irq_unmask() may fail in a
Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
can not fail because unmasking an MSI/MSI-X register just means an MMIO
write): during hibernation, when a CPU is offlined, the kernel tries
to move the interrupt to the remaining CPUs that haven't been offlined
yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
because the vmbus channel has been closed: here the early "return" in
hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
desc->masked remains "true", so later after hibernation, the MSI interrupt
always remains masked, which is incorrect. Refer to cpu_disable_common()
-> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
static bool migrate_one_irq(struct irq_desc *desc)
{
...
if (maskchip && chip->irq_mask)
chip->irq_mask(d);
...
err = irq_do_set_affinity(d, affinity, false);
...
if (maskchip && chip->irq_unmask)
chip->irq_unmask(d);
Fix the issue by calling pci_msi_unmask_irq() unconditionally in
hv_irq_unmask(). Also suppress the error message for hibernation because
the hypercall failure during hibernation does not matter (at this time
all the devices have been frozen). Note: the correct affinity info is
still updated into the irqdata data structure in migrate_one_irq() ->
irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
resumes, hv_pci_restore_msi_state() is able to correctly restore
the interrupt with the correct affinity.
Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
Fixes: ac82fc8327 ("PCI: hv: Add hibernation support")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Jake Oshins <jakeo@microsoft.com>
Add missing documentation for the parameter "version" and "num_version"
of the hv_pci_protocol_negotiation() function and resolve build time
kernel-doc warnings:
drivers/pci/controller/pci-hyperv.c:2535: warning: Function parameter
or member 'version' not described in 'hv_pci_protocol_negotiation'
drivers/pci/controller/pci-hyperv.c:2535: warning: Function parameter
or member 'num_version' not described in 'hv_pci_protocol_negotiation'
No change to functionality intended.
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Link: https://lore.kernel.org/r/20200925234753.1767227-1-kw@linux.com
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
pci_msi_get_hwirq() and pci_msi_set_desc are not longer special. Enable the
generic MSI domain ops in the core and PCI MSI code unconditionally and get
rid of the x86 specific implementations in the X86 MSI code and in the
hyperv PCI driver.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200826112332.564274859@linutronix.de
Convert the interrupt remap drivers to retrieve the pci device from the msi
descriptor and use info::hwirq.
This is the first step to prepare x86 for using the generic MSI domain ops.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20200826112332.466405395@linutronix.de
sparse report build warning as follows:
drivers/pci/controller/pci-hyperv.c:941:5: warning:
symbol 'hv_read_config_block' was not declared. Should it be static?
drivers/pci/controller/pci-hyperv.c:1021:5: warning:
symbol 'hv_write_config_block' was not declared. Should it be static?
drivers/pci/controller/pci-hyperv.c:1090:5: warning:
symbol 'hv_register_block_invalidate' was not declared. Should it be static?
Those functions are not used outside of this file, so mark them static.
Link: https://lore.kernel.org/r/20200706135234.80758-1-weiyongjun1@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Kdump could fail sometime on Hyper-V guest because the retry in
hv_pci_enter_d0() releases child device structures in hv_pci_bus_exit().
Although there is a second asynchronous device relations message sending
from the host, if this message arrives to the guest after
hv_send_resource_allocated() is called, the retry would fail.
Fix the problem by moving retry to hv_pci_probe() and start the retry
from hv_pci_query_relations() call. This will cause a device relations
message to arrive to the guest synchronously; the guest would then be
able to rebuild the child device structures before calling
hv_send_resource_allocated().
Link: https://lore.kernel.org/r/20200727071731.18516-1-weh@microsoft.com
Fixes: c81992e7f4 ("PCI: hv: Retry PCI bus D0 entry on invalid device state")
Signed-off-by: Wei Hu <weh@microsoft.com>
[lorenzo.pieralisi@arm.com: fixed a comment and commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct hv_dr_state {
...
struct hv_pcidev_description func[];
};
struct pci_bus_relations {
...
struct pci_function_description func[];
} __packed;
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
So, replace the following forms:
offsetof(struct hv_dr_state, func) +
(sizeof(struct hv_pcidev_description) *
(relations->device_count))
offsetof(struct pci_bus_relations, func) +
(sizeof(struct pci_function_description) *
(bus_rel->device_count))
with:
struct_size(dr, func, relations->device_count)
and
struct_size(bus_rel, func, bus_rel->device_count)
respectively.
Link: https://lore.kernel.org/r/20200525164319.GA13596@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
When kdump is triggered, some PCI devices may have not been shut down
cleanly before the kdump kernel starts.
This causes the initial attempt to enter D0 state in the kdump kernel to
fail with invalid device state returned from Hyper-V host.
When this happens, explicitly call hv_pci_bus_exit() and retry to enter
the D0 state.
Link: https://lore.kernel.org/r/20200507050300.10974-1-weh@microsoft.com
Signed-off-by: Wei Hu <weh@microsoft.com>
[lorenzo.pieralisi@arm.com: commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
In some error cases in hv_pci_probe(), allocated resources are not freed.
Fix this by adding a field to keep track of the high water mark for slots
that have resources allocated to them. In case of an error, this high
water mark is used to know which slots have resources that must be released.
Since slots are numbered starting with zero, a value of -1 indicates no
slots have been allocated resources. There may be unused slots in the range
between slot 0 and the high water mark slot, but these slots are already
ignored by the existing code in the allocate and release loops with the call
to get_pcichild_wslot().
Link: https://lore.kernel.org/r/20200507050211.10923-1-weh@microsoft.com
Signed-off-by: Wei Hu <weh@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
The current implementation of hv_compose_msi_msg() is incompatible with
the new functionality that allows changing the vCPU a VMBus channel will
interrupt: if this function always calls hv_pci_onchannelcallback() in
the polling loop, the interrupt going to a different CPU could cause
hv_pci_onchannelcallback() to be running simultaneously in a tasklet,
which will break. The current code also has a problem in that it is not
synchronized with vmbus_reset_channel_cb(): hv_compose_msi_msg() could
be accessing the ring buffer via the call of hv_pci_onchannelcallback()
well after the time that vmbus_reset_channel_cb() has finished.
Fix these issues as follows. Disable the channel tasklet before
entering the polling loop in hv_compose_msi_msg() and re-enable it when
done. This will prevent hv_pci_onchannelcallback() from running in a
tasklet on a different CPU. Moreover, poll by always calling
hv_pci_onchannelcallback(), but check the channel callback function for
NULL and invoke the callback within a sched_lock critical section. This
will prevent hv_compose_msi_msg() from accessing the ring buffer after
vmbus_reset_channel_cb() has acquired the sched_lock spinlock.
Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Andrew Murray <amurray@thegoodpenguin.co.uk>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: <linux-pci@vger.kernel.org>
Link: https://lore.kernel.org/r/20200406001514.19876-8-parri.andrea@gmail.com
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Add a new structure (hv_msi_entry), which is also defined in the TLFS,
to describe the msi entry for HVCALL_RETARGET_INTERRUPT. The structure
is needed because its layout may be different from architecture to
architecture.
Also add a new generic interface hv_set_msi_entry_from_desc() to allow
different archs to set the msi entry from msi_desc.
No functional change, only preparation for the future support of virtual
PCI on non-x86 architectures.
Signed-off-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Currently, retarget_msi_interrupt and other structures it relys on are
defined in pci-hyperv.c. However, those structures are actually defined
in Hypervisor Top-Level Functional Specification [1] and may be
different in sizes of fields or layout from architecture to
architecture. Let's move those definitions into x86's tlfs header file
to support virtual PCI on non-x86 architectures in the future. Note that
"__packed" attribute is added to these structures during the movement
for the same reason as we use the attribute for other TLFS structures in
the header file: make sure the structures meet the specification and
avoid anything unexpected from the compilers.
Additionally, rename struct retarget_msi_interrupt to
hv_retarget_msi_interrupt for the consistent naming convention, also
mirroring the name in TLFS.
[1]: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
Signed-off-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Currently HVCALL_RETARGET_INTERRUPT and HV_PARTITION_ID_SELF are defined
in pci-hyperv.c. However, similar to other hypercall related
definitions, it makes more sense to put them in the tlfs header file.
Besides, these definitions are arch-dependent, so for the support of
virtual PCI on non-x86 archs in the future, move them into arch-specific
tlfs header file.
Signed-off-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Starting with Hyper-V PCI protocol version 1.3, the host VSP can send
PCI_BUS_RELATIONS2 and pass the vNUMA node information for devices on the
bus. The vNUMA node tells which guest NUMA node this device is on based
on guest VM configuration topology and physical device information.
Add code to negotiate v1.3 and process PCI_BUS_RELATIONS2.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
hv_dr_state is used to find present PCI devices on the bus. The structure
reuses struct pci_function_description from VSP message to describe a
device.
To prepare support for pci_function_description v2, decouple this
dependence in hv_dr_state so it can work with both v1 and v2 VSP messages.
There is no functionality change.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Now that we use kzalloc() to allocate the hbus buffer, we must call
kfree() in the error path as well to prevent memory leakage.
Fixes: 877b911a5b ("PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
In C, there is no need to cast a void * to any other pointer type,
remove an unnecessary cast.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
With the recent 59bb47985c ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)"), kzalloc() is able to allocate
a 4KB buffer that is guaranteed to be 4KB-aligned. Here the size and
alignment of hbus is important because hbus's field
retarget_msi_interrupt_params must not cross a 4KB page boundary.
Here we prefer kzalloc to get_zeroed_page(), because a buffer
allocated by the latter is not tracked and scanned by kmemleak, and
hence kmemleak reports the pointer contained in the hbus buffer
(i.e. the hpdev struct, which is created in new_pcichild_device() and
is tracked by hbus->children) as memory leak (false positive).
If the kernel doesn't have 59bb47985c, get_zeroed_page() *must* be
used to allocate the hbus buffer and we can avoid the kmemleak false
positive by using kmemleak_alloc() and kmemleak_free() to ask
kmemleak to track and scan the hbus buffer.
Reported-by: Lili Deng <v-lide@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
A VM can have multiple Hyper-V hbus. It's incorrect to set the global
variable 'pci_protocol_version' when *every* hbus is initialized in
hv_pci_protocol_negotiation(). This is not an issue in practice since
every hbus should have the same value of hbus->protocol_version, but
we should make the variable per-hbus, so in case we have busses
with different protocol versions, the driver can still work correctly.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Add suspend() and resume() functions so that Hyper-V virtual PCI devices
are handled properly when the VM hibernates and resumes from
hibernation.
Note that the suspend() function must make sure there are no pending
work items before calling vmbus_close(), since it runs in a process
context as a callback in dpm_suspend(). When it starts to run, the
channel callback hv_pci_onchannelcallback(), which runs in a tasklet
context, can be still running concurrently and scheduling new work items
onto hbus->wq in hv_pci_devices_present() and hv_pci_eject_device(), and
the work item handlers can access the vmbus channel, which can be being
closed by hv_pci_suspend(), e.g. the work item handler
pci_devices_present_work() -> new_pcichild_device() writes to the vmbus
channel.
To eliminate the race, hv_pci_suspend() disables the channel callback
tasklet, sets hbus->state to hv_pcibus_removing, and re-enables the
tasklet. This way, when hv_pci_suspend() proceeds, it knows that no new
work item can be scheduled, and then it flushes hbus->wq and safely
closes the vmbus channel.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
There is no functional change. This is just preparatory for a later
patch which adds the hibernation support for the pci-hyperv driver.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Code that iterates over all standard PCI BARs typically uses
PCI_STD_RESOURCE_END. However, that requires the unusual test
"i <= PCI_STD_RESOURCE_END" rather than something the typical
"i < PCI_STD_NUM_BARS".
Add a definition for PCI_STD_NUM_BARS and change loops to use the more
idiomatic C style to help avoid fencepost errors.
Link: https://lore.kernel.org/r/20190927234026.23342-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190927234308.23935-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190916204158.6889-3-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Sebastian Ott <sebott@linux.ibm.com> # arch/s390/
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> # video/fbdev/
Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com> # pci/controller/dwc/
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> # scsi/pm8001/
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> # scsi/pm8001/
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # memstick/
Pull networking updates from David Miller:
1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.
2) Use bio_vec in the networking instead of custom skb_frag_t, from
Matthew Wilcox.
3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.
4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.
5) Support all variants of 5750X bnxt_en chips, from Michael Chan.
6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
Buslov.
7) Add TCP syn cookies bpf helper, from Petar Penkov.
8) Add 'nettest' to selftests and use it, from David Ahern.
9) Add extack support to drop_monitor, add packet alert mode and
support for HW drops, from Ido Schimmel.
10) Add VLAN offload to stmmac, from Jose Abreu.
11) Lots of devm_platform_ioremap_resource() conversions, from
YueHaibing.
12) Add IONIC driver, from Shannon Nelson.
13) Several kTLS cleanups, from Jakub Kicinski.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
mlxsw: spectrum: Register CPU port with devlink
mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
net: ena: fix incorrect update of intr_delay_resolution
net: ena: fix retrieval of nonadaptive interrupt moderation intervals
net: ena: fix update of interrupt moderation register
net: ena: remove all old adaptive rx interrupt moderation code from ena_com
net: ena: remove ena_restore_ethtool_params() and relevant fields
net: ena: remove old adaptive interrupt moderation code from ena_netdev
net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
net: ena: enable the interrupt_moderation in driver_supported_features
net: ena: reimplement set/get_coalesce()
net: ena: switch to dim algorithm for rx adaptive interrupt moderation
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
ethtool: implement Energy Detect Powerdown support via phy-tunable
xen-netfront: do not assume sk_buff_head list is empty in error handling
s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
net: ena: don't wake up tx queue when down
drop_monitor: Better sanitize notified packets
...
As recommended by Azure host team, the bytes 4, 5 have more uniqueness
(info entropy) than bytes 8, 9 so use them as the PCI domain numbers.
On older hosts, bytes 4, 5 can also be used -- no backward compatibility
issues are introduced and the chance of collision is greatly reduced.
In the rare cases of collision, the driver code detects and finds
another number that is not in use.
Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Sasha Levin <sashal@kernel.org>
This interface driver is a helper driver allows other drivers to
have a common interface with the Hyper-V PCI frontend driver.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Windows SR-IOV provides a backchannel mechanism in software for communication
between a VF driver and a PF driver. These "configuration blocks" are
similar in concept to PCI configuration space, but instead of doing reads and
writes in 32-bit chunks through a very slow path, packets of up to 128 bytes
can be sent or received asynchronously.
Nearly every SR-IOV device contains just such a communications channel in
hardware, so using this one in software is usually optional. Using the
software channel, however, allows driver implementers to leverage software
tools that fuzz the communications channel looking for vulnerabilities.
The usage model for these packets puts the responsibility for reading or
writing on the VF driver. The VF driver sends a read or a write packet,
indicating which "block" is being referred to by number.
If the PF driver wishes to initiate communication, it can "invalidate" one or
more of the first 64 blocks. This invalidation is delivered via a callback
supplied by the VF driver by this driver.
No protocol is implied, except that supplied by the PF and VF drivers.
Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in Azure cloud, for passthrough devices, the host sets the
device instance ID's bytes 8 - 15 to a value derived from the host HWID,
which is the same on all devices in a VM. So, the device instance ID's
bytes 8 and 9 provided by the host are no longer unique. This affects
all Azure hosts since July 2018, and can cause device passthrough to VMs
to fail because the bytes 8 and 9 are used as PCI domain number.
Collision of domain numbers will cause the second device with the same
domain number fail to load.
In the cases of collision, we will detect and find another number that is
not in use.
Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Sasha Levin <sashal@kernel.org>
The slot must be removed before the pci_dev is removed, otherwise a panic
can happen due to use-after-free.
Fixes: 15becc2b56 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: stable@vger.kernel.org
To allocate its fwnode that is then used to allocate an irqdomain,
the driver uses irq_domain_alloc_fwnode(), passing it a VA as an
identifier. This is a rather bad idea, as this address ends up
published in debugfs (and we want to move away from VAs there
anyway).
Instead, let's allocate a named fwnode by using the device GUID as
an identifier. It is allegedly unique, and can be traced back to
the original device.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fix a use-after-free in hv_eject_device_work().
Fixes: 05f151a73e ("PCI: hv: Fix a memory leak in hv_eject_device_work()")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: stable@vger.kernel.org
When we hot-remove a device, usually the host sends us a PCI_EJECT message,
and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0.
When we execute the quick hot-add/hot-remove test, the host may not send
us the PCI_EJECT message if the guest has not fully finished the
initialization by sending the PCI_RESOURCES_ASSIGNED* message to the
host, so it's potentially unsafe to only depend on the
pci_destroy_slot() in hv_eject_device_work() because the code path
create_root_hv_pci_bus()
-> hv_pci_assign_slots()
is not called in this case. Note: in this case, the host still sends the
guest a PCI_BUS_RELATIONS message with bus_rel->device_count == 0.
In the quick hot-add/hot-remove test, we can have such a race before
the code path
pci_devices_present_work()
-> new_pcichild_device()
adds the new device into the hbus->children list, we may have already
received the PCI_EJECT message, and since the tasklet handler
hv_pci_onchannelcallback()
may fail to find the "hpdev" by calling
get_pcichild_wslot(hbus, dev_message->wslot.slot)
hv_pci_eject_device() is not called; Later, by continuing execution
create_root_hv_pci_bus()
-> hv_pci_assign_slots()
creates the slot and the PCI_BUS_RELATIONS message with
bus_rel->device_count == 0 removes the device from hbus->children, and
we end up being unable to remove the slot in
hv_pci_remove()
-> hv_pci_remove_slots()
Remove the slot in pci_devices_present_work() when the device
is removed to address this race.
pci_devices_present_work() and hv_eject_device_work() run in the
singled-threaded hbus->wq, so there is not a double-remove issue for the
slot.
We cannot offload hv_pci_eject_device() from hv_pci_onchannelcallback()
to the workqueue, because we need the hv_pci_onchannelcallback()
synchronously call hv_pci_eject_device() to poll the channel
ringbuffer to work around the "hangs in hv_compose_msi_msg()" issue
fixed in commit de0aa7b2f9 ("PCI: hv: Fix 2 hang issues in
hv_compose_msi_msg()")
Fixes: a15f2c08c7 ("PCI: hv: support reporting serial number as slot information")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
[lorenzo.pieralisi@arm.com: rewritten commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: stable@vger.kernel.org