Currently the reserved bits of the Processor Compatibility
Register (PCR) are cleared as per the Programming Note in Section
1.3.3 of version 3.0B of the Power ISA. This causes all new
architecture features to be made available when running on newer
processors with new architecture features added to the PCR as bits
must be set to disable a given feature.
For example to disable new features added as part of Version 2.07 of
the ISA the corresponding bit in the PCR needs to be set.
As new processor features generally require explicit kernel support
they should be disabled until such support is implemented. Therefore
kernels should set all unknown/reserved bits in the PCR such that any
new architecture features which the kernel does not currently know
about get disabled.
An update is planned to the ISA to clarify that the PCR is an
exception to the Programming Note on reserved bits in Section 1.3.3.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
When an SVM makes an hypercall or incurs some other exception, the
Ultravisor usually forwards (a.k.a. reflects) the exceptions to the
Hypervisor. After processing the exception, Hypervisor uses the
UV_RETURN ultracall to return control back to the SVM.
The expected register state on entry to this ultracall is:
* Non-volatile registers are restored to their original values.
* If returning from an hypercall, register R0 contains the return value
(unlike other ultracalls) and, registers R4 through R12 contain any
output values of the hypercall.
* R3 contains the ultracall number, i.e UV_RETURN.
* If returning with a synthesized interrupt, R2 contains the
synthesized interrupt number.
Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-8-cclaudio@linux.ibm.com
At present, when running a guest on POWER9 using HV KVM but not using
an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
is run with the kernel_irqchip=off option, the guest entry code goes
ahead and tries to load the guest context into the XIVE hardware, even
though no context has been set up.
To fix this, we check that the "CAM word" is non-zero before pushing
it to the hardware. The CAM word is initialized to a non-zero value
in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.
Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
Escalation interrupts are interrupts sent to the host by the XIVE
hardware when it has an interrupt to deliver to a guest VCPU but that
VCPU is not running anywhere in the system. Hence we disable the
escalation interrupt for the VCPU being run when we enter the guest
and re-enable it when the guest does an H_CEDE hypercall indicating
it is idle.
It is possible that an escalation interrupt gets generated just as we
are entering the guest. In that case the escalation interrupt may be
using a queue entry in one of the interrupt queues, and that queue
entry may not have been processed when the guest exits with an H_CEDE.
The existing entry code detects this situation and does not clear the
vcpu->arch.xive_esc_on flag as an indication that there is a pending
queue entry (if the queue entry gets processed, xive_esc_irq() will
clear the flag). There is a comment in the code saying that if the
flag is still set on H_CEDE, we have to abort the cede rather than
re-enabling the escalation interrupt, lest we end up with two
occurrences of the escalation interrupt in the interrupt queue.
However, the exit code doesn't do that; it aborts the cede in the sense
that vcpu->arch.ceded gets cleared, but it still enables the escalation
interrupt by setting the source's PQ bits to 00. Instead we need to
set the PQ bits to 10, indicating that an interrupt has been triggered.
We also need to avoid setting vcpu->arch.xive_esc_on in this case
(i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
xive_esc_irq() will run at some point and clear it, and if we race with
that we may end up with an incorrect result (i.e. xive_esc_on set when
the escalation interrupt has just been handled).
It is extremely unlikely that having two queue entries would cause
observable problems; theoretically it could cause queue overflow, but
the CPU would have to have thousands of interrupts targetted to it for
that to be possible. However, this fix will also make it possible to
determine accurately whether there is an unhandled escalation
interrupt in the queue, which will be needed by the following patch.
Fixes: 9b9b13a6d1 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
Seven fixes, all for bugs introduced this cycle.
The commit to add KASAN support broke booting on 32-bit SMP machines, due to a
refactoring that moved some setup out of the secondary CPU path.
A fix for another 32-bit SMP bug introduced by the fast syscall entry
implementation for 32-bit BOOKE. And a build fix for the same commit.
Our change to allow the DAWR to be force enabled on Power9 introduced a bug in
KVM, where we clobber r3 leading to a host crash.
The same commit also exposed a previously unreachable bug in the nested KVM
handling of DAWR, which could lead to an oops in a nested host.
One of the DMA reworks broke the b43legacy WiFi driver on some people's
powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.
A fix for TLB flushing in KVM introduced a new bug, as it neglected to also
flush the ERAT, this could lead to memory corruption in the guest.
Thanks to:
Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry Finger, Michael
Neuling, Suraj Jitindar Singh.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdDg4YAAoJEFHr6jzI4aWACKYP/RG1cqYDjEWz0N9bxjAOanx6
z//hPZqZrObEORx0mek07LNj6JDy4eL7CB9WaEudJjHt7mYugLYq0g7hUMVvBWnB
irFEuzGJ8EgWl1aMbmz+fgf49PBIuroy2o/4pyzzQXoDaw44QyUaCke2VEBskQNG
RW64C2rDVrPgpRHzBB9EZVNv7svmo6ERJsEpRvqP3PZG1ZxgXW+DXbEdSmJCcgAt
8oI+z6frRv+0ez+nge7TULo8DuheShfxc7l0jFrd48i35v2qB/IowPr8cof9fRwM
TqnB+3dZXHPKPz6J9mz80p9ZDe1omLzg6i9EiR2/7a3XGpRBo7kCg3Iri7N5pu0j
LotK9l1+mXWLy5P6lOHH5/tEHv52Wqsvh5IetpNJ2tgXp3MzbOc1/Ut9h7Ag7cRw
WRa7tNXQ5Ud8uPM1Pds8Ymhd+/nZ9RItjGcu6S095/OGpM1FJR9a0QnfUHMyfyuX
kAGrJDWcAkCd/Q9tKHsQotuZAFmRCQe4JFkzTiGzwdjYWYgtTA1c/eIv3+SG7eLV
1dsaIYzIS56b+Qz2Qc/pKHwho+I9o505Y7LFXxlCGXDDjyI72ioTQDwiSBzaZdc9
ORwNchLfpXNpiNXRoRqAnqmhWxavYmA6oJ13RDBiMBxIUWHynVbEzLlX9fPNdBFj
Kw3Zd15znokXBzU+1mDE
=Ju1y
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"This is a frustratingly large batch at rc5. Some of these were sent
earlier but were missed by me due to being distracted by other things,
and some took a while to track down due to needing manual bisection on
old hardware. But still we clearly need to improve our testing of KVM,
and of 32-bit, so that we catch these earlier.
Summary: seven fixes, all for bugs introduced this cycle.
- The commit to add KASAN support broke booting on 32-bit SMP
machines, due to a refactoring that moved some setup out of the
secondary CPU path.
- A fix for another 32-bit SMP bug introduced by the fast syscall
entry implementation for 32-bit BOOKE. And a build fix for the same
commit.
- Our change to allow the DAWR to be force enabled on Power9
introduced a bug in KVM, where we clobber r3 leading to a host
crash.
- The same commit also exposed a previously unreachable bug in the
nested KVM handling of DAWR, which could lead to an oops in a
nested host.
- One of the DMA reworks broke the b43legacy WiFi driver on some
people's powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.
- A fix for TLB flushing in KVM introduced a new bug, as it neglected
to also flush the ERAT, this could lead to memory corruption in the
guest.
Thanks to: Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry
Finger, Michael Neuling, Suraj Jitindar Singh"
* tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode
KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
powerpc/32: fix build failure on book3e with KVM
powerpc/booke: fix fast syscall entry on SMP
powerpc/32s: fix initial setup of segment registers on secondary CPU
The hcall H_SET_DAWR is used by a guest to set the data address
watchpoint register (DAWR). This hcall is handled in the host in
kvmppc_h_set_dawr() which can be called in either real mode on the
guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S,
or in virtual mode when called from kvmppc_pseries_do_hcall() in
book3s_hv.c.
The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in
the vcpu struct accordingly and then also writes the respective values
into the DAWR and DAWRX registers directly. It is necessary to write
the registers directly here when calling the function in real mode
since the path to re-enter the guest won't do this. However when in
virtual mode the host DAWR and DAWRX values have already been
restored, and so writing the registers would overwrite these.
Additionally there is no reason to write the guest values here as
these will be read from the vcpu struct and written to the registers
appropriately the next time the vcpu is run.
This also avoids the case when handling h_set_dawr for a nested guest
where the guest hypervisor isn't able to write the DAWR and DAWRX
registers directly and must rely on the real hypervisor to do this for
it when it calls H_ENTER_NESTED.
Fixes: c1fe190c06 ("powerpc: Add force enable of DAWR on P9 option")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 655 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* POWER: support for direct access to the POWER9 XIVE interrupt controller,
memory and performance optimizations.
* x86: support for accessing memory not backed by struct page, fixes and refactoring
* Generic: dirty page tracking improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg
PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE
nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy
Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7
eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n
hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I=
=D0+p
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- support for SVE and Pointer Authentication in guests
- PMU improvements
POWER:
- support for direct access to the POWER9 XIVE interrupt controller
- memory and performance optimizations
x86:
- support for accessing memory not backed by struct page
- fixes and refactoring
Generic:
- dirty page tracking improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
kvm: fix compilation on aarch64
Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
kvm: x86: Fix L1TF mitigation for shadow MMU
KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
tests: kvm: Add tests for KVM_SET_NESTED_STATE
KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
tests: kvm: Add tests to .gitignore
KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
KVM: Fix the bitmap range to copy during clear dirty
KVM: arm64: Fix ptrauth ID register masking logic
KVM: x86: use direct accessors for RIP and RSP
KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
KVM: x86: Omit caching logic for always-available GPRs
kvm, x86: Properly check whether a pfn is an MMIO or not
...
Commit 70ea13f6e6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix
threads", 2019-04-29) aimed to make radix guests that are using the
real-mode entry path load the LPID register and flush the TLB in the
same place where those things are done for HPT guests. However, it
omitted to remove a branch which branches around that code for radix
guests. The result is that with indep_thread_mode = N, radix guests
don't run correctly. (With indep_threads_mode = Y, which is the
default, radix guests use a different entry path.)
This removes the offending branch, and also the load and compare that
the branch depends on, since the cr7 setting is now unused.
Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Fixes: 70ea13f6e6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code.
Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.
The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.
This is not a strict translation to C code, there are some
significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
or change MSR is restored, because it's now simple to do. ESL=1
sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style
rather than branchy. nap_state_lost is not required because KVM
always returns via NVGPR restoring path.
- KVM secondary wakeup from offline sequence is moved entirely into
the offline wakeup, which avoids a hwsync in the normal idle wakeup
path.
Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. Deep states
it's in the noise compared with other latencies.
KVM improvements:
- Idle sleepers now always return to caller rather than branch out
to KVM first.
- This allows optimisations like very fast return to caller when no
state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR
save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the
normal host idle code and into the not-performance-critical offline
code.
- KVM nap code now returns from where it is called, which makes the
flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This merges in the ppc-kvm topic branch from the powerpc tree to get
patches which touch both general powerpc code and KVM code, one of
which is a prerequisite for following patches.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
If those guests are in radix mode, we fail to load the LPID and flush
the TLB if necessary, leading to the guest crashing with an
unsupported MMU fault. This arises from commit 9a4506e11b ("KVM:
PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
with relocation on", 2018-05-17), which didn't consider the case
where indep_threads_mode = N.
For simplicity, this makes the real-mode guest entry path flush the
TLB in the same place for both radix and hash guests, as we did before
9a4506e11b, though the code is now C code rather than assembly code.
We also have the radix TLB flush open-coded rather than calling
radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
called in real mode, and in real mode we don't want to invoke the
tracepoint code.
Fixes: 9a4506e11b ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This replaces assembler code in book3s_hv_rmhandlers.S that checks
the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
with C code in book3s_hv_builtin.c. Note that unlike the radix
version, the hash version doesn't do an explicit ERAT invalidation
because we will invalidate and load up the SLB before entering the
guest, and that will invalidate the ERAT.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
context to the hardware currently assumes it is being called in real
mode, which is usually true. There is however a path by which it can
be executed in virtual mode, in the case where indep_threads_mode = N.
A virtual CPU executing on an offline secondary thread can take a
hypervisor interrupt in virtual mode and return from the
kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
It is possible for it to be given another vcpu to execute before it
gets to execute the stop instruction. In that case it will call
kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
vCPU push code will be executed in virtual mode. The result in that
case will be a host crash due to an unexpected data storage interrupt
caused by executing the stdcix instruction in virtual mode.
This fixes it by adding a code path for virtual mode, which uses the
virtual TIMA pointer and normal load/store instructions.
[paulus@ozlabs.org - wrote patch description]
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.
The in-kernel real mode handler halves the time to handle this H_CALL
compared to handling it in userspace for a hash guest.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a flag so that the DAWR can be enabled on P9 via:
echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
The DAWR was previously force disabled on POWER9 in:
9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt
This is a dangerous setting, USE AT YOUR OWN RISK.
Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR. This
allows them to force enable DAWR.
This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine once again safe from crashing.
Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.
For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.
To double check the DAWR is working, run this kernel selftest:
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
series of commits that touch both general arch/powerpc code and KVM
code. These commits will be merged both via the KVM tree and the
powerpc tree.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When the hash MMU is active the AMR, IAMR and UAMOR are used for
pkeys. The AMR is directly writable by user space, and the UAMOR masks
those writes, meaning both registers are effectively user register
state. The IAMR is used to create an execute only key.
Also we must maintain the value of at least the AMR when running in
process context, so that any memory accesses done by the kernel on
behalf of the process are correctly controlled by the AMR.
Although we are correctly switching all registers when going into a
guest, on returning to the host we just write 0 into all regs, except
on Power9 where we restore the IAMR correctly.
This could be observed by a user process if it writes the AMR, then
runs a guest and we then return immediately to it without
rescheduling. Because we have written 0 to the AMR that would have the
effect of granting read/write permission to pages that the process was
trying to protect.
In addition, when using the Radix MMU, the AMR can prevent inadvertent
kernel access to userspace data, writing 0 to the AMR disables that
protection.
So save and restore AMR, IAMR and UAMOR.
Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Currently trying to build without IOMMU support will fail:
(.text+0x1380): undefined reference to `kvmppc_h_get_tce'
(.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
(.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
(.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
This happens because turning off IOMMU support will prevent
book3s_64_vio_hv.c from being built because it is only built when
SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
Fix it using ifdefs for the undefined references.
Fixes: 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.
Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred. We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).
All that machine_check_queue_event() actually does is to cause the
event to be printed to the console. For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.
The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest. We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.
The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable. In that case we now always
synthesize a machine check interrupt for the guest. Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred. If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem. Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.
Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we are running as a nested hypervisor, we use a hypercall to
enter the guest rather than code in book3s_hv_rmhandlers.S. This means
that the hypercall handlers listed in hcall_real_table never get called.
There are some hypercalls that are handled there and not in
kvmppc_pseries_do_hcall(), which therefore won't get processed for
a nested guest.
To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
hypercalls, with the following exceptions:
- The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
we only support radix mode for nested guests.
- H_CEDE has to be handled specially because the cede logic in
kvmhv_run_single_vcpu assumes that it has been processed by the time
that kvmhv_p9_guest_entry() returns. Therefore we put a special
case for H_CEDE in kvmhv_p9_guest_entry().
For the XICS hypercalls, if real-mode processing is enabled, then the
virtual-mode handlers assume that they are being called only to finish
up the operation. Therefore we turn off the real-mode flag in the XICS
code when running as a nested hypervisor.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests. The hypercall supplies
register values in two structs. Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.
Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness. If they are of different endianness, the version number
check will fail and the hcall will be rejected.
Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and treat it as Y.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits. This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y. In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.
The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered. On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE. At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.
The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions. This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved. The
condition register and DSCR are now always preserved.
With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero. So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.
Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9. This enables the code to be simplified
quite a bit.
_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest. With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.
Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path. The calling
convention is made compatible with C.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is based on a patch by Suraj Jitindar Singh.
This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c. This is to
make future maintenance and modification easier. The algorithm
expressed in the C code is almost identical to the previous
algorithm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller). The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.
In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller. That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit. This
patch does that.
With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt. In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
files not using feature fixup don't need asm/feature-fixups.h
files using feature fixup need asm/feature-fixups.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves ASM_CONST() and stringify_in_c() into
dedicated asm-const.h, then cleans all related inclusions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: asm-compat.h should include asm-const.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD1 was never a product. It is no longer supported by upstream
firmware, and it is not effectively supported in Linux due to lack of
testing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Remove arch_make_huge_pte() entirely]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
Currently __kvmppc_save/restore_tm() APIs can only be invoked from
assembly function. This patch adds C function wrappers for them so
that they can be safely called from C function.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
HV KVM and PR KVM need different MSR source to indicate whether
treclaim. or trecheckpoint. is necessary.
This patch add new parameter (guest MSR) for these kvmppc_save_tm/
kvmppc_restore_tm() APIs:
- For HV KVM, it is VCPU_MSR
- For PR KVM, it is current host MSR or VCPU_SHADOW_SRR1
This enhancement enables these 2 APIs to be reused by PR KVM later.
And the patch keeps HV KVM logic unchanged.
This patch also reworks kvmppc_save_tm()/kvmppc_restore_tm() to
have a clean ABI: r3 for vcpu and r4 for guest_msr.
During kvmppc_save_tm/kvmppc_restore_tm(), the R1 need to be saved
or restored. Currently the R1 is saved into HSTATE_HOST_R1. In PR
KVM, we are going to add a C function wrapper for
kvmppc_save_tm/kvmppc_restore_tm() where the R1 will be incremented
with added stackframe and save into HSTATE_HOST_R1. There are several
places in HV KVM to load HSTATE_HOST_R1 as R1, and we don't want to
bring risk or confusion by TM code.
This patch will use HSTATE_SCRATCH2 to save/restore R1 in
kvmppc_save_tm/kvmppc_restore_tm() to avoid future confusion, since
the r1 is actually a temporary/scratch value to be saved/stored.
[paulus@ozlabs.org - rebased on top of 7b0e827c69 ("KVM: PPC: Book3S HV:
Factor fake-suspend handling out of kvmppc_save/restore_tm", 2018-05-30)]
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It is a simple patch just for moving kvmppc_save_tm/kvmppc_restore_tm()
functionalities to tm.S. There is no logic change. The reconstruct of
those APIs will be done in later patches to improve readability.
It is for preparation of reusing those APIs on both HV/PR PPC KVM.
Some slight change during move the functions includes:
- surrounds some HV KVM specific code with CONFIG_KVM_BOOK3S_HV_POSSIBLE
for compilation.
- use _GLOBAL() to define kvmppc_save_tm/kvmppc_restore_tm()
[paulus@ozlabs.org - rebased on top of 7b0e827c69 ("KVM: PPC: Book3S HV:
Factor fake-suspend handling out of kvmppc_save/restore_tm", 2018-05-30)]
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This splits out the handling of "fake suspend" mode, part of the
hypervisor TM assist code for POWER9, and puts almost all of it in
new kvmppc_save_tm_hv and kvmppc_restore_tm_hv functions. The new
functions branch to kvmppc_save/restore_tm if the CPU does not
require hypervisor TM assistance.
With this, it will be more straightforward to move kvmppc_save_tm and
kvmppc_restore_tm to another file and use them for transactional
memory support in PR KVM. Additionally, it also makes the code a
bit clearer and reduces the number of feature sections.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When CONFIG_RELOCATABLE=n, the Linux real mode interrupt handlers call
into KVM using real address. This needs to be translated to the kernel
linear effective address before the MMU is switched on.
kvmppc_bad_host_intr misses adding these bits, so when it is used to
handle a system reset interrupt (that always gets delivered in real
mode), it results in an instruction access fault immediately after
the MMU is turned on.
Fix this by ensuring the top 2 address bits are set when the MMU is
turned on.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The radix guest code can has fewer restrictions about what context it
can run in, so move this flushing out of assembly and have it use the
Linux TLB flush implementations introduced previously.
This allows powerpc:tlbie trace events to be used.
This changes the tlbiel sequence to only execute RIC=2 flush once on
the first set flushed, then RIC=0 for the rest of the sets. The end
result of the flush should be unchanged. This matches the local PID
flush pattern that was introduced in a5998fcb92 ("powerpc/mm/radix:
Optimise tlbiel flush all case").
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A radix guest can execute tlbie instructions to invalidate TLB entries.
After a tlbie or a group of tlbies, it must then do the architected
sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
has been processed by all CPUs in the system before it can rely on
no CPU using any translation that it just invalidated.
In fact it is the ptesync which does the actual synchronization in
this sequence, and hardware has a requirement that the ptesync must
be executed on the same CPU thread as the tlbies which it is expected
to order. Thus, if a vCPU gets moved from one physical CPU to
another after it has done some tlbies but before it can get to do the
ptesync, the ptesync will not have the desired effect when it is
executed on the second physical CPU.
To fix this, we do a ptesync in the exit path for radix guests. If
there are any pending tlbies, this will wait for them to complete.
If there aren't, then ptesync will just do the same as sync.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit. Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore. If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.
To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit. Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.
In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads. Although no specific incorrect behaviour has
been observed, this is a race which should be fixed. To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset. That way we are sure that the timebase contains the guest
timebase value in both cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
During guest entry/exit, we switch over to/from the guest MMU context
and we cannot take exceptions in the hypervisor code.
Since ftrace may be enabled and since it can result in us taking a trap,
disable ftrace by setting paca->ftrace_enabled to zero. There are two
paths through which we enter/exit a guest:
1. If we are the vcore runner, then we enter the guest via
__kvmppc_vcore_entry() and we disable ftrace around this. This is always
the case for Power9, and for the primary thread on Power8.
2. If we are a secondary thread in Power8, then we would be in nap due
to SMT being disabled. We are woken up by an IPI to enter the guest. In
this scenario, we enter the guest through kvm_start_guest(). We disable
ftrace at this point. In this scenario, ftrace would only get re-enabled
on the secondary thread when SMT is re-enabled (via start_secondary()).
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Notable changes:
- Support for 4PB user address space on 64-bit, opt-in via mmap().
- Removal of POWER4 support, which was accidentally broken in 2016 and no one
noticed, and blocked use of some modern instructions.
- Workarounds so that the hypervisor can enable Transactional Memory on Power9.
- A series to disable the DAWR (Data Address Watchpoint Register) on Power9.
- More information displayed in the meltdown/spectre_v1/v2 sysfs files.
- A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome.
- A big series to make the allocation of our pacas (per cpu area), kernel page
tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9.
And as usual many fixes, reworks and cleanups.
Thanks to:
Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy
Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin
Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens,
Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă,
Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier,
Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus
Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de
Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool,
Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar
Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant
Hegde, Wei Yongjun.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJayKxDExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
JQ/6A9Xs4zHDn9OeT9esEIxciETqUlrP0Wp64c4JVC7EkG1E7xRDZ4Xb4m8R2nNt
9sPhtNO1yCtEk6kFQtPNB0N8v6pud4I6+aMcYnn+tP8mJRYQ4x9bYaF3Hw98IKmE
Kd6TglmsUQvh2GpwPiF93KpzzWu1HB2kZzzqJcAMTMh7C79Qz00BjrTJltzXB2jx
tJ+B4lVy8BeU8G5nDAzJEEwb5Ypkn8O40rS/lpAwVTYOBJ8Rbyq8Fj82FeREK9YO
4EGaEKPkC/FdzX7OJV3v2/nldCd8pzV471fAoGuBUhJiJBMBoBybcTHIdDex7LlL
zMLV1mUtGo8iolRPhL8iCH+GGifZz2WzstYCozz7hgIraWtc/frq9rZp6q0LdH/K
trk7UbPGlVb92ecWZVpZyEcsMzKrCgZqnAe9wRNh1uEKScEdzd/bmRaMhENUObRh
Hili6AVvmSKExpy7k2sZP/oUMaeC15/xz8Lk7l8a/iCkYhNmPYh5iSXM5+UKpcRT
FYOcO0o3DwXsN46Whow3nJ7TqAsDy9/ecPUG71JQi3ZrHnRrm8jxkn8MCG5pZ1Fi
KvKDxlg6RiJo3DF9/fSOpJUokvMwqBS5dJo4eh5eiDy94aBTqmBKFecvPxQm7a0L
l3uXCF/6JuXEvMukFjGBO4RiYhw8i+B2uKsh81XUh7HKrgE=
=HAB1
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for 4PB user address space on 64-bit, opt-in via mmap().
- Removal of POWER4 support, which was accidentally broken in 2016
and no one noticed, and blocked use of some modern instructions.
- Workarounds so that the hypervisor can enable Transactional Memory
on Power9.
- A series to disable the DAWR (Data Address Watchpoint Register) on
Power9.
- More information displayed in the meltdown/spectre_v1/v2 sysfs
files.
- A vpermxor (Power8 Altivec) implementation for the raid6 Q
Syndrome.
- A big series to make the allocation of our pacas (per cpu area),
kernel page tables, and per-cpu stacks NUMA aware when using the
Radix MMU on Power9.
And as usual many fixes, reworks and cleanups.
Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"
* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
cxl: Fix possible deadlock when processing page faults from cxllib
powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
powerpc/mm/radix: Update command line parsing for disable_radix
powerpc/mm/radix: Parse disable_radix commandline correctly.
powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
powerpc/mm/keys: Update documentation and remove unnecessary check
powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
powerpc/powernv: Always stop secondaries before reboot/shutdown
powerpc: hard disable irqs in smp_send_stop loop
powerpc: use NMI IPI for smp_send_stop
powerpc/powernv: Fix SMT4 forcing idle code
...
SLOF checks for 'sc 1' (hypercall) support by issuing a hcall with
H_SET_DABR. Since the recent commit e8ebedbf31 ("KVM: PPC: Book3S
HV: Return error from h_set_dabr() on POWER9") changed H_SET_DABR to
return H_UNSUPPORTED on Power9, we see guest boot failures, the
symptom is the boot seems to just stop in SLOF, eg:
SLOF ***************************************************************
QEMU Starting
Build Date = Sep 24 2017 12:23:07
FW Version = buildd@ release 20170724
<no further output>
SLOF can cope if H_SET_DABR returns H_HARDWARE. So wwitch the return
value to H_HARDWARE instead of H_UNSUPPORTED so that we don't break
the guest boot.
That does mean we return a different error to PowerVM in this case,
but that's probably not a big concern.
Fixes: e8ebedbf31 ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.
This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
The "lppaca" is a structure registered with the hypervisor. This is
unnecessary when running on non-virtualised platforms. One field from
the lppaca (pmcregs_in_use) is also used by the host, so move the host
part out into the paca (lppaca field is still updated in
guest mode).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix non-pseries build with some #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 with the DAWR disabled causes problems for partition
migration. Either we have to fail the migration (since we lose the
DAWR) or we silently drop the DAWR and allow the migration to pass.
This patch does the latter and allows the migration to pass (at the
cost of silently losing the DAWR). This is not ideal but hopefully the
best overall solution. This approach has been acked by Paulus.
With this patch kvmppc_set_one_reg() will store the DAWR in the vcpu
but won't actually set it on POWER9 hardware.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER7 compat mode guests can use h_set_dabr on POWER9. POWER9 should
use the DAWR but since it's disabled there we can't.
This returns H_UNSUPPORTED on a h_set_dabr() on POWER9 where the DAWR
is disabled.
Current Linux guests ignore this error, so they will silently not get
the DAWR (sigh). The same error code is being used by POWERVM in this
case.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
where the contents of the TEXASR can get corrupted while a thread is
in fake suspend state. The workaround is for the instruction emulation
code to use the value saved at the most recent guest exit in real
suspend mode. We achieve this by simply not saving the TEXASR into
the vcpu struct on an exit in fake suspend state. We also have to
take care to set the orig_texasr field only on guest exit in real
suspend state.
This also means that on guest entry in fake suspend state, TEXASR
will be restored to the value it had on the last exit in real suspend
state, effectively counteracting any hardware-caused corruption. This
works because TEXASR may not be written in suspend state.
With this, the guest might see the wrong values in TEXASR if it reads
it while in suspend state, but will see the correct value in
non-transactional state (e.g. after a treclaim), and treclaim will
work correctly.
With this workaround, the code will actually run slightly faster, and
will operate correctly on systems without the TEXASR bug (since TEXASR
may not be written in suspend state, and is only changed by failure
recording, which will have already been done before we get into fake
suspend state). Therefore these changes are not made subject to a CPU
feature bit.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
where a treclaim performed in fake suspend mode can cause subsequent
reads from the XER register to return inconsistent values for the SO
(summary overflow) bit. The inconsistent SO bit state can potentially
be observed on any thread in the core. We have to do the treclaim
because that is the only way to get the thread out of suspend state
(fake or real) and into non-transactional state.
The workaround for the bug is to force the core into SMT4 mode before
doing the treclaim. This patch adds the code to do that, conditional
on the CPU_FTR_P9_TM_XER_SO_BUG feature bit.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit 6964e6a4e4 ("KVM: PPC: Book3S HV: Do SLB load/unload
with guest LPCR value loaded", 2018-01-11), we have been seeing
occasional machine check interrupts on POWER8 systems when running
KVM guests, due to SLB multihit errors.
This turns out to be due to the guest exit code reloading the host
SLB entries from the SLB shadow buffer when the SLB was not previously
cleared in the guest entry path. This can happen because the path
which skips from the guest entry code to the guest exit code without
entering the guest now does the skip before the SLB is cleared and
loaded with guest values, but the host values are loaded after the
point in the guest exit path that we skip to.
To fix this, we move the code that reloads the host SLB values up
so that it occurs just before the point in the guest exit code (the
label guest_bypass:) where we skip to from the guest entry path.
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Fixes: 6964e6a4e4 ("KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded")
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes a bug where the trap number that is returned by
__kvmppc_vcore_entry gets corrupted. The effect of the corruption
is that IPIs get ignored on POWER9 systems when the IPI is sent via
a doorbell interrupt to a CPU which is executing in a KVM guest.
The effect of the IPI being ignored is often that another CPU locks
up inside smp_call_function_many() (and if that CPU is holding a
spinlock, other CPUs then lock up inside raw_spin_lock()).
The trap number is currently held in register r12 for most of the
assembly-language part of the guest exit path. In that path, we
call kvmppc_subcore_exit_guest(), which is a C function, without
restoring r12 afterwards. Depending on the kernel config and the
compiler, it may modify r12 or it may not, so some config/compiler
combinations see the bug and others don't.
To fix this, we arrange for the trap number to be stored on the
stack from the 'guest_bypass:' label until the end of the function,
then the trap number is loaded and returned in r12 as before.
Cc: stable@vger.kernel.org # v4.8+
Fixes: fd7bacbca4 ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
ARM:
- Include icache invalidation optimizations, improving VM startup time
- Support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- A small fix for power-management notifiers, and some cosmetic changes
PPC:
- Add MMIO emulation for vector loads and stores
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- Improve the handling of escalation interrupts with the XIVE interrupt
controller
- Support decrement register migration
- Various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- Exitless interrupts for emulated devices
- Cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- Hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
features
- Show vcpu id in its anonymous inode name
- Many fixes and cleanups
- Per-VCPU MSR bitmaps (already merged through x86/pti branch)
- Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
/9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
=C/uD
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
Seven fixes that are either trivial or that address bugs that people
are actually hitting. The main ones are:
- Drop spinlocks before reading guest memory
- Fix a bug causing corruption of VCPU state in PR KVM with preemption
enabled
- Make HPT resizing work on POWER9
- Add MMIO emulation for vector loads and stores, because guests now
use these instructions in memcpy and similar routines.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJafWn0AAoJEJ2a6ncsY3GfaMsIANF0hQD8SS78WNKnoy0vnZ/X
PUXdjwHEsfkg5KdQ7o0oaa2BJHHqO3vozddmMiG14r2L1mNCHJpnVZCVV0GaEJcZ
eU8++OPK6yrsPNNpAjnrtQ0Vk4LwzoT0bftEjS3TtLt1s2uSo+R1+HLmxbxGhQUX
bZngo9wQ3cjUfAXLrPtAVhE5wTmgVOiufVRyfRsBRdFzRsAWqjY4hBtJAfwdff4r
AA5H0RCrXO6e1feKr5ElU8KzX6b7IjH9Xu868oJ1r16zZfE05PBl1X5n4XG7XDm7
xWvs8uLAB7iRv2o/ecFznYJ+Dz1NCBVzD0RmAUTqPCcVKDrxixaTkqMPFW97IAA=
=HOJR
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
Second PPC KVM update for 4.16
Seven fixes that are either trivial or that address bugs that people
are actually hitting. The main ones are:
- Drop spinlocks before reading guest memory
- Fix a bug causing corruption of VCPU state in PR KVM with preemption
enabled
- Make HPT resizing work on POWER9
- Add MMIO emulation for vector loads and stores, because guests now
use these instructions in memcpy and similar routines.
We ended up with code that did a conditional branch inside a feature
section to code outside of the feature section. Depending on how the
object file gets organized, that might mean we exceed the 14bit
relocation limit for conditional branches:
arch/powerpc/kvm/built-in.o:arch/powerpc/kvm/book3s_hv_rmhandlers.S:416:(__ftr_alt_97+0x8): relocation truncated to fit: R_PPC64_REL14 against `.text'+1ca4
So instead of doing a conditional branch outside of the feature section,
let's just jump at the end of the same, making the branch very short.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
without requiring the complex thread synchronization that earlier
CPU versions required.
- A series from Ben Herrenschmidt to improve the handling of
escalation interrupts with the XIVE interrupt controller.
- Provide for the decrementer register to be copied across on
migration.
- Various minor cleanups and bugfixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJaYXViAAoJEJ2a6ncsY3GfDhgIAIDVBZH/Ftq7eJiUSxDpqyCQ
DF/x7fNKzK/J33pu+3ntOI2gZsldExAy7vH2M27I4qLIkbI5y3vu4v8l3CDlS1LK
9dKi72zg7baozoVF5mGUNm0B1sSvZiIQlC/kaami2aPTF1GcrJ561GthzfZwxENX
TSLqOA4LkeUZh2tUsvbcUrPi6v+E4Em2lgacQcx2ioMblWz56sZu79VsUbSSw/a3
P8+pIv7EbHw+TrOZMehjCbZkOdBeZ3IRLJsdlIAfe7y4vWME/5b9uVnQS/+XQj/B
6f3rQrduGvF2P6GMjsm8gDkgE5oZ1zbKlgO4i5WApnu80MMLFlfEUN+GWuGJ95Q=
=OjGs
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
PPC KVM update for 4.16
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
without requiring the complex thread synchronization that earlier
CPU versions required.
- A series from Ben Herrenschmidt to improve the handling of
escalation interrupts with the XIVE interrupt controller.
- Provide for the decrementer register to be copied across on
migration.
- Various minor cleanups and bugfixes.
Merge our fixes branch from the 4.15 cycle.
Unusually the fixes branch saw some significant features merged,
notably the RFI flush patches, so we want the code in next to be
tested against that, to avoid any surprises when the two are merged.
There's also some other work on the panic handling that was reverted
in fixes and we now want to do properly in next, which would conflict.
And we also fix a few other minor merge conflicts.
Merge the topic branch we share with kvm-ppc, this brings in two xive
commits, one from Paul to rework HMI handling, and a minor cleanup to
drop an unused flag.
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.
Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the ppc-kvm topic branch of the powerpc tree to get
two patches which are prerequisites for the following patch series,
plus another patch which touches both powerpc and KVM code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Hypervisor maintenance interrupts (HMIs) are generated by various
causes, signalled by bits in the hypervisor maintenance exception
register (HMER). In most cases calling OPAL to handle the interrupt
is the correct thing to do, but the "debug trigger" HMIs signalled by
PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
for hardware bugs, and OPAL does not have any code to handle this
cause. The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
to work around a hardware bug in executing vector load instructions to
cache inhibited memory. In POWER9 DD2.2 chips, it is generated when
conditions are detected relating to threads being in TM (transactional
memory) suspended mode when the core SMT configuration needs to be
reconfigured.
The kernel currently has code to detect the vector CI load condition,
but only when the HMI occurs in the host, not when it occurs in a
guest. If a HMI occurs in the guest, it is always passed to OPAL, and
then we always re-sync the timebase, because the HMI cause might have
been a timebase error, for which OPAL would re-sync the timebase, thus
removing the timebase offset which KVM applied for the guest. Since
we don't know what OPAL did, we don't know whether to subtract the
timebase offset from the timebase, so instead we re-sync the timebase.
This adds code to determine explicitly what the cause of a debug
trigger HMI will be. This is based on a new device-tree property
under the CPU nodes called ibm,hmi-special-triggers, if it is
present, or otherwise based on the PVR (processor version register).
The handling of debug trigger HMIs is pulled out into a separate
function which can be called from the KVM guest exit code. If this
function handles and clears the HMI, and no other HMI causes remain,
then we skip calling OPAL and we proceed to subtract the guest
timebase offset from the timebase.
The overall handling for HMIs that occur in the host (i.e. not in a
KVM guest) is largely unchanged, except that we now don't set the flag
for the vector CI load workaround on DD2.2 processors.
This also removes a BUG_ON in the KVM code. BUG_ON is generally not
useful in KVM guest entry/exit code since it is difficult to handle
the resulting trap gracefully.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the code that loads and unloads the guest SLB values so that
it is done while the guest LPCR value is loaded in the LPCR register.
The reason for doing this is that on POWER9, the behaviour of the
slbmte instruction depends on the LPCR[UPRT] bit. If UPRT is 1, as
it is for a radix host (or guest), the SLB index is truncated to
2 bits. This means that for a HPT guest on a radix host, the SLB
was not being loaded correctly, causing the guest to crash.
The SLB is now loaded much later in the guest entry path, after the
LPCR is loaded, which for a secondary thread is after it sees that
the primary thread has switched the MMU to the guest. The loop that
waits for the primary thread has a branch out to the exit code that
is taken if it sees that other threads have commenced exiting the
guest. Since we have now not loaded the SLB at this point, we make
this path branch to a new label 'guest_bypass' and we move the SLB
unload code to before this label.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes a bug where it is possible to enter a guest on a POWER9
system without having the XIVE (interrupt controller) context loaded.
This can happen because we unload the XIVE context from the CPU
before doing the real-mode handling for machine checks. After the
real-mode handler runs, it is possible that we re-enter the guest
via a fast path which does not load the XIVE context.
To fix this, we move the unloading of the XIVE context to come after
the real-mode machine check handler is called.
Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On Book3S in HV mode, we don't use the vcpu->arch.dec field at all.
Instead, all logic is built around vcpu->arch.dec_expires.
So let's remove the one remaining piece of code that was setting it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This merges in a couple of fixes from the kvm-ppc-fixes branch that
modify the same areas of code as some commits from the kvm-ppc-next
branch, in order to resolve the conflicts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch removes the restriction that a radix host can only run
radix guests, allowing us to run HPT (hashed page table) guests as
well. This is useful because it provides a way to run old guest
kernels that know about POWER8 but not POWER9.
Unfortunately, POWER9 currently has a restriction that all threads
in a given code must either all be in HPT mode, or all in radix mode.
This means that when entering a HPT guest, we have to obtain control
of all 4 threads in the core and get them to switch their LPIDR and
LPCR registers, even if they are not going to run a guest. On guest
exit we also have to get all threads to switch LPIDR and LPCR back
to host values.
To make this feasible, we require that KVM not be in the "independent
threads" mode, and that the CPU cores be in single-threaded mode from
the host kernel's perspective (only thread 0 online; threads 1, 2 and
3 offline). That allows us to use the same code as on POWER8 for
obtaining control of the secondary threads.
To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
struct to contain the information needed by the secondary threads.
All threads perform a barrier synchronization (where all threads wait
for every other thread to reach the synchronization point) on guest
entry, both before and after loading LPCR and LPIDR. On guest exit,
they all once again perform a barrier synchronization both before
and after loading host values into LPCR and LPIDR.
Finally, it is also currently necessary to flush the entire TLB every
time we enter a HPT guest on a radix host. We do this on thread 0
with a loop of tlbiel instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch allows for a mode on POWER9 hosts where we control all the
threads of a core, much as we do on POWER8. The mode is controlled by
a module parameter on the kvm_hv module, called "indep_threads_mode".
The normal mode on POWER9 is the "independent threads" mode, with
indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
desired SMT mode) and each thread independently enters and exits from
KVM guests without reference to what other threads in the core are
doing.
If indep_threads_mode is set to N at the point when a VM is started,
KVM will expect every core that the guest runs on to be in single
threaded mode (that is, threads 1, 2 and 3 offline), and will set the
flag that prevents secondary threads from coming online. We can still
use all four threads; the code that implements dynamic micro-threading
on POWER8 will become active in over-commit situations and will allow
up to three other VCPUs to be run on the secondary threads of the core
whenever a VCPU is run.
The reason for wanting this mode is that this will allow us to run HPT
guests on a radix host on a POWER9 machine that does not support
"mixed mode", that is, having some threads in a core be in HPT mode
while other threads are in radix mode. It will also make it possible
to implement a "strict threads" mode in future, if desired.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the ppc-kvm topic branch of the powerpc tree to get the
commit that reverts the patch "KVM: PPC: Book3S HV: POWER9 does not
require secondary thread management". This is needed for subsequent
patches which will be applied on this branch.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reverts commit 94a04bc25a.
In order to run HPT guests on a radix POWER9 host, we will have to run
the host in single-threaded mode, because POWER9 processors do not
currently support running some threads of a core in HPT mode while
others are in radix mode ("mixed mode").
That means that we will need the same mechanisms that are used on
POWER8 to make the secondary threads available to KVM, which were
disabled on POWER9 by commit 94a04bc25a.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9 systems, we push the VCPU context onto the XIVE (eXternal
Interrupt Virtualization Engine) hardware when entering a guest,
and pull the context off the XIVE when exiting the guest. The push
is done with cache-inhibited stores, and the pull with cache-inhibited
loads.
Testing has revealed that it is possible (though very rare) for
the stores to get reordered with the loads so that we end up with the
guest VCPU context still loaded on the XIVE after we have exited the
guest. When that happens, it is possible for the same VCPU context
to then get loaded on another CPU, which causes the machine to
checkstop.
To fix this, we add I/O barrier instructions (eieio) before and
after the push and pull operations. As partial compensation for the
potential slowdown caused by the extra barriers, we remove the eieio
instructions between the two stores in the push operation, and between
the two loads in the pull operation. (The architecture requires
loads to cache-inhibited, guarded storage to be kept in order, and
requires stores to cache-inhibited, guarded storage likewise to be
kept in order, but allows such loads and stores to be reordered with
respect to each other.)
Reported-by: Carol L Soto <clsoto@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present, if an interrupt (i.e. an exception or trap) occurs in the
code where KVM is switching the MMU to or from guest context, we jump
to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
In this situation, it is hard to debug what happened because we get no
indication as to which interrupt occurred or where. Typically we get
a cascade of stall and soft lockup warnings from other CPUs.
In order to get more information for debugging, this adds code to
create a stack frame on the emergency stack and save register values
to it. We start half-way down the emergency stack in order to give
ourselves some chance of being able to do a stack trace on secondary
threads that are already on the emergency stack.
On POWER7 or POWER8, we then just spin, as before, because we don't
know what state the MMU context is in or what other threads are doing,
and we can't switch back to host context without coordinating with
other threads. On POWER9 we can do better; there we load up the host
MMU context and jump to C code, which prints an oops message to the
console and panics.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- Add another case where msgsync is required.
- Required barrier sequence for global doorbells is msgsync ; lwsync
When msgsnd is used for IPIs to other cores, msgsync must be executed by
the target to order stores performed on the source before its msgsnd
(provided the source executes the appropriate sync).
Fixes: 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
Interrupt (HDSI) the HDSISR is not be updated at all.
To work around this we put a canary value into the HDSISR before
returning to a guest and then check for this canary when we take a
HDSI. If we find the canary on a HDSI, we know the hardware didn't
update the HDSISR. In this case we return to the guest to retake the
HDSI which should correctly update the HDSISR the second time HDSI
entry.
After talking to Paulus we've applied this workaround to all POWER9
CPUs. The workaround of returning to the guest shouldn't ever be
triggered on well behaving CPU. The extra instructions should have
negligible performance impact.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Aneesh Kumar reported seeing host crashes when running recent kernels
on POWER8. The symptom was an oops like this:
Unable to handle kernel paging request for data at address 0xf00000000786c620
Faulting instruction address: 0xc00000000030e1e4
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: powernv_op_panel
CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G W 4.13.0-rc7-43932-gfc36c59 #2
task: c000000fdeadfe80 task.stack: c000000fdeb68000
NIP: c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
REGS: c000000fdeb6b450 TRAP: 0300 Tainted: G W (4.13.0-rc7-43932-gfc36c59)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24044428 XER: 20000000
CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
Call Trace:
[c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
[c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
[c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
[c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
[c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
[c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
[c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
[c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
[c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
[c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
Instruction dump:
7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
---[ end trace fad4a342d0414aa2 ]---
It turns out that what has happened is that the SLB entry for the
vmmemap region hasn't been reloaded on exit from a guest, and it has
the wrong page size. Then, when the host next accesses the vmemmap
region, it gets a page fault.
Commit a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with
KVM", 2017-07-24) modified the guest exit code so that it now only clears
out the SLB for hash guest. The code tests the radix flag and puts the
result in a non-volatile CR field, CR2, and later branches based on CR2.
Unfortunately, the kvmppc_save_tm function, which gets called between
those two points, modifies all the user-visible registers in the case
where the guest was in transactional or suspended state, except for a
few which it restores (namely r1, r2, r9 and r13). Thus the hash/radix indication in CR2 gets corrupted.
This fixes the problem by re-doing the comparison just before the
result is needed. For good measure, this also adds comments next to
the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
that non-volatile register state will be lost.
Cc: stable@vger.kernel.org # v4.13
Fixes: a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with KVM")
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.
Paul Mackerras:
"It adds mutual exclusion between list_add_rcu and list_del_rcu calls
on the kvm->arch.spapr_tce_tables list. Without this, userspace could
potentially trigger corruption of the list and cause a host crash or
worse."
This merges in the 'ppc-kvm' topic branch from the powerpc tree in
order to bring in some fixes which touch both powerpc and KVM code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Commit 2f2724630f ("KVM: PPC: Book3S HV: Cope with host using large
decrementer mode", 2017-05-22) added code to treat the hypervisor
decrementer (HDEC) as a 64-bit value on POWER9 rather than 32-bit.
Unfortunately, that commit missed one place where HDEC is treated
as a 32-bit value. This fixes it.
This bug should not have any user-visible consequences that I can
think of, beyond an occasional unnecessary exit to the host kernel.
If the hypervisor decrementer has gone negative, then the bottom
32 bits will be negative for about 4 seconds after that, so as
long as we get out of the guest within those 4 seconds we won't
conclude that the HDEC interrupt is spurious.
Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Fixes: 2f2724630f ("KVM: PPC: Book3S HV: Cope with host using large decrementer mode")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals. In this instance r0 is
being used where a literal 0 should be used.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Split into separate KVM patch, tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 CPUs have independent MMU contexts per thread, so KVM does not
need to quiesce secondary threads, so the hwthread_req/hwthread_state
protocol does not have to be used. So patch it away on POWER9, and patch
away the branch from the Linux idle wakeup to kvm_start_guest that is
never used.
Add a warning and error out of kvmppc_grab_hwthread in case it is ever
called on POWER9.
This avoids a hwsync in the idle wakeup path on POWER9.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When msgsnd is used for IPIs to other cores, msgsync must be executed by
the target to order stores performed on the source before its msgsnd
(provided the source executes the appropriate sync).
Fixes: 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
There's a somewhat architectural issue with Radix MMU and KVM.
When coming out of a guest with AIL (Alternate Interrupt Location, ie,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.
The problem is that the CPU can (and will) then start prefetching or
speculatively load from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.
This can cause stale translations and subsequent crashes.
Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms but that would hurt single threaded programs for
example.
We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only use 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20 bits space.
We additionally add support for a property to indicate to Linux the
size of the PID register which will be useful if we eventually have
processors with a larger PID space available.
There is still an issue with malicious guests purposefully setting the
PID register to a value in the hosts PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:
- On the way out of a guest, before we clear the current VCPU in the
PACA, we check the PID and if it's outside of the permitted range
we flush the TLB for that PID.
- When context switching, if the mm is "new" on that CPU (the
corresponding bit was set for the first time in the mm cpumask), we
check if any sibling thread is in KVM (has a non-NULL VCPU pointer
in the PACA). If that is the case, we also flush the PID for that
CPU (core).
This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.
A future optimization could be added by keeping track of whether the
PID has ever been used and avoid doing that for completely fresh PIDs.
We could similarily mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Highlights include:
- Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
- Platform support for FSP2 (476fpe) board
- Enable ZONE_DEVICE on 64-bit server CPUs.
- Generic & powerpc spin loop primitives to optimise busy waiting
- Convert VDSO update function to use new update_vsyscall() interface
- Optimisations to hypercall/syscall/context-switch paths
- Improvements to the CPU idle code on Power8 and Power9.
As well as many other fixes and improvements.
Thanks to:
Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian
Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan,
Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo
Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul
Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell,
Thiago Jung Bauermann, Yang Li.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZXyPCAAoJEFHr6jzI4aWAI9QQAISf2x5y//cqCi4ISyQB5pTq
KLS/yQajNkQOw7c0fzBZOaH5Xd/SJ6AcKWDg8yDlpDR3+sRRsr98iIRECgKS5I7/
DxD9ywcbSoMXFQQo1ZMCp5CeuMUIJRtugBnUQM+JXCSUCPbznCY5DchDTLyTBTpO
MeMVhI//JxthhoOMA9MudiEGaYCU9ho442Z4OJUSiLUv8WRbvQX9pTqoc4vx1fxA
BWf2mflztBVcIfKIyxIIIlDLukkMzix6gSYPMCbC7lzkbnU7JSqKiheJXjo1gJS2
ePHKDxeNR2/QP0g/j3aT/MR1uTt9MaNBSX3gANE1xQ9OoJ8m1sOtCO4gNbSdLWae
eXhDnoiEp930DRZOeEioOItuWWoxFaMyYk3BMmRKV4QNdYL3y3TRVeL2/XmRGqYL
Lxz4IY/x5TteFEJNGcRX90uizNSi8AaEXPF16pUk8Ctt6eH3ZSwPMv2fHeYVCMr0
KFlKHyaPEKEoztyzLcUR6u9QB56yxDN58bvLpd32AeHvKhqyxFoySy59x9bZbatn
B2y8mmDItg860e0tIG6jrtplpOVvL8i5jla5RWFVoQDuxxrLAds3vG9JZQs+eRzx
Fiic93bqeUAS6RzhXbJ6QQJYIyhE7yqpcgv7ME1W87SPef3HPBk9xlp3yIDwdA2z
bBDvrRnvupusz8qCWrxe
=w8rj
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
- Platform support for FSP2 (476fpe) board
- Enable ZONE_DEVICE on 64-bit server CPUs.
- Generic & powerpc spin loop primitives to optimise busy waiting
- Convert VDSO update function to use new update_vsyscall() interface
- Optimisations to hypercall/syscall/context-switch paths
- Improvements to the CPU idle code on Power8 and Power9.
As well as many other fixes and improvements.
Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
Bauermann, Yang Li"
* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
powerpc/mm/hash: Implement mark_rodata_ro() for hash
powerpc/vmlinux.lds: Align __init_begin to 16M
powerpc/lib/code-patching: Use alternate map for patch_instruction()
powerpc/xmon: Add patch_instruction() support for xmon
powerpc/kprobes/optprobes: Use patch_instruction()
powerpc/kprobes: Move kprobes over to patch_instruction()
powerpc/mm/radix: Fix execute permissions for interrupt_vectors
powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
powerpc/64s: Blacklist rtas entry/exit from kprobes
powerpc/64s: Blacklist functions invoked on a trap
powerpc/64s: Un-blacklist system_call() from kprobes
powerpc/64s: Move system_call() symbol to just after setting MSR_EE
powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
powerpc/64s: Convert .L__replay_interrupt_return to a local label
powerpc64/elfv1: Only dereference function descriptor for non-text symbols
cxl: Export library to support IBM XSL
powerpc/dts: Use #include "..." to include local DT
powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
...
At present, interrupts are hard-disabled fairly late in the guest
entry path, in the assembly code. Since we check for pending signals
for the vCPU(s) task(s) earlier in the guest entry path, it is
possible for a signal to be delivered before we enter the guest but
not be noticed until after we exit the guest for some other reason.
Similarly, it is possible for the scheduler to request a reschedule
while we are in the guest entry path, and we won't notice until after
we have run the guest, potentially for a whole timeslice.
Furthermore, with a radix guest on POWER9, we can take the interrupt
with the MMU on. In this case we end up leaving interrupts
hard-disabled after the guest exit, and they are likely to stay
hard-disabled until we exit to userspace or context-switch to
another process. This was masking the fact that we were also not
setting the RI (recoverable interrupt) bit in the MSR, meaning
that if we had taken an interrupt, it would have crashed the host
kernel with an unrecoverable interrupt message.
To close these races, we need to check for signals and reschedule
requests after hard-disabling interrupts, and then keep interrupts
hard-disabled until we enter the guest. If there is a signal or a
reschedule request from another CPU, it will send an IPI, which will
cause a guest exit.
This puts the interrupt disabling before we call kvmppc_start_thread()
for all the secondary threads of this core that are going to run vCPUs.
The reason for that is that once we have started the secondary threads
there is no easy way to back out without going through at least part
of the guest entry path. However, kvmppc_start_thread() includes some
code for radix guests which needs to call smp_call_function(), which
must be called with interrupts enabled. To solve this problem, this
patch moves that code into a separate function that is called earlier.
When the guest exit is caused by an external interrupt, a hypervisor
doorbell or a hypervisor maintenance interrupt, we now handle these
using the replay facility. __kvmppc_vcore_entry() now returns the
trap number that caused the exit on this thread, and instead of the
assembly code jumping to the handler entry, we return to C code with
interrupts still hard-disabled and set the irq_happened flag in the
PACA, so that when we do local_irq_enable() the appropriate handler
gets called.
With all this, we now have the interrupt soft-enable flag clear while
we are in the guest. This is useful because code in the real-mode
hypercall handlers that checks whether interrupts are enabled will
now see that they are disabled, which is correct, since interrupts
are hard-disabled in the real-mode code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Enhance KVM to cause a guest exit with KVM_EXIT_NMI
exit reason upon a machine check exception (MCE) in
the guest address space if the KVM_CAP_PPC_FWNMI
capability is enabled (instead of delivering a 0x200
interrupt to guest). This enables QEMU to build error
log and deliver machine check exception to guest via
guest registered machine check handler.
This approach simplifies the delivery of machine
check exception to guest OS compared to the earlier
approach of KVM directly invoking 0x200 guest interrupt
vector.
This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:
https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
Note:
This patch now directly invokes machine_check_print_event_info()
from kvmppc_handle_exit_hv() to print the event to host console
at the time of guest exit before the exception is passed on to the
guest. Hence, the host-side handling which was performed earlier
via machine_check_fwnmi is removed.
The reasons for this approach is (i) it is not possible
to distinguish whether the exception occurred in the
guest or the host from the pt_regs passed on the
machine_check_exception(). Hence machine_check_exception()
calls panic, instead of passing on the exception to
the guest, if the machine check exception is not
recoverable. (ii) the approach introduced in this
patch gives opportunity to the host kernel to perform
actions in virtual mode before passing on the exception
to the guest. This approach does not require complex
tweaks to machine_check_fwnmi and friends.
Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Idle code now always runs at the 0xc... effective address whether
in real or virtual mode. This means rfid can be ditched, along
with a lot of SRR manipulations.
In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change
MSR states as required.
This also balances the return prediction for the idle call, by
doing blr rather than rfid to return to the idle caller.
On POWER9, 2-process context switch on different cores, with snooze
disabled, increases performance by 2%.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate v2 fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to allow us to use a different value for the HFSCR
(Hypervisor Facilities Status and Control Register) when running the
guest from that which applies in the host. The reason for doing this
is to allow us to trap the msgsndp instruction and related operations
in future so that they can be virtualized. We also save the value of
HFSCR when a hypervisor facility unavailable interrupt occurs, because
the high byte of HFSCR indicates which facility the guest attempted to
access.
We save and restore the host value on guest entry/exit because some
bits of it affect host userspace execution.
We only do all this on POWER9, not on POWER8, because we are not
intending to virtualize any of the facilities controlled by HFSCR on
POWER8. In particular, the HFSCR bit that controls execution of
msgsndp and related operations does not exist on POWER8. The HFSCR
doesn't exist at all on POWER7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows userspace (e.g. QEMU) to enable large decrementer mode for
the guest when running on a POWER9 host, by setting the LPCR_LD bit in
the guest LPCR value. With this, the guest exit code saves 64 bits of
the guest DEC value on exit. Other places that use the guest DEC
value check the LPCR_LD bit in the guest LPCR value, and if it is set,
omit the 32-bit sign extension that would otherwise be done.
This doesn't change the DEC emulation used by PR KVM because PR KVM
is not supported on POWER9 yet.
This is partly based on an earlier patch by Oliver O'Halloran.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present, HV KVM on POWER8 and POWER9 machines loses any instruction
or data breakpoint set in the host whenever a guest is run.
Instruction breakpoints are currently only used by xmon, but ptrace
and the perf_event subsystem can set data breakpoints as well as xmon.
To fix this, we save the host values of the debug registers (CIABR,
DAWR and DAWRX) before entering the guest and restore them on exit.
To provide space to save them in the stack frame, we expand the stack
frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.
Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This restores several special-purpose registers (SPRs) to sane values
on guest exit that were missed before.
TAR and VRSAVE are readable and writable by userspace, and we need to
save and restore them to prevent the guest from potentially affecting
userspace execution (not that TAR or VRSAVE are used by any known
program that run uses the KVM_RUN ioctl). We save/restore these
in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
FSCR affects userspace execution in that it can prohibit access to
certain facilities by userspace. We restore it to the normal value
for the task on exit from the KVM_RUN ioctl.
IAMR is normally 0, and is restored to 0 on guest exit. However,
with a radix host on POWER9, it is set to a value that prevents the
kernel from executing user-accessible memory. On POWER9, we save
IAMR on guest entry and restore it on guest exit to the saved value
rather than 0. On POWER8 we continue to set it to 0 on guest exit.
PSPB is normally 0. We restore it to 0 on guest exit to prevent
userspace taking advantage of the guest having set it non-zero
(which would allow userspace to set its SMT priority to high).
UAMOR is normally 0. We restore it to 0 on guest exit to prevent
the AMR from being used as a covert channel between userspace
processes, since the AMR is not context-switched at present.
Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 introduces a new mode for the decrementer register, called
large decrementer mode, in which the decrementer counter is 56 bits
wide rather than 32, and reads are sign-extended rather than
zero-extended. For the decrementer, this new mode is optional and
controlled by a bit in the LPCR. The hypervisor decrementer (HDEC)
is 56 bits wide on POWER9 and has no mode control.
Since KVM code reads and writes the decrementer and hypervisor
decrementer registers in a few places, it needs to be aware of the
need to treat the decrementer value as a 64-bit quantity, and only do
a 32-bit sign extension when large decrementer mode is not in effect.
Similarly, the HDEC should always be treated as a 64-bit quantity on
POWER9. We define a new EXTEND_HDEC macro to encapsulate the feature
test for POWER9 and the sign extension.
To enable the sign extension to be removed in large decrementer mode,
we test the LPCR_LD bit in the host LPCR image stored in the struct
kvm for the guest. If is set then large decrementer mode is enabled
and the sign extension should be skipped.
This is partly based on an earlier patch by Oliver O'Halloran.
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.
This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>