We walk the userspace page tables to discover what mapping size was
used there. However, this can race against the userspace tables
being freed, and we end up in the weeds.
Thankfully, the mm code is being generous and will IPI us when
doing so. So let's implement our part of the bargain and disable
interrupts around the walk. This ensures that nothing terrible
happens during that time.
We still need to handle the removal of the page tables before
the walk. For that, allow get_user_mapping_size() to return an
error, and make sure this error can be propagated all the way
to the exit handler.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230316174546.3777507-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
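A minimal sketch of the resulting pattern, with the S1 walk itself
elided and the error value illustrative rather than the exact
upstream choice:

  static int get_user_mapping_size(struct kvm *kvm, u64 addr)
  {
          unsigned long flags;
          int level = -1;         /* no leaf found yet */

          /*
           * Interrupts off: the mm code sends an IPI to all CPUs
           * before freeing userspace page tables, so the tables
           * cannot disappear while we walk them.
           */
          local_irq_save(flags);
          /* ... walk the userspace tables, recording the leaf level ... */
          local_irq_restore(flags);

          if (level < 0)
                  return -EFAULT; /* tables were freed before the walk */

          return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
  }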
All kvm_arch_vm_ioctl() implementations now only deal with "int"
types as return values, so we can change the return type of these
functions to use "int" instead of "long".
Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20230208140105.655814-7-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On success, this function returns the number of bytes handled.
However, this does not work for large values: the function is called
from kvm_arch_vm_ioctl() (which still returns a long), which in turn
is called from kvm_vm_ioctl() in virt/kvm/kvm_main.c. And that function
stores the return value in an "int r" variable. So the upper 32-bits
of the "long" return value are lost there.
KVM ioctl functions should only return "int" values, so let's limit
the number of bytes that can be requested here to INT_MAX to avoid
the problem with the truncated return value. We can then also change
the return type of the function to "int" to make it clearer that it
is not possible to return a "long" here.
Fixes: f0376edb1d ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Message-Id: <20230208140105.655814-5-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
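The shape of the fix, sketched below; the ioctl structure is the real
one, but the body is abbreviated:

  static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
                                        struct kvm_arm_copy_mte_tags *copy_tags)
  {
          unsigned long length = copy_tags->length;
          long ret = 0;

          /* Refuse any request whose byte count can't be returned as an int. */
          if (length > INT_MAX)
                  return -EINVAL;

          /* ... copy the tags, accumulating the handled bytes in 'ret' ... */

          return ret;             /* now guaranteed to fit in an int */
  }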
Read mmu_invalidate_seq before dropping the mmap_lock so that KVM can
detect if the results of vma_lookup() (e.g. vma_shift) become stale
before it acquires kvm->mmu_lock. This fixes a theoretical bug where a
VMA could be changed by userspace after vma_lookup() and before KVM
reads the mmu_invalidate_seq, causing KVM to install page table entries
based on a (possibly) no-longer-valid vma_shift.
Re-order the MMU cache top-up to earlier in user_mem_abort() so that it
is not done after KVM has read mmu_invalidate_seq (i.e. so as to avoid
inducing spurious fault retries).
This bug has existed since KVM/ARM's inception. It's unlikely that any
sane userspace currently modifies VMAs in such a way as to trigger this
race. And even with directed testing I was unable to reproduce it. But a
sufficiently motivated host userspace might be able to exploit this
race.
Fixes: 94f8e6418d ("KVM: ARM: Handle guest faults in KVM")
Cc: stable@vger.kernel.org
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230313235454.2964067-1-dmatlack@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
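A condensed sketch of the resulting ordering in user_mem_abort(); the
fault path and locking are heavily abbreviated:

  mmap_read_lock(current->mm);
  vma = vma_lookup(current->mm, hva);
  /* ... derive vma_shift and friends from the VMA ... */

  /* Sample the notifier sequence before the VMA can change under us. */
  mmu_seq = vcpu->kvm->mmu_invalidate_seq;
  smp_rmb();                      /* pairs with the invalidate side */
  mmap_read_unlock(current->mm);

  /* ... top up the MMU cache, resolve the fault ... */

  read_lock(&vcpu->kvm->mmu_lock);
  if (mmu_invalidate_retry(vcpu->kvm, mmu_seq))
          goto out_unlock;        /* stale vma_shift: retry the fault */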
Presently, when a guest writes 1 to PMCR_EL0.{C,P}, which is WO/RAZ,
KVM saves the register value, including these bits.
When userspace reads the register using KVM_GET_ONE_REG, KVM returns
the saved register value as it is (the saved value might have these
bits set). This could result in userspace setting these bits on the
destination during migration. Consequently, KVM may end up resetting
the vPMU counter registers (PMCCNTR_EL0 and/or PMEVCNTR<n>_EL0) to
zero on the first KVM_RUN after migration.
Fix this by not saving these bits when a guest writes 1 to them.
Fixes: ab9468340d ("arm64: KVM: Add access handler for PMCR register")
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230313033234.1475987-1-reijiw@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
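Sketch of the PMCR_EL0 write handler after the fix; the reset side
effects are abbreviated:

  u64 val = p->regval;

  /* Act on the requested resets... */
  if (val & ARMV8_PMU_PMCR_C)
          kvm_pmu_set_counter_value(vcpu, ARMV8_PMU_CYCLE_IDX, 0);
  /* ... likewise for ARMV8_PMU_PMCR_P and the event counters ... */

  /* ...but never let the WO/RAZ bits reach the saved value. */
  val &= ~(ARMV8_PMU_PMCR_C | ARMV8_PMU_PMCR_P);
  __vcpu_sys_reg(vcpu, PMCR_EL0) = val;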
Have KVM_GET_ONE_REG for vPMU counter (vPMC) registers (PMCCNTR_EL0
and PMEVCNTR<n>_EL0) return the sum of the register value in the sysreg
file and the current perf event counter value.
Values of vPMC registers are saved in sysreg files on certain occasions.
These saved values don't represent the current values of the vPMC
registers if the perf events for the vPMCs count events after the save.
The current values of those registers are the sum of the sysreg file
value and the current perf event counter value. But, when userspace
reads those registers (using KVM_GET_ONE_REG), KVM returns the sysreg
file value to userspace (not the sum value).
Fix this to return the sum value for KVM_GET_ONE_REG.
Fixes: 051ff581ce ("arm64: KVM: Add access handler for event counter register")
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230313033208.1475499-1-reijiw@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
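Roughly what the read path becomes; the helper name here is made up
for illustration:

  static u64 vpmc_current_value(struct kvm_vcpu *vcpu,
                                struct kvm_pmc *pmc, u64 reg)
  {
          u64 enabled, running;
          u64 counter = __vcpu_sys_reg(vcpu, reg);

          /* Add whatever the backing perf event counted since the save. */
          if (pmc->perf_event)
                  counter += perf_event_read_value(pmc->perf_event,
                                                   &enabled, &running);

          return counter;
  }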
Having a per-vcpu virtual offset is a pain. It needs to be synchronized
on each update, and expands badly to a setup where different timers can
have different offsets, or have composite offsets (as with NV).
So let's start by replacing the use of the CNTVOFF_EL2 shadow register
(which we want to reclaim for NV anyway), and make the virtual timer
carry a pointer to a VM-wide offset.
This simplifies the code significantly. It also addresses two terrible bugs:
- The use of CNTVOFF_EL2 leads to some nice offset corruption
when the sysreg gets reset, as reported by Joey.
- The kvm mutex is taken from a vcpu ioctl, which goes against
the locking rules...
Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230224173915.GA17407@e124191.cambridge.arm.com
Tested-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20230224191640.3396734-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
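The resulting data layout, roughly (simplified from the actual patch):

  struct arch_timer_offset {
          /*
           * The timer context no longer owns its offset; it points
           * at a single value hanging off the VM.
           */
          u64     *vm_offset;
  };

  static u64 timer_get_offset(struct arch_timer_context *ctxt)
  {
          if (ctxt->offset.vm_offset)
                  return *ctxt->offset.vm_offset;

          return 0;
  }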
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogeneous systems. Non-secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecturally legal but unlikely-in-practice
scenarios
- Mark APIC timer as expired if it's in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlightened MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Support for arm64 SME 2 and 2.1. SME2 introduces a new 512-bit
architectural register (ZT0, for the look-up table feature) that
Linux needs to save/restore
- Include TPIDR2 in the signal context and add the corresponding
kselftests
- Perf updates: Arm SPEv1.2 support, HiSilicon uncore PMU updates, ACPI
support to the Marvell DDR and TAD PMU drivers, reset DTM_PMU_CONFIG
(ARM CMN) at probe time
- Support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64
- Permit EFI boot with MMU and caches on. Instead of cleaning the
entire loaded kernel image to the PoC and disabling the MMU and
caches before branching to the kernel bare metal entry point, leave
the MMU and caches enabled and rely on EFI's cacheable 1:1 mapping of
all of system RAM to populate the initial page tables
- Expose the AArch32 (compat) ELF_HWCAP features to user in an arm64
kernel (the arm32 kernel only defines the values)
- Harden the arm64 shadow call stack pointer handling: stash the shadow
stack pointer in the task struct on interrupt, load it directly from
this structure
- Signal handling cleanups to remove redundant validation of size
information and avoid reading the same data from userspace twice
- Refactor the hwcap macros to make use of the automatically generated
ID registers. It should make new hwcaps writing less error prone
- Further arm64 sysreg conversion and some fixes
- arm64 kselftest fixes and improvements
- Pointer authentication cleanups: don't sign leaf functions, unify
asm-arch manipulation
- Pseudo-NMI code generation optimisations
- Minor fixes for SME and TPIDR2 handling
- Miscellaneous updates: ARCH_FORCE_MAX_ORDER is now selectable,
replace strtobool() to kstrtobool() in the cpufeature.c code, apply
dynamic shadow call stack in two passes, intercept pfn changes in
set_pte_at() without the required break-before-make sequence, attempt
to dump all instructions on unhandled kernel faults
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (130 commits)
arm64: fix .idmap.text assertion for large kernels
kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
kselftest/arm64: Copy whole EXTRA context
arm64: kprobes: Drop ID map text from kprobes blacklist
perf: arm_spe: Print the version of SPE detected
perf: arm_spe: Add support for SPEv1.2 inverted event filtering
perf: Add perf_event_attr::config3
arm64/sme: Fix __finalise_el2 SMEver check
drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
arm64/signal: Only read new data when parsing the ZT context
arm64/signal: Only read new data when parsing the ZA context
arm64/signal: Only read new data when parsing the SVE context
arm64/signal: Avoid rereading context frame sizes
arm64/signal: Make interface for restore_fpsimd_context() consistent
arm64/signal: Remove redundant size validation from parse_user_sigframe()
arm64/signal: Don't redundantly verify FPSIMD magic
arm64/cpufeature: Use helper macros to specify hwcaps
arm64/cpufeature: Always use symbolic name for feature value in hwcaps
arm64/sysreg: Initial unsigned annotations for ID registers
arm64/sysreg: Initial annotation of signed ID registers
...
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogeneous systems. Non-secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add myself as
co-maintainer
This also drags in a couple of branches to avoid conflicts:
- The shared 'kvm-hw-enable-refactor' branch that reworks
initialization, as it conflicted with the virtual cache topology
changes.
- arm64's 'for-next/sme2' branch, because both it and the PSCI relay
changes touch the EL2 initialization code.
Merge tag 'kvmarm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.3
* kvm-arm64/nv-prefix:
: Preamble to NV support, courtesy of Marc Zyngier.
:
: This brings in a set of prerequisite patches for supporting nested
: virtualization in KVM/arm64. Of course, there is a long way to go until
: NV is actually enabled in KVM.
:
: - Introduce cpucap / vCPU feature flag to pivot the NV code on
:
: - Add support for EL2 vCPU register state
:
: - Basic nested exception handling
:
: - Hide unsupported features from the ID registers for NV-capable VMs
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
arm64: Add ARM64_HAS_NESTED_VIRT cpufeature
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Convert CPACR_EL1_TTA to the new, generated system register
: definitions.
:
: - Serialize toggling CPACR_EL1.SMEN to avoid unexpected exceptions when
: accessing SVCR in the host.
:
: - Avoid quiescing the guest if a vCPU accesses its own redistributor's
: SGIs/PPIs, eliminating the need to IPI. Largely an optimization for
: nested virtualization, as the L1 accesses the affected registers
: rather often.
:
: - Conversion to kstrtobool()
:
: - Common definition of INVALID_GPA across architectures
:
: - Enable CONFIG_USERFAULTFD for CI runs of KVM selftests
KVM: arm64: Fix non-kerneldoc comments
KVM: selftests: Enable USERFAULTFD
KVM: selftests: Remove redundant setbuf()
arm64/sysreg: clean up some inconsistent indenting
KVM: MMU: Make the definition of 'INVALID_GPA' common
KVM: arm64: vgic-v3: Use kstrtobool() instead of strtobool()
KVM: arm64: vgic-v3: Limit IPI-ing when accessing GICR_{C,S}ACTIVER0
KVM: arm64: Synchronize SMEN on vcpu schedule out
KVM: arm64: Kill CPACR_EL1_TTA definition
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/apple-vgic-mi:
: VGIC maintenance interrupt support for the AIC, courtesy of Marc Zyngier.
:
: The AIC provides a non-maskable VGIC maintenance interrupt, which until
: now was not supported by KVM. This series (1) allows the registration of
: a non-maskable maintenance interrupt and (2) wires in support for this
: with the AIC driver.
irqchip/apple-aic: Correctly map the vgic maintenance interrupt
irqchip/apple-aic: Register vgic maintenance interrupt with KVM
KVM: arm64: vgic: Allow registration of a non-maskable maintenance interrupt
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/psci-relay-fixes:
: Fixes for CPU on/resume with pKVM, courtesy of Quentin Perret.
:
: A consequence of deprivileging the host is that pKVM relays PSCI calls
: on behalf of the host. pKVM's CPU initialization failed to fully
: initialize the CPU's EL2 state, which notably led to unexpected SVE
: traps resulting in a hyp panic.
:
: The issue is addressed by reusing parts of __finalise_el2 to restore CPU
: state in the PSCI relay.
KVM: arm64: Finalise EL2 state from pKVM PSCI relay
KVM: arm64: Use sanitized values in __check_override in nVHE
KVM: arm64: Introduce finalise_el2_state macro
KVM: arm64: Provide sanitized SYS_ID_AA64SMFR0_EL1 to nVHE
* kvm-arm64/nv-timer-improvements:
: Timer emulation improvements, courtesy of Marc Zyngier.
:
: - Avoid re-arming an hrtimer for a guest timer that is already pending
:
: - Only reload the affected timer context when emulating a sysreg access
: instead of both the virtual/physical timers.
KVM: arm64: timers: Don't BUG() on unhandled timer trap
KVM: arm64: Reduce overhead of trapped timer sysreg accesses
KVM: arm64: Don't arm a hrtimer for an already pending timer
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/parallel-access-faults:
: Parallel stage-2 access fault handling
:
: The parallel faults changes that went into 6.2 covered most stage-2
: aborts, with the exception of stage-2 access faults. Building on top of
: the new infrastructure, this series adds support for handling access
: faults (i.e. updating the access flag) in parallel.
:
: This is expected to provide a performance uplift for cores that do not
: implement FEAT_HAFDBS, such as those from the fruit company.
KVM: arm64: Condition HW AF updates on config option
KVM: arm64: Handle access faults behind the read lock
KVM: arm64: Don't serialize if the access flag isn't set
KVM: arm64: Return EAGAIN for invalid PTE in attr walker
KVM: arm64: Ignore EAGAIN for walks outside of a fault
KVM: arm64: Use KVM's pte type/helpers in handle_access_fault()
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/virtual-cache-geometry:
: Virtualized cache geometry for KVM guests, courtesy of Akihiko Odaki.
:
: KVM/arm64 has always exposed the host cache geometry directly to the
: guest, even though non-secure software should never perform CMOs by
: Set/Way. This was slightly wrong, as the cache geometry was derived from
: the PE on which the vCPU thread was running and not a sanitized value.
:
: Altogether this leads to issues migrating VMs on heterogeneous
: systems, as the cache geometry saved/restored could be inconsistent.
:
: KVM/arm64 now presents 1 level of cache with 1 set and 1 way. The cache
: geometry is entirely controlled by userspace, such that migrations from
: older kernels continue to work.
KVM: arm64: Mark some VM-scoped allocations as __GFP_ACCOUNT
KVM: arm64: Normalize cache configuration
KVM: arm64: Mask FEAT_CCIDX
KVM: arm64: Always set HCR_TID2
arm64/cache: Move CLIDR macro definitions
arm64/sysreg: Add CCSIDR2_EL1
arm64/sysreg: Convert CCSIDR_EL1 to automatic generation
arm64: Allow the definition of UNKNOWN system register fields
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Avoid open-coding and just use the helper to encode the ID from the
sysreg table entry.
No functional change intended.
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230211190742.49843-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As there are a number of features that we either can't support,
or don't want to support right away with NV, let's add some
basic filtering so that we don't advertise silly things to the
EL2 guest.
Whilst we are at it, advertise FEAT_TTL as well as FEAT_GTG, which
the NV implementation will provide.
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-18-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
With the HCR_EL2.NV bit set, accesses to EL12 registers in the virtual
EL2 trap to EL2. Handle those traps just like we do for EL1 registers.
One exception is CNTKCTL_EL12. We don't trap on CNTKCTL_EL1 for non-VHE
virtual EL2 because we don't have to. However, accessing CNTKCTL_EL12
will trap since it's one of the EL12 registers controlled by the
HCR_EL2.NV bit. Therefore, add a handler for it and don't treat it as a
non-trapped register when preparing a shadow context.
These registers, being only a view on their EL1 counterparts, are
permanently hidden from userspace.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
[maz: EL12_REG(), register visibility]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
So far, we never needed to distinguish between registers hidden
from userspace and being hidden from a guest (they are always
either visible to both, or hidden from both).
With NV, we have the ugly case of the EL02 and EL12 registers,
which are only a view on the EL0 and EL1 registers. It makes
absolutely no sense to expose them to userspace, since it
already has the canonical view.
Add a new visibility flag (REG_HIDDEN_USER) and a new helper that
checks for it and REG_HIDDEN when checking whether to expose
a sysreg to userspace. Subsequent patches will make use of it.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
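Sketch of the new flag and check; the flag value is illustrative:

  #define REG_HIDDEN_USER         (1 << 2) /* hidden from userspace only */

  static inline bool sysreg_hidden_user(const struct kvm_vcpu *vcpu,
                                        const struct sys_reg_desc *r)
  {
          if (likely(!r->visibility))
                  return false;

          return r->visibility(vcpu, r) & (REG_HIDDEN | REG_HIDDEN_USER);
  }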
We can no longer blindly copy the VCPU's PSTATE into SPSR_EL2 and return
to the guest and vice versa when taking an exception to the hypervisor,
because we emulate virtual EL2 in EL1 and therefore have to translate
the mode field from EL2 to EL1 and vice versa.
This requires keeping track of the state in which we enter the guest,
for which we transiently use a dedicated flag.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
For the same reason we trap virtual memory register accesses at virtual
EL2, we need to trap SPSR_EL1, ELR_EL1 and VBAR_EL1 accesses. ARM v8.3
introduces the HCR_EL2.NV1 bit to be able to trap on those register
accesses in EL1. Do not set this bit until the whole nesting support is
completed, which happens further down the line...
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Non-nested guests have used the hvc instruction to initiate SMCCC
calls into KVM. This is quite a poor fit for NV as hvc exceptions are
always taken to EL2. In other words, KVM needs to unconditionally
forward the hvc exception back into vEL2 to uphold the architecture.
Instead, treat the smc instruction from vEL2 as we would a guest
hypercall, thereby allowing the vEL2 to interact with KVM's hypercall
surface. Note that on NV-capable hardware HCR_EL2.TSC causes smc
instructions executed in non-secure EL1 to trap to EL2, even if EL3 is
not implemented.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-13-maz@kernel.org
[Oliver: redo commit message, only handle smc from vEL2]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When a guest hypervisor running virtual EL2 in EL1 executes an ERET
instruction, we will have set HCR_EL2.NV which traps ERET to EL2, so
that we can emulate the exception return in software.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As we expect all PSCI calls from the L1 hypervisor to be performed
using SMC when nested virtualization is enabled, it is clear that
all HVC instructions from the VM (including from the virtual EL2)
are supposed to be handled in the virtual EL2.
Forward these to EL2 as required.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
[maz: add handling of HCR_EL2.HCD]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Support injecting exceptions and performing exception returns to and
from virtual EL2. This must be done entirely in software except when
taking an exception from vEL0 to vEL2 when the virtual HCR_EL2.{E2H,TGE}
== {1,1} (a VHE guest hypervisor).
[maz: switch to common exception injection framework, illegal exception
return handling]
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
ARM v8.3 introduces a new bit in the HCR_EL2, which is the NV bit. When
this bit is set, accessing EL2 registers in EL1 traps to EL2. In
addition, executing the following instructions in EL1 will trap to EL2:
tlbi, at, eret, and msr/mrs instructions to access SP_EL1. Most of the
instructions that trap to EL2 with the NV bit were undef at EL1 prior to
ARM v8.3. The only instruction that was not undef is eret.
This patch sets up a handler for EL2 registers and SP_EL1 register
accesses at EL1. The host hypervisor keeps those register values in
memory, and will emulate their behavior.
This patch doesn't set the NV bit yet. It will be set in a later patch
once nested virtualization support is completed.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
[maz: EL2_REG() macros]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
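The gist of the trap handler behind the EL2_REG() entries, sketched:
redirect the access to the in-memory copy of the register.

  static bool access_rw(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
                        const struct sys_reg_desc *r)
  {
          /* vEL2 registers live in memory, not in hardware. */
          if (p->is_write)
                  vcpu_write_sys_reg(vcpu, p->regval, r->reg);
          else
                  p->regval = vcpu_read_sys_reg(vcpu, r->reg);

          return true;
  }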
We were not allowing userspace to set a more privileged mode for the VCPU
than EL1, but we should allow this when nested virtualization is enabled
for the VCPU.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reset the VCPU with PSTATE.M = EL2h when the nested virtualization
feature is enabled on the VCPU.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: rework register reset not to use empty data structures]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Most of our S2 helpers take a kvm_s2_mmu pointer, but quickly
revert to using the kvm structure. By doing so, we lose track
of which S2 MMU context we were initially using, and fall back
to the "canonical" context.
If we were trying to unmap an S2 context managed by a guest hypervisor,
we end up parsing the wrong set of page tables, and bad stuff happens
(as this is often happening on the back of a trapped TLBI from the
guest hypervisor).
Instead, make sure we always use the provided MMU context all the way.
This has no impact on non-NV, as we always pass the canonical MMU
context.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20230209175820.1939006-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add a new ARM64_HAS_NESTED_VIRT feature to indicate that the
CPU has the ARMv8.3 nested virtualization capability, together
with the 'kvm-arm.mode=nested' command line option.
This will be used to support nested virtualization in KVM.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: moved the command-line option to kvm-arm.mode]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* arm64/for-next/perf:
perf: arm_spe: Print the version of SPE detected
perf: arm_spe: Add support for SPEv1.2 inverted event filtering
perf: Add perf_event_attr::config3
drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event
perf: arm_spe: Use new PMSIDR_EL1 register enums
perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors
arm64/sysreg: Convert SPE registers to automatic generation
arm64: Drop SYS_ from SPE register defines
perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines
perf/marvell: Add ACPI support to TAD uncore driver
perf/marvell: Add ACPI support to DDR uncore driver
perf/arm-cmn: Reset DTM_PMU_CONFIG at probe
drivers/perf: hisi: Extract initialization of "cpa_pmu->pmu"
drivers/perf: hisi: Simplify the parameters of hisi_pmu_init()
drivers/perf: hisi: Advertise the PERF_PMU_CAP_NO_EXCLUDE capability
* for-next/sysreg:
: arm64 sysreg and cpufeature fixes/updates
KVM: arm64: Use symbolic definition for ISR_EL1.A
arm64/sysreg: Add definition of ISR_EL1
arm64/sysreg: Add definition for ICC_NMIAR1_EL1
arm64/cpufeature: Remove 4 bit assumption in ARM64_FEATURE_MASK()
arm64/sysreg: Fix errors in 32 bit enumeration values
arm64/cpufeature: Fix field sign for DIT hwcap detection
* for-next/sme:
: SME-related updates
arm64/sme: Optimise SME exit on syscall entry
arm64/sme: Don't use streaming mode to probe the maximum SME VL
arm64/ptrace: Use system_supports_tpidr2() to check for TPIDR2 support
* for-next/kselftest: (23 commits)
: arm64 kselftest fixes and improvements
kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
kselftest/arm64: Copy whole EXTRA context
kselftest/arm64: Fix enumeration of systems without 128 bit SME for SSVE+ZA
kselftest/arm64: Fix enumeration of systems without 128 bit SME
kselftest/arm64: Don't require FA64 for streaming SVE tests
kselftest/arm64: Limit the maximum VL we try to set via ptrace
kselftest/arm64: Correct buffer size for SME ZA storage
kselftest/arm64: Remove the local NUM_VL definition
kselftest/arm64: Verify simultaneous SSVE and ZA context generation
kselftest/arm64: Verify that SSVE signal context has SVE_SIG_FLAG_SM set
kselftest/arm64: Remove spurious comment from MTE test Makefile
kselftest/arm64: Support build of MTE tests with clang
kselftest/arm64: Initialise current at build time in signal tests
kselftest/arm64: Don't pass headers to the compiler as source
kselftest/arm64: Remove redundant _start labels from FP tests
kselftest/arm64: Fix .pushsection for strings in FP tests
kselftest/arm64: Run BTI selftests on systems without BTI
kselftest/arm64: Fix test numbering when skipping tests
kselftest/arm64: Skip non-power of 2 SVE vector lengths in fp-stress
kselftest/arm64: Only enumerate power of two VLs in syscall-abi
...
* for-next/misc:
: Miscellaneous arm64 updates
arm64/mm: Intercept pfn changes in set_pte_at()
Documentation: arm64: correct spelling
arm64: traps: attempt to dump all instructions
arm64: Apply dynamic shadow call stack patching in two passes
arm64: el2_setup.h: fix spelling typo in comments
arm64: Kconfig: fix spelling
arm64: cpufeature: Use kstrtobool() instead of strtobool()
arm64: Avoid repeated AA64MMFR1_EL1 register read on pagefault path
arm64: make ARCH_FORCE_MAX_ORDER selectable
* for-next/sme2: (23 commits)
: Support for arm64 SME 2 and 2.1
arm64/sme: Fix __finalise_el2 SMEver check
kselftest/arm64: Remove redundant _start labels from zt-test
kselftest/arm64: Add coverage of SME 2 and 2.1 hwcaps
kselftest/arm64: Add coverage of the ZT ptrace regset
kselftest/arm64: Add SME2 coverage to syscall-abi
kselftest/arm64: Add test coverage for ZT register signal frames
kselftest/arm64: Teach the generic signal context validation about ZT
kselftest/arm64: Enumerate SME2 in the signal test utility code
kselftest/arm64: Cover ZT in the FP stress test
kselftest/arm64: Add a stress test program for ZT0
arm64/sme: Add hwcaps for SME 2 and 2.1 features
arm64/sme: Implement ZT0 ptrace support
arm64/sme: Implement signal handling for ZT
arm64/sme: Implement context switching for ZT0
arm64/sme: Provide storage for ZT0
arm64/sme: Add basic enumeration for SME2
arm64/sme: Enable host kernel to access ZT0
arm64/sme: Manually encode ZT0 load and store instructions
arm64/esr: Document ISS for ZT0 being disabled
arm64/sme: Document SME 2 and SME 2.1 ABI
...
* for-next/tpidr2:
: Include TPIDR2 in the signal context
kselftest/arm64: Add test case for TPIDR2 signal frame records
kselftest/arm64: Add TPIDR2 to the set of known signal context records
arm64/signal: Include TPIDR2 in the signal context
arm64/sme: Document ABI for TPIDR2 signal information
* for-next/scs:
: arm64: harden shadow call stack pointer handling
arm64: Stash shadow stack pointer in the task struct on interrupt
arm64: Always load shadow stack pointer directly from the task struct
* for-next/compat-hwcap:
: arm64: Expose compat ARMv8 AArch32 features (HWCAPs)
arm64: Add compat hwcap SSBS
arm64: Add compat hwcap SB
arm64: Add compat hwcap I8MM
arm64: Add compat hwcap ASIMDBF16
arm64: Add compat hwcap ASIMDFHM
arm64: Add compat hwcap ASIMDDP
arm64: Add compat hwcap FPHP and ASIMDHP
* for-next/ftrace:
: Add arm64 support for DYNAMIC_FTRACE_WITH_CALL_OPS
arm64: avoid executing padding bytes during kexec / hibernation
arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
arm64: ftrace: Update stale comment
arm64: patching: Add aarch64_insn_write_literal_u64()
arm64: insn: Add helpers for BTI
arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
ACPI: Don't build ACPICA with '-Os'
Compiler attributes: GCC cold function alignment workarounds
ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
* for-next/efi-boot-mmu-on:
: Permit arm64 EFI boot with MMU and caches on
arm64: kprobes: Drop ID map text from kprobes blacklist
arm64: head: Switch endianness before populating the ID map
efi: arm64: enter with MMU and caches enabled
arm64: head: Clean the ID map and the HYP text to the PoC if needed
arm64: head: avoid cache invalidation when entering with the MMU on
arm64: head: record the MMU state at primary entry
arm64: kernel: move identity map out of .text mapping
arm64: head: Move all finalise_el2 calls to after __enable_mmu
* for-next/ptrauth:
: arm64 pointer authentication cleanup
arm64: pauth: don't sign leaf functions
arm64: unify asm-arch manipulation
* for-next/pseudo-nmi:
: Pseudo-NMI code generation optimisations
arm64: irqflags: use alternative branches for pseudo-NMI logic
arm64: add ARM64_HAS_GIC_PRIO_RELAXED_SYNC cpucap
arm64: make ARM64_HAS_GIC_PRIO_MASKING depend on ARM64_HAS_GIC_CPUIF_SYSREGS
arm64: rename ARM64_HAS_IRQ_PRIO_MASKING to ARM64_HAS_GIC_PRIO_MASKING
arm64: rename ARM64_HAS_SYSREG_GIC_CPUIF to ARM64_HAS_GIC_CPUIF_SYSREGS
Generally speaking, any memory allocations that can be associated with a
particular VM should be charged to the cgroup of its process.
Nonetheless, there are a couple of spots in KVM/arm64 that aren't currently
accounted:
- the ccsidr array containing the virtualized cache hierarchy
- the cpumask of supported cpus, for use of the vPMU on heterogeneous
systems
Go ahead and set __GFP_ACCOUNT for these allocations.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230206235229.4174711-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
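Each call site change is a one-liner in spirit; for instance (variable
names abbreviated):

  /* Charge the virtualized cache hierarchy to the VM's cgroup. */
  ccsidr = kmalloc_array(nr_entries, sizeof(u32), GFP_KERNEL_ACCOUNT);
  if (!ccsidr)
          return -ENOMEM;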
The robots amongst us have started spitting out irritating emails about
random errors such as:
<quote>
arch/arm64/kvm/arm.c:2207: warning: expecting prototype for Initialize Hyp().
Prototype was for kvm_arm_init() instead
</quote>
which makes little sense until you finally grok what they are on about:
comments that look like a kerneldoc, but that aren't.
Let's address this before I get even more irritated... ;-)
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/63e139e1.J5AHO6vmxaALh7xv%25lkp@intel.com
Link: https://lore.kernel.org/r/20230207094321.1238600-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The EL2 state is not initialised correctly when a CPU comes out of
CPU_{SUSPEND,OFF} as the finalise_el2 function is not being called.
Let's directly call finalise_el2_state from this path to solve the
issue.
Fixes: 504ee23611 ("arm64: Add the arm64.nosve command line option")
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20230201103755.1398086-5-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We will need a sanitized copy of SYS_ID_AA64SMFR0_EL1 in the nVHE EL2
code shortly, so make sure to provide it with one.
Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230201103755.1398086-2-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We don't have a running VCPU context to save the vgic3 pending table
when handling the KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES}
command on the KVM device "kvm-arm-vgic-v3". The unknown case is
caught by kvm-unit-tests:
# ./kvm-unit-tests/tests/its-pending-migration
WARNING: CPU: 120 PID: 7973 at arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3325 \
mark_page_dirty_in_slot+0x60/0xe0
:
mark_page_dirty_in_slot+0x60/0xe0
__kvm_write_guest_page+0xcc/0x100
kvm_write_guest+0x7c/0xb0
vgic_v3_save_pending_tables+0x148/0x2a0
vgic_set_common_attr+0x158/0x240
vgic_v3_set_attr+0x4c/0x5c
kvm_device_ioctl+0x100/0x160
__arm64_sys_ioctl+0xa8/0xf0
invoke_syscall.constprop.0+0x7c/0xd0
el0_svc_common.constprop.0+0x144/0x160
do_el0_svc+0x34/0x60
el0_svc+0x3c/0x1a0
el0t_64_sync_handler+0xb4/0x130
el0t_64_sync+0x178/0x17c
Use vgic_write_guest_lock() to save the vgic3 pending table.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230126235451.469087-5-gshan@redhat.com
We don't have a running VCPU context to restore the vgic3 LPI pending
status when handling the KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES}
command on the KVM device "kvm-arm-vgic-its".
Use vgic_write_guest_lock() to restore the vgic3 LPI pending status.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230126235451.469087-4-gshan@redhat.com
Currently, the unknown no-running-vcpu sites are reported when a
dirty page is tracked by mark_page_dirty_in_slot(). Until now, the
only known no-running-vcpu site is saving vgic/its tables through
KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on KVM device
"kvm-arm-vgic-its". Unfortunately, there are more unknown sites to
be handled and no-running-vcpu context will be allowed in these
sites: (1) KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} command
on KVM device "kvm-arm-vgic-its" to restore vgic/its tables. The
vgic3 LPI pending status could be restored. (2) Save vgic3 pending
table through KVM_DEV_ARM_{VGIC_GRP_CTRL, VGIC_SAVE_PENDING_TABLES}
command on KVM device "kvm-arm-vgic-v3".
In order to handle those unknown cases, we need a unified helper
vgic_write_guest_lock(). struct vgic_dist::save_its_tables_in_progress
is also renamed to struct vgic_dist::save_tables_in_progress.
No functional change intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230126235451.469087-3-gshan@redhat.com
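The helper, roughly as described (lockdep/SRCU annotations elided):

  static inline int vgic_write_guest_lock(struct kvm *kvm, gpa_t gpa,
                                          const void *data, unsigned long len)
  {
          struct vgic_dist *dist = &kvm->arch.vgic;
          int ret;

          /* Mark this no-running-vcpu write as expected. */
          dist->save_tables_in_progress = true;
          ret = kvm_write_guest(kvm, gpa, data, len);
          dist->save_tables_in_progress = false;

          return ret;
  }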
Although not handling a trap is a pretty bad situation to be in,
panicking the kernel isn't useful and provides no valuable
information to help debug the situation.
Instead, dump the encoding of the unhandled sysreg, and inject
an UNDEF in the guest. At least, this gives a user an opportunity
to report the issue with some information to help debugging it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112123829.458912-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Each read/write to a trapped timer system register results
in a whole kvm_timer_vcpu_put/load() cycle which affects all
of the timers, and a bit more.
There is no need for such a thing, and we can limit the impact
to the timer being affected, and only this one.
This drastically simplifies the emulated case, and limits the
damage for trapped accesses. This also brings some performance
back for NV.
Whilst we're at it, fix a comment that didn't quite capture why
we always set CNTVOFF_EL2 to 0 when disabling the virtual timer.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112123829.458912-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When fully emulating a timer, we back it with a hrtimer that is
armed on vcpu_load(). However, we do this even if the timer is
already pending.
This causes spurious interrupts to be taken, though the guest
doesn't observe them (the interrupt is already pending).
Although this is a waste of precious cycles, this isn't the
end of the world with the current state of KVM. However, this
can lead to a situation where a guest doesn't make forward
progress anymore with NV.
Fix it by checking whether the timer is already pending
before arming a new hrtimer. Also drop the hrtimer cancelling,
which is useless by construction.
Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Fixes: bee038a674 ("KVM: arm/arm64: Rework the timer code to use a timer_map")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112123829.458912-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
ARM:
* Fix the PMCR_EL0 reset value after the PMU rework
* Correctly handle S2 fault triggered by a S1 page table walk
by not always classifying it as a write, as this breaks on
R/O memslots
* Document why we cannot exit with KVM_EXIT_MMIO when taking
a write fault from a S1 PTW on a R/O memslot
* Put the Apple M2 on the naughty list for not being able to
correctly implement the vgic SEIS feature, just like the M1
before it
* Reviewer updates: Alex is stepping down, replaced by Zenghui
x86:
* Fix various rare locking issues in Xen emulation and teach lockdep
to detect them
* Documentation improvements
* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
Before this change, the cache configuration of the physical CPU was
exposed to vcpus. This is problematic because the cache configuration a
vcpu sees varies when it migrates between physical CPUs with different
cache configurations.
Fabricate cache configuration from the sanitized value, which holds the
CTR_EL0 value userspace sees regardless of which physical CPU the vcpu
resides on.
CLIDR_EL1 and CCSIDR_EL1 are now writable from userspace so that
the VMM can restore the values saved with the old kernel.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Link: https://lore.kernel.org/r/20230112023852.42012-8-akihiko.odaki@daynix.com
[ Oliver: Squash Marc's fix for CCSIDR_EL1.LineSize when set from userspace ]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
To save the vgic LPI pending state with GICv4.1, the VPEs must all be
unmapped from the ITSs so that the sGIC caches can be flushed.
The opposite is done once the state is saved.
This is all done by using the activate/deactivate irqdomain callbacks
directly from the vgic code. Crucially, this is done without holding
the irqdesc lock for the interrupts that represent the VPE. And these
callbacks are changing the state of the irqdesc. What could possibly
go wrong?
If a doorbell fires while we are messing with the irqdesc state,
it will acquire the lock and change the interrupt state concurrently.
Since we don't hold the lock, corruption occurs on the interrupt
state. Oh well.
While acquiring the lock would fix this (and this was Shanker's
initial approach), this is still a layering violation we could do
without. A better approach is actually to free the VPE interrupt,
do what we have to do, and re-request it.
It is more work, but this usually happens only once in the lifetime
of the VM and we don't really care about this sort of overhead.
Fixes: f66b7b151e ("KVM: arm64: GICv4.1: Try to save VLPI state in save_pending_tables")
Reported-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230118022348.4137094-1-sdonthineni@nvidia.com
Commit d77e59a8fc ("arm64: mte: Lock a page for MTE tag
initialisation") added a call to mte_clear_page_tags() in case a
prior mte_copy_tags_from_user() failed in order to avoid stale tags in
the guest page (it should have really been a separate commit).
Unfortunately, the argument passed to this function was the address of
the struct page rather than the actual page address. Fix this function
call to pass the page address instead.
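In other words (an illustrative reconstruction, not the exact diff):

  unsigned long num_tags = MTE_GRANULES_PER_PAGE;
  void *maddr = page_address(page);

  if (mte_copy_tags_from_user(maddr, tags, num_tags) != num_tags) {
          /* Wrong: mte_clear_page_tags(page);  (a struct page *)   */
          /* Right: pass the page's linear mapping address instead. */
          mte_clear_page_tags(maddr);
  }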
Fixes: d77e59a8fc ("arm64: mte: Lock a page for MTE tag initialisation")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230119170902.1574756-1-catalin.marinas@arm.com
In preparation for adding support for storage for ZT0 to the thread_struct
rename za_state to sme_state. Since ZT0 is accessible when PSTATE.ZA is
set just like ZA itself we will extend the allocation done for ZA to
cover it, avoiding the need to further expand task_struct for non-SME
tasks.
No functional changes.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20221208-arm64-sme2-v4-1-f2fa0aef982f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
KVM already has a 'GPA_INVALID' defined as (~(gpa_t)0) in kvm_types.h,
and it is used by ARM code. We do not need another definition of
'INVALID_GPA' for X86 specifically.
Instead of using the common 'GPA_INVALID' for X86, replace it with
'INVALID_GPA', and change the users of 'GPA_INVALID' so that the diff
can be smaller. Also, the name 'INVALID_GPA' tells the user we
are using an invalid GPA, while the name 'GPA_INVALID' merely
emphasizes that the GPA is an invalid one.
No functional change intended.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230105130127.866171-1-yu.c.zhang@linux.intel.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We currently have a non-standard SYS_ prefix in the constants generated
for the SPE register bitfields. Drop this in preparation for automatic
register definition generation.
The SPE mask defines were unshifted, and the SPE register field
enumerations were shifted. The autogenerated defines are the opposite,
so make the necessary adjustments.
No functional changes.
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20220825-arm-spe-v8-7-v4-2-327f860daf28@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
strtobool() is the same as kstrtobool().
However, the latter is more widely used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>).
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/f546e636c6d2bbcc0d8c4191ab98ce892fce4584.1673702763.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When a vcpu is accessing *its own* redistributor's SGIs/PPIs, there
is no point in doing a stop-the-world operation. Instead, we can
just let the access occur as we do with GICv2.
This is a very minor optimisation for a non-nesting guest, but
a potentially major one for a nesting L1 hypervisor which is
likely to access the emulated registers pretty often (on each
vcpu switch, at the very least).
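A sketch of the idea (__read_sgi_reg is a hypothetical stand-in for the
actual MMIO accessor):

  static unsigned long read_sgi_reg(struct kvm_vcpu *vcpu,
                                    struct kvm_vcpu *target, gpa_t addr)
  {
          unsigned long val;

          /* Our own redistributor: access it directly, GICv2-style. */
          if (target == vcpu)
                  return __read_sgi_reg(target, addr);

          /* Foreign redistributor: stop-the-world is still required. */
          kvm_arm_halt_guest(vcpu->kvm);
          val = __read_sgi_reg(target, addr);
          kvm_arm_resume_guest(vcpu->kvm);

          return val;
  }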
Reported-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112154840.1808595-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
If we have VHE and need to re-enable SME for the host in
kvm_arch_vcpu_put_fp(), CPACR.SMEN is modified from 0 to 1, and the
trap control for reading SVCR flips from enabled to disabled.
Synchronization is needed before reading SVCR later in
fpsimd_save(), or it may cause a sync exception that the host
cannot handle.
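Concretely, the fix boils down to an ISB between the CPACR update and
the SVCR read, along these lines (register and macro names assumed):

  /* Disable SME traps for the host (CPACR.SMEN: 0 -> 1)... */
  sysreg_clear_set(CPACR_EL1, 0, CPACR_EL1_SMEN_EL1EN);
  isb();  /* ...and synchronise the change... */
  svcr = read_sysreg_s(SYS_SVCR);  /* ...before SVCR can be read safely */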
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20221220105024.13484-1-dongbo4@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Our Apple M1/M2 friends do have a per-CPU maintenance interrupt,
but no mask to make use of it in the standard Linux framework.
Given that KVM directly drives the *source* of the interrupt and
leaves the GIC interrupt always enabled, there is no harm in tolerating
such a setup. It will become useful once we enable NV on M2 HW.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230103095022.3230946-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Since the One True Way is to use the new generated definition,
kill the KVM-specific definition of CPACR_EL1_TTA, and move
over to CPACR_ELx_TTA, hopefully for the same result.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112154803.1808559-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As it currently stands, KVM makes use of FEAT_HAFDBS unconditionally.
Use of the feature in the rest of the kernel is guarded by an associated
Kconfig option.
Align KVM with the rest of the kernel and only enable VTCR_HA when
ARM64_HW_AFDBM is enabled. This can be helpful for testing changes to
the stage-2 access fault path on Armv8.1+ implementations.
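The guard amounts to something like this (a sketch; the baseline helper
is hypothetical):

  u64 vtcr = compute_baseline_vtcr();  /* hypothetical baseline */

  /* Only use FEAT_HAFDBS when the rest of the kernel may, too. */
  if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
          vtcr |= VTCR_EL2_HA;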
Link: https://lore.kernel.org/r/20221202185156.696189-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As the underlying software walkers are able to traverse and update
stage-2 in parallel there is no need to serialize access faults.
Only take the read lock when handling an access fault.
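A simplified sketch of the resulting fault path (helper names assumed):

  /* Sketch: an access fault only marks the PTE young, which the
   * parallel-aware walker can do under the read lock. */
  static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
  {
          kvm_pte_t pte;

          read_lock(&vcpu->kvm->mmu_lock);
          pte = kvm_pgtable_stage2_mkyoung(vcpu->arch.hw_mmu->pgt, fault_ipa);
          read_unlock(&vcpu->kvm->mmu_lock);

          if (kvm_pte_valid(pte))
                  kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));
  }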
Link: https://lore.kernel.org/r/20221202185156.696189-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The page table walkers are invoked outside fault handling paths, such as
write protecting a range of memory. EAGAIN is generally used by the
walkers to retry execution due to races on a particular PTE, like taking
an access fault on a PTE being invalidated from another thread.
This early return behavior is undesirable for walkers that operate
outside a fault handler. Suppress EAGAIN and continue the walk if
operating outside a fault handler.
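The suppression can be expressed as a small predicate, roughly (flag
spelling assumed from the walker API):

  /* Sketch: -EAGAIN only terminates walks started from a fault handler. */
  static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
                                        int r)
  {
          if (r == -EAGAIN)
                  return !(walker->flags & KVM_PGTABLE_WALK_HANDLE_FAULT);

          return !r;
  }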
Link: https://lore.kernel.org/r/20221202185156.696189-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The CCSIDR access handler masks the associativity bits according to the
bit layout for processors without FEAT_CCIDX. KVM also assumes CCSIDR is
32-bit where it will be 64-bit if FEAT_CCIDX is enabled. Mask FEAT_CCIDX
so that these assumptions hold.
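For instance, when sanitising the relevant ID register (macro names
assumed from the generated sysreg definitions):

  /* Sketch: hide FEAT_CCIDX so CCSIDR_EL1 keeps the 32bit layout that
   * the trap handlers assume. */
  u64 val = read_sanitised_ftr_reg(SYS_ID_AA64MMFR2_EL1);

  val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK;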
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Link: https://lore.kernel.org/r/20230112023852.42012-7-akihiko.odaki@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Always set HCR_TID2 to trap CTR_EL0, CCSIDR2_EL1, CLIDR_EL1, and
CSSELR_EL1. This saves a few lines of code and allows us to employ
their access trap handlers for more purposes, as anticipated by the
old condition for setting HCR_TID2.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230112023852.42012-6-akihiko.odaki@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Now that we are generating ISR_EL1 we have acquired a constant for
ISR_EL1.A, use it rather than the magic number we had been using in the KVM
entry code.
Suggested-by: Marc Zyngier <maz@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20221208-arm64-isr-el1-v2-3-89f7073a1ca9@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
* kvm-arm64/s1ptw-write-fault:
: .
: Fix S1PTW fault handling that was until then always taken
: as a write. From the cover letter:
:
: "Recent developments on the EFI front have resulted in guests that
: simply won't boot if the page tables are in a read-only memslot and
: you're a bit unlucky in the way S2 gets paged in... The core
: issue is related to the fact that we treat a S1PTW as a write, which
: is close enough to what needs to be done. Until you get to RO memslots.
:
: The first patch fixes this and is definitely a stable candidate. It
: splits the faulting of page tables in two steps (RO translation fault,
: followed by a writable permission fault -- should it even happen).
: The second one documents the slightly odd behaviour of PTW writes to
: RO memslots, which do not result in a KVM_MMIO exit. The last patch is
: totally optional, only tangentially related, and randomly repainting
: stuff (maybe that's contagious, who knows)."
:
: .
KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_*
KVM: arm64: Document the behaviour of S1PTW faults on RO memslots
KVM: arm64: Fix S1PTW handling on RO memslots
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pmu-fixes-6.2:
: .
: Fix for an incredibly stupid bug in the PMU rework that went into
: 6.2. Brown paper bag time.
: .
KVM: arm64: PMU: Fix PMCR_EL0 reset value
Signed-off-by: Marc Zyngier <maz@kernel.org>
I really hoped that Apple had fixed their not-quite-a-vgic implementation
when moving from M1 to M2. Alas, it seems they didn't, and running
a buggy EFI version results in the vgic generating SErrors outside
of the guest and taking the host down.
Apply the same workaround as for M1. Yes, this is all a bit crap.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230103095022.3230946-2-maz@kernel.org
The former is an AArch32 legacy, so let's move over to the
verbose (and strictly identical) version.
This involves moving some of the #defines that were private
to KVM into the more generic esr.h.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Allow architectures to opt out of the generic hardware enabling logic,
and opt out on both s390 and PPC, which don't need to manually enable
virtualization as it's always on (when available).
In addition to letting s390 and PPC drop a bit of dead code, this will
hopefully also allow ARM to clean up its related code, e.g. ARM has its
own per-CPU flag to track which CPUs have enabled hardware due to the
need to keep hardware enabled indefinitely when pKVM is enabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-50-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyway, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
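The idiom in question is the standard one:

  /* Must come before any printk-related include. */
  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

  #include <linux/printk.h>

  /* pr_err("bad MSR\n") now prints e.g. "kvm_intel: bad MSR". */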
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_check_processor_compat() and its support code now that all
architecture implementations are nops.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tag kvm_arm_init() and its unique helper as __init, and tag data that is
only ever modified under the kvm_arm_init() umbrella as read-only after
init.
Opportunistically name the boolean param in kvm_timer_hyp_init()'s
prototype to match its definition.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do arm/arch specific initialization directly in arm's module_init(), now
called kvm_arm_init(), instead of bouncing through kvm_init() to reach
kvm_arch_init(). Invoking kvm_arch_init() is the very first action
performed by kvm_init(), so from an initialization perspective this is a
glorified nop.
Avoiding kvm_arch_init() also fixes a mostly benign bug as kvm_arch_exit()
doesn't properly unwind if a later stage of kvm_init() fails. While the
soon-to-be-deleted comment about compiling as a module being unsupported
is correct, kvm_arch_exit() can still be called by kvm_init() if any step
after the call to kvm_arch_init() fails.
Add a FIXME to call out that pKVM initialization isn't unwound if
kvm_init() fails, which is a pre-existing problem inherited from
kvm_arch_exit().
Making kvm_arch_init() a nop will also allow dropping kvm_arch_init() and
kvm_arch_exit() entirely once all other architectures follow suit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Undo everything done by init_subsystems() if a later initialization step
fails, i.e. unregister perf callbacks in addition to unregistering the
power management notifier.
Fixes: bfa79a8054 ("KVM: arm64: Elevate hypervisor mappings creation at EL2")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Teardown hypervisor mode if vector slot setup fails in order to avoid
leaking any allocations done by init_hyp_mode().
Fixes: b881cdce77 ("KVM: arm64: Allocate hyp vectors statically")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For a number of historical reasons, the KVM/arm64 hotplug setup is pretty
complicated, and we have two extra CPUHP notifiers for vGIC and timers.
It looks pretty pointless, and gets in the way of further changes.
So let's just expose some helpers that can be called from the core
CPUHP callback, and get rid of everything else.
This gives us the opportunity to drop a useless notifier entry,
as well as tidy up the timer enable/disable, which was a bit odd.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a97d: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
not-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of an unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight pseudo-RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that triggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
ARMV8_PMU_PMCR_N_MASK is an unshifted value which results in the wrong
reset value for PMCR_EL0, so shift it to fix it.
This fixes the following error when running qemu:
$ qemu-system-aarch64 -cpu host -machine type=virt,accel=kvm -kernel ...
target/arm/helper.c:1813: pmevcntr_rawwrite: Assertion `counter < pmu_num_counters(env)' failed.
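The fix itself is a one-line masking change, roughly:

  /* Only preserve PMCR_EL0.N from the host: the mask is defined
   * unshifted, so shift it into its field position first. */
  u64 pmcr = read_sysreg(pmcr_el0) &
             (ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);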
Fixes: 292e8f1494 ("KVM: arm64: PMU: Simplify PMCR_EL0 reset handling")
Signed-off-by: James Clark <james.clark@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221209164446.1972014-2-james.clark@arm.com
Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.2
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on.
- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
not-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
stage-2 faults and access tracking. You name it, we got it, we
probably broke it.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
As a side effect, this tag also drags:
- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
series
- A shared branch with the arm64 tree that repaints all the system
registers to match the ARM ARM's naming, resulting in
interesting conflicts
* for-next/sve-state:
arm64/fp: Use a struct to pass data to fpsimd_bind_state_to_cpu()
arm64/sve: Leave SVE enabled on syscall if we don't context switch
arm64/fpsimd: SME no longer requires SVE register state
arm64/fpsimd: Load FP state based on recorded data type
arm64/fpsimd: Stop using TIF_SVE to manage register saving in KVM
arm64/fpsimd: Have KVM explicitly say which FP registers to save
arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE
KVM: arm64: Discard any SVE state when entering KVM guests
* kvm-arm64/pmu-unchained:
: .
: PMUv3 fixes and improvements:
:
: - Make the CHAIN event handling strictly follow the architecture
:
: - Add support for PMUv3p5 (64bit counters all the way)
:
: - Various fixes and cleanups
: .
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: arm64: PMU: Sanitise PMCR_EL0.LP on first vcpu run
KVM: arm64: PMU: Simplify PMCR_EL0 reset handling
KVM: arm64: PMU: Replace version number '0' with ID_AA64DFR0_EL1_PMUVer_NI
KVM: arm64: PMU: Make kvm_pmc the main data structure
KVM: arm64: PMU: Simplify vcpu computation on perf overflow notification
KVM: arm64: PMU: Allow PMUv3p5 to be exposed to the guest
KVM: arm64: PMU: Implement PMUv3p5 long counter support
KVM: arm64: PMU: Allow ID_DFR0_EL1.PerfMon to be set from userspace
KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace
KVM: arm64: PMU: Move the ID_AA64DFR0_EL1.PMUver limit to VM creation
KVM: arm64: PMU: Do not let AArch32 change the counters' top 32 bits
KVM: arm64: PMU: Simplify setting a counter to a specific value
KVM: arm64: PMU: Add counter_index_to_*reg() helpers
KVM: arm64: PMU: Only narrow counters that are not 64bit wide
KVM: arm64: PMU: Narrow the overflow checking when required
KVM: arm64: PMU: Distinguish between 64bit counter and 64bit overflow
KVM: arm64: PMU: Always advertise the CHAIN event
KVM: arm64: PMU: Align chained counter implementation with architecture pseudocode
arm64: Add ID_DFR0_EL1.PerfMon values for PMUv3p7 and IMP_DEF
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/mte-map-shared:
: .
: Update the MTE support to allow the VMM to use shared mappings
: to back the memslots exposed to MTE-enabled guests.
:
: Patches courtesy of Catalin Marinas and Peter Collingbourne.
: .
: Fix a number of issues with MTE, such as races on the tags
: being initialised vs the PG_mte_tagged flag as well as the
: lack of support for VM_SHARED when KVM is involved.
:
: Patches from Catalin Marinas and Peter Collingbourne.
: .
Documentation: document the ABI changes for KVM_CAP_ARM_MTE
KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled
KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled
arm64: mte: Lock a page for MTE tag initialisation
mm: Add PG_arch_3 page flag
KVM: arm64: Simplify the sanitise_mte_tags() logic
arm64: mte: Fix/clarify the PG_mte_tagged semantics
mm: Do not enable PG_arch_2 for all 64-bit architectures
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-vcpu-state: (25 commits)
: .
: Large drop of pKVM patches from Will Deacon and co, adding
: a private vm/vcpu state at EL2, managed independently from
: the EL1 state. From the cover letter:
:
: "This is version six of the pKVM EL2 state series, extending the pKVM
: hypervisor code so that it can dynamically instantiate and manage VM
: data structures without the host being able to access them directly.
: These structures consist of a hyp VM, a set of hyp vCPUs and the stage-2
: page-table for the MMU. The pages used to hold the hypervisor structures
: are returned to the host when the VM is destroyed."
: .
KVM: arm64: Use the pKVM hyp vCPU structure in handle___kvm_vcpu_run()
KVM: arm64: Don't unnecessarily map host kernel sections at EL2
KVM: arm64: Explicitly map 'kvm_vgic_global_state' at EL2
KVM: arm64: Maintain a copy of 'kvm_arm_vmid_bits' at EL2
KVM: arm64: Unmap 'kvm_arm_hyp_percpu_base' from the host
KVM: arm64: Return guest memory from EL2 via dedicated teardown memcache
KVM: arm64: Instantiate guest stage-2 page-tables at EL2
KVM: arm64: Consolidate stage-2 initialisation into a single function
KVM: arm64: Add generic hyp_memcache helpers
KVM: arm64: Provide I-cache invalidation by virtual address at EL2
KVM: arm64: Initialise hypervisor copies of host symbols unconditionally
KVM: arm64: Add per-cpu fixmap infrastructure at EL2
KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1
KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
KVM: arm64: Rename 'host_kvm' to 'host_mmu'
KVM: arm64: Add hyp_spinlock_t static initializer
KVM: arm64: Include asm/kvm_mmu.h in nvhe/mem_protect.h
KVM: arm64: Add helpers to pin memory shared with the hypervisor at EL2
KVM: arm64: Prevent the donation of no-map pages
KVM: arm64: Implement do_donate() helper for donating memory
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/parallel-faults:
: .
: Parallel stage-2 fault handling, courtesy of Oliver Upton.
: From the cover letter:
:
: "Presently KVM only takes a read lock for stage 2 faults if it believes
: the fault can be fixed by relaxing permissions on a PTE (write unprotect
: for dirty logging). Otherwise, stage 2 faults grab the write lock, which
: predictably can pile up all the vCPUs in a sufficiently large VM.
:
: Like the TDP MMU for x86, this series loosens the locking around
: manipulations of the stage 2 page tables to allow parallel faults. RCU
: and atomics are exploited to safely build/destroy the stage 2 page
: tables in light of multiple software observers."
: .
KVM: arm64: Reject shared table walks in the hyp code
KVM: arm64: Don't acquire RCU read lock for exclusive table walks
KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref()
KVM: arm64: Handle stage-2 faults in parallel
KVM: arm64: Make table->block changes parallel-aware
KVM: arm64: Make leaf->leaf PTE changes parallel-aware
KVM: arm64: Make block->table PTE changes parallel-aware
KVM: arm64: Split init and set for table PTE
KVM: arm64: Atomically update stage 2 leaf attributes in parallel walks
KVM: arm64: Protect stage-2 traversal with RCU
KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
KVM: arm64: Use an opaque type for pteps
KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
KVM: arm64: Don't pass kvm_pgtable through kvm_pgtable_walk_data
KVM: arm64: Pass mm_ops through the visitor context
KVM: arm64: Stash observed pte value in visitor context
KVM: arm64: Combine visitor arguments into a context structure
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/dirty-ring:
: .
: Add support for the "per-vcpu dirty-ring tracking with a bitmap
: and sprinkles on top", courtesy of Gavin Shan.
:
: This branch drags the kvmarm-fixes-6.1-3 tag which was already
: merged in 6.1-rc4 so that the branch is in a working state.
: .
KVM: Push dirty information unconditionally to backup bitmap
KVM: selftests: Automate choosing dirty ring size in dirty_log_test
KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
KVM: selftests: Use host page size to map ring buffer in dirty_log_test
KVM: arm64: Enable ring-based dirty memory tracking
KVM: Support dirty ring in conjunction with bitmap
KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/52bit-fixes:
: .
: 52bit PA fixes, courtesy of Ryan Roberts. From the cover letter:
:
: "I've been adding support for FEAT_LPA2 to KVM and as part of that work have been
: testing various (84) configurations of HW, host and guest kernels on FVP. This
: has thrown up a couple of pre-existing bugs, for which the fixes are provided."
: .
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: Fix PAR_TO_HPFAR() to work independently of PA_BITS.
KVM: arm64: Fix kvm init failure when mode!=vhe and VA_BITS=52.
Signed-off-by: Marc Zyngier <maz@kernel.org>
get_user_mapping_size() uses kvm's pgtable library to walk a user space
page table created by the kernel, and in doing so, passes metadata
that the library needs, including ia_bits, which defines the size of the
input address.
For the case where the kernel is compiled for 52 VA bits but runs on HW
that does not support LVA, it will fall back to 48 VA bits at runtime.
Therefore we must use vabits_actual rather than VA_BITS to get the true
address size.
This is benign in the current code base because the pgtable library only
uses it for error checking.
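A sketch of the corrected setup (struct layout assumed):

  /* Sketch: describe the walk with the runtime VA size, not the
   * compile-time maximum. */
  struct kvm_pgtable pgt = {
          .ia_bits        = vabits_actual,        /* was: VA_BITS */
          .start_level    = (KVM_PGTABLE_MAX_LEVELS -
                             CONFIG_PGTABLE_LEVELS),
          .mm_ops         = &kvm_user_mm_ops,
  };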
Fixes: 6011cf68c8 ("KVM: arm64: Walk userspace page tables to compute the THP mapping size")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221205114031.3972780-1-ryan.roberts@arm.com
Fix the bogus masking when computing the period of a 64bit counter
with 32bit overflow. It really should be treated like a 32bit counter
for the purpose of the period.
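The resulting computation, as a sketch (predicate names assumed):

  /* Sketch: a 64bit counter with 32bit overflow wraps like a 32bit one. */
  static u64 compute_period(struct kvm_pmc *pmc, u64 counter)
  {
          if (kvm_pmc_is_64bit(pmc) && kvm_pmc_has_64bit_overflow(pmc))
                  return (-counter) & GENMASK(63, 0);

          return (-counter) & GENMASK(31, 0);
  }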
Reported-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/Y4jbosgHbUDI0WF4@google.com
To convert the 32bit id registers to use the sysreg generation, they
must first have a regular pattern, to match the symbols the script
generates.
Ensure symbols for the ID_DFR0_EL1 register have an _EL1 suffix,
and use lower-case for feature names where the arm-arm does the same.
The arm-arm has feature names for some of the ID_DFR0_EL1.PerfMon encodings.
Use these feature names in preference to the '8_4' indication of the
architecture version they were introduced in.
No functional change.
Signed-off-by: James Morse <james.morse@arm.com>
Link: https://lore.kernel.org/r/20221130171637.718182-12-james.morse@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
For reasons that are unclear to this reader, fpsimd_bind_state_to_cpu()
populates the struct fpsimd_last_state_struct that it uses to store the
active floating point state for KVM guests by passing an argument for
each member of the structure. As the richness of the architecture increases
this is resulting in a function with a rather large number of arguments
which isn't ideal.
Simplify the interface by using the struct directly as the single argument
for the function, renaming it as we lift the definition into the header.
This could be built on further to reduce the work we do adding storage for
new FP state in various places but for now it just simplifies this one
interface.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221115094640.112848-9-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Now that we are explicitly telling the host FP code which register state
it needs to save we can remove the manipulation of TIF_SVE from the KVM
code, simplifying it and allowing us to optimise our handling of normal
tasks. Remove the manipulation of TIF_SVE from KVM and instead rely on
to_save to ensure we save the correct data for it.
There should be no functional or performance impact from this change.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221115094640.112848-5-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
In order to avoid needlessly saving and restoring the guest registers KVM
relies on the host FPSIMD code to save the guest registers when we context
switch away from the guest. This is done by binding the KVM guest state to
the CPU on top of the task state that was originally there, then carefully
managing the TIF_SVE flag for the task to cause the host to save the full
SVE state when needed regardless of the needs of the host task. This works
well enough but isn't terribly direct about what is going on and makes it
much more complicated to try to optimise what we're doing with the SVE
register state.
Let's instead have KVM pass in the register state it wants saving when it
binds to the CPU. We introduce a new FP_STATE_CURRENT for use
during normal task binding to indicate that we should base our
decisions on the current task. This should not be used when
actually saving. Ideally we might want to use a separate enum for
the type to save but this enum and the enum values would then
need to be named which has problems with clarity and ambiguity.
In order to ease any future debugging that might be required this patch
does not actually update any of the decision making about what to save,
it merely starts tracking the new information and warns if the requested
state is not what we would otherwise have decided to save.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221115094640.112848-4-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
When we save the state for the floating point registers this can be done
in the form visible through either the FPSIMD V registers or the SVE Z and
P registers. At present we track which format is currently used based on
TIF_SVE and the SME streaming mode state but particularly in the SVE case
this limits our options for optimising things, especially around syscalls.
Introduce a new enum which we place together with saved floating point
state in both thread_struct and the KVM guest state which explicitly
states which format is active and keep it up to date when we change it.
At present we do not use this state except to verify that it has the
expected value when loading the state, future patches will introduce
functional changes.
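A sketch of such an enum (member names beyond FP_STATE_CURRENT are
assumptions):

  enum fp_type {
          FP_STATE_CURRENT,       /* "decide based on the current task" */
          FP_STATE_FPSIMD,        /* state saved in the V registers     */
          FP_STATE_SVE,           /* state saved in the Z/P registers   */
  };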
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221115094640.112848-3-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Since 8383741ab2 (KVM: arm64: Get rid of host SVE tracking/saving)
KVM has not tracked the host SVE state, relying on the fact that we
currently disable SVE whenever we perform a syscall. This may not be true
in future since performance optimisation may result in us keeping SVE
enabled in order to avoid needing to take access traps to reenable it.
Handle this by clearing TIF_SVE and converting the stored task state to
FPSIMD format when preparing to run the guest. This is done with a new
call fpsimd_kvm_prepare() to keep the direct state manipulation
functions internal to fpsimd.c.
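A minimal sketch of what fpsimd_kvm_prepare() amounts to (the
preemption-guard helpers are assumed from the fpsimd.c internals):

  /* Sketch: convert any saved SVE state to FPSIMD form before running
   * a guest, so KVM no longer depends on TIF_SVE being clear. */
  void fpsimd_kvm_prepare(void)
  {
          if (!system_supports_sve())
                  return;

          get_cpu_fpsimd_context();       /* keep preemption at bay */
          if (test_and_clear_thread_flag(TIF_SVE))
                  sve_to_fpsimd(current);
          put_cpu_fpsimd_context();
  }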
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221115094640.112848-2-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Certain VMMs such as crosvm have features (e.g. sandboxing) that depend
on being able to map guest memory as MAP_SHARED. The current restriction
on sharing MAP_SHARED pages with the guest is preventing the use of
those features with MTE. Now that the races between tasks concurrently
clearing tags on the same page have been fixed, remove this restriction.
Note that this is a relaxation of the ABI.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-8-pcc@google.com
Previously we allowed creating a memslot containing a private mapping that
was not VM_MTE_ALLOWED, but would later reject KVM_RUN with -EFAULT. Now
we reject the memory region at memslot creation time.
Since this is a minor tweak to the ABI (a VMM that created one of
these memslots would fail later anyway), no VMM to my knowledge has
MTE support yet, and the hardware with the necessary features is not
generally available, we can probably make this ABI change at this point.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-7-pcc@google.com
Initialising the tags and setting PG_mte_tagged flag for a page can race
between multiple set_pte_at() on shared pages or setting the stage 2 pte
via user_mem_abort(). Introduce a new PG_mte_lock flag as PG_arch_3 and
set it before attempting page initialisation. Given that PG_mte_tagged
is never cleared for a page, consider setting this flag to mean page
unlocked and wait on this bit with acquire semantics if the page is
locked:
- try_page_mte_tagging() - lock the page for tagging, return true if it
can be tagged, false if already tagged. No acquire semantics if it
returns true (PG_mte_tagged not set) as there is no serialisation with
a previous set_page_mte_tagged().
- set_page_mte_tagged() - set PG_mte_tagged with release semantics.
The two-bit locking is based on Peter Collingbourne's idea.
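A sketch of the locking pair (bit names per the description above; the
exact barriers are the subtle part):

  /* Sketch: PG_mte_lock (PG_arch_3) set means "locked or already tagged". */
  static inline bool try_page_mte_tagging(struct page *page)
  {
          if (!test_and_set_bit(PG_mte_lock, &page->flags))
                  return true;    /* we own the page, go initialise tags */

          /*
           * Already locked by someone else: wait for PG_mte_tagged with
           * acquire semantics so their tags are visible to us.
           */
          smp_cond_load_acquire(&page->flags, VAL & (1UL << PG_mte_tagged));
          return false;
  }

  static inline void set_page_mte_tagged(struct page *page)
  {
          smp_wmb();      /* make the tag writes visible before the flag */
          set_bit(PG_mte_tagged, &page->flags);
  }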
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-6-pcc@google.com
Currently sanitise_mte_tags() checks if it's an online page before
attempting to sanitise the tags. Such detection should be done in the
caller via the VM_MTE_ALLOWED vma flag. Since kvm_set_spte_gfn() does
not have the vma, leave the page unmapped if not already tagged. Tag
initialisation will be done on a subsequent access fault in
user_mem_abort().
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[pcc@google.com: fix the page initializer]
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-4-pcc@google.com
Currently the PG_mte_tagged page flag mostly means the page contains
valid tags and it should be set after the tags have been cleared or
restored. However, in mte_sync_tags() it is set before setting the tags
to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for
shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on
write in another thread can cause the new page to have stale tags.
Similarly, tag reading via ptrace() can read stale tags if the
PG_mte_tagged flag is set before actually clearing/restoring the tags.
Fix the PG_mte_tagged semantics so that it is only set after the tags
have been cleared or restored. This is safe for swap restoring into a
MAP_SHARED or CoW page since the core code takes the page lock. Add two
functions to test and set the PG_mte_tagged flag with acquire and
release semantics. The downside is that concurrent mprotect(PROT_MTE) on
a MAP_SHARED page may cause tag loss. This is already the case for KVM
guests if a VMM changes the page protection while the guest triggers a
user_mem_abort().
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[pcc@google.com: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-3-pcc@google.com
Userspace can play some dirty tricks on us by selecting a given
PMU version (such as PMUv3p5), restoring a PMCR_EL0 value that
has PMCR_EL0.LP set, and then switching the PMU version to PMUv3p1,
for example. In this situation, we end up with PMCR_EL0.LP being
set and spreading havoc in the PMU emulation.
This is especially hard as the first two steps can be done on
one vcpu and the third step on another, meaning that we need
to sanitise *all* vcpus when the PMU version is changed.
In order to avoid a pretty complicated locking situation,
defer the sanitisation of PMCR_EL0 to the point where the
vcpu is actually run for the first time, using the existing
KVM_REQ_RELOAD_PMU request that calls into kvm_pmu_handle_pmcr().
There is still an obscure corner case where userspace could
do the above trick, and then save the VM without running it.
They would then observe an inconsistent state (PMUv3.1 + LP set),
but that state will be fixed on the first run anyway whenever
the guest gets restored on a host.
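Sketched out, the deferred sanitisation amounts to a mask at the top of
kvm_pmu_handle_pmcr() (the kvm_pmu_is_3p5() helper comes from earlier
in this series; its use here is an assumption):
```
void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
{
	/* Sketch: PMCR_EL0.LP is RES0 before PMUv3p5, drop it here */
	if (!kvm_pmu_is_3p5(vcpu))
		val &= ~ARMV8_PMU_PMCR_LP;

	/* ... existing PMCR_EL0 handling continues below ... */
}
```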
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Resetting PMCR_EL0 is a pretty involved process that includes
poisoning some of the writable bits, just because we can.
It makes it hard to reason about what gets configured,
and just resetting things to 0 seems like a much saner option.
Reduce reset_pmcr() to just preserving PMCR_EL0.N from the host,
and setting PMCR_EL0.LC if we don't support AArch32.
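A minimal sketch of the resulting reset handler, assuming the usual
PMCR field accessors:
```
static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
{
	u64 pmcr;

	/* Only preserve PMCR_EL0.N from the host, zero everything else */
	pmcr = read_sysreg(pmcr_el0) &
	       (ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);

	/* Set PMCR_EL0.LC if AArch32 is not supported at EL0 */
	if (!kvm_supports_32bit_el0())
		pmcr |= ARMV8_PMU_PMCR_LC;

	__vcpu_sys_reg(vcpu, r->reg) = pmcr;
}
```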
Signed-off-by: Marc Zyngier <maz@kernel.org>
kvm_host_pmu_init() returns early when the detected PMU is either not
implemented or implementation defined; kvm_pmu_probe_armpmu() is in a
similar situation. The extracted ID_AA64DFR0_EL1_PMUVer value is '0'
when the PMU is not implemented, which can be replaced with
ID_AA64DFR0_EL1_PMUVer_NI, defined as '0b0000'.
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221128135629.118346-1-anshuman.khandual@arm.com
Exclusive table walks are the only supported table walk in the hyp, as
there is no construct like RCU available in the hypervisor code. Reject
any attempt to do a shared table walk by returning an error and allowing
the caller to clean up the mess.
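In sketch form (assuming the walker flags carry the shared/exclusive
distinction):
```
/* Sketch: the hyp flavour of the walk prologue rejects shared walks,
 * as there is no RCU-like construct available at EL2. */
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
{
	if (walker->flags & KVM_PGTABLE_WALK_SHARED)
		return -EPERM;

	return 0;
}
```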
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-4-oliver.upton@linux.dev
Rather than passing through the state of the KVM_PGTABLE_WALK_SHARED
flag, just take a pointer to the whole walker structure instead. Move
around struct kvm_pgtable and the RCU indirection such that the
associated ifdeffery remains in one place while ensuring the walker +
flags definitions precede their use.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-2-oliver.upton@linux.dev
The PMU code has historically been torn between referencing a counter
as a pair vcpu+index or as the PMC pointer.
Given that it is pretty easy to go from one representation to
the other, standardise on the latter which, IMHO, makes the
code slightly more readable. YMMV.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-17-maz@kernel.org
The way we compute the target vcpu on getting an overflow is
a bit odd, as we use the PMC array as an anchor for kvm_pmc_to_vcpu,
while we could directly compute the correct address.
Get rid of the intermediate step and directly compute the target
vcpu.
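The direct computation boils down to a container_of() on the PMC (a
sketch, assuming the PMC records its own index):
```
static struct kvm_vcpu *kvm_pmc_to_vcpu(const struct kvm_pmc *pmc)
{
	/* Compute the vcpu address directly from the PMC pointer */
	return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[pmc->idx]);
}
```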
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-16-maz@kernel.org
PMUv3p5 (which is mandatory with ARMv8.5) comes with some extra
features:
- All counters are 64bit
- The overflow point is controlled by the PMCR_EL0.LP bit
Add the required checks in the helpers that control counter
width and overflow, as well as the sysreg handling for the LP
bit. A new kvm_pmu_is_3p5() helper makes it easy to spot the
PMUv3p5 specific handling.
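The helper itself can be as small as a version comparison (the field
holding the sampled PMU version is an assumption here):
```
/* Sketch: true if the guest's PMU is at least PMUv3p5 */
#define kvm_pmu_is_3p5(vcpu)						\
	(vcpu->kvm->arch.dfr0_pmuver >= ID_AA64DFR0_EL1_PMUVer_V3P5)
```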
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-14-maz@kernel.org
Allow userspace to write ID_DFR0_EL1, on the condition that only
the PerfMon field can be altered and be something that is compatible
with what was computed for the AArch64 view of the guest.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-13-maz@kernel.org
Allow userspace to write ID_AA64DFR0_EL1, on the condition that only
the PMUver field can be altered and be at most the one that was
initially computed for the guest.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-12-maz@kernel.org
As further patches will enable the selection of a PMU revision
from userspace, sample the supported PMU revision at VM creation
time, rather than building each time the ID_AA64DFR0_EL1 register
is accessed.
This shouldn't result in any change in behaviour.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-11-maz@kernel.org
Even when using PMUv3p5 (which implies 64bit counters), there is
no way for AArch32 to write to the top 32 bits of the counters.
The only way to influence these bits (other than by counting
events) is by writing PMCR.P==1.
Make sure we obey the architecture and preserve the top 32 bits
on a counter update.
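Sketched as a fragment of the counter update path (variable names
assumed), the fix merges the old top half back in:
```
	/* Sketch: AArch32 can only write the bottom 32 bits, so keep
	 * the top 32 bits of the existing counter value. */
	if (vcpu_mode_is_32bit(vcpu) && select_idx != ARMV8_PMU_CYCLE_IDX) {
		val  = lower_32_bits(val);
		val |= upper_32_bits(__vcpu_sys_reg(vcpu, reg));
	}
```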
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-10-maz@kernel.org
kvm_pmu_set_counter_value() is pretty odd, as it tries to update
the counter value while taking into account the value that is
currently held by the running perf counter.
This is not only complicated, this is quite wrong. Nowhere in
the architecture is it said that the counter would be offset
by something that is pending. The counter should be updated
with the value set by SW, and start counting from there if
required.
Remove the odd computation and just assign the provided value
after having released the perf event (which is then restarted).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-9-maz@kernel.org
In order to reduce the boilerplate code, add two helpers returning
the counter register index (resp. the event register) in the vcpu
register file from the counter index.
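A sketch of the two helpers:
```
/* Sketch: counter index -> counter register in the vcpu file */
static u64 counter_index_to_reg(u64 idx)
{
	return (idx == ARMV8_PMU_CYCLE_IDX) ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + idx;
}

/* Sketch: counter index -> event type register in the vcpu file */
static u64 counter_index_to_evtreg(u64 idx)
{
	return (idx == ARMV8_PMU_CYCLE_IDX) ? PMCCFILTR_EL0 : PMEVTYPER0_EL0 + idx;
}
```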
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-8-maz@kernel.org
The current PMU emulation sometimes narrows counters to 32bit
if the counter isn't the cycle counter. As this is going to
change with PMUv3p5 where the counters are all 64bit, fix
the couple of cases where this happens unconditionally.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20221113163832.3154370-7-maz@kernel.org
For 64bit counters that overflow on a 32bit boundary, make
sure we only check the bottom 32bit to generate a CHAIN event.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20221113163832.3154370-6-maz@kernel.org
The PMU architecture makes a subtle difference between a 64bit
counter and a counter that has a 64bit overflow. This is for example
the case of the cycle counter, which can generate an overflow on
a 32bit boundary if PMCR_EL0.LC==0 despite the accumulation being
done on 64 bits.
Use this distinction in the few cases where it matters in the code,
as we will reuse this with PMUv3p5 long counters.
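The distinction can be captured in a small predicate; the sketch below
also folds in the PMUv3p5 PMCR_EL0.LP case that a later patch in the
series adds:
```
/* Sketch: does this counter overflow on a 64bit boundary? */
static bool kvm_pmu_idx_has_64bit_overflow(struct kvm_vcpu *vcpu, u64 select_idx)
{
	u64 val = __vcpu_sys_reg(vcpu, PMCR_EL0);

	return (select_idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) ||
	       (select_idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC));
}
```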
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-5-maz@kernel.org
Even when the underlying HW doesn't offer the CHAIN event
(which happens with QEMU), we can always support it as we're
in control of the counter overflow.
Always advertise the event via PMCEID0_EL0.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221113163832.3154370-4-maz@kernel.org
Ricardo recently pointed out that the PMU chained counter emulation
in KVM wasn't quite behaving like the one on actual hardware, in
the sense that a chained counter would expose an overflow on
both halves of a chained counter, while KVM would only expose the
overflow on the top half.
The difference is subtle, but significant. What does the architecture
say (DDI0487 H.a):
- Up to PMUv3p4, all counters but the cycle counter are 32bit
- A 32bit counter that overflows generates a CHAIN event on the
adjacent counter after exposing its own overflow status
- The CHAIN event is accounted if the counter is correctly
configured (CHAIN event selected and counter enabled)
This all means that our current implementation (which uses 64bit
perf events) prevents us from emulating this overflow on the lower half.
How to fix this? By implementing the above, to the letter.
This largely results in code deletion, removing the notions of
"counter pair", "chained counters", and "canonical counter".
The code is further restructured to make the CHAIN handling similar
to SWINC, as the two are now extremely similar in behaviour.
Reported-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20221113163832.3154370-3-maz@kernel.org
As a stepping stone towards deprivileging the host's access to the
guest's vCPU structures, introduce some naive flush/sync routines to
copy most of the host vCPU into the hyp vCPU on vCPU run and back
again on return to EL1.
This allows us to run using the pKVM hyp structures when KVM is
initialised in protected mode.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-27-will@kernel.org
We no longer need to map the host's '.rodata' and '.bss' sections in the
stage-1 page-table of the pKVM hypervisor at EL2, so remove those
mappings and avoid creating any future dependencies at EL2 on
host-controlled data structures.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-25-will@kernel.org
The pkvm hypervisor at EL2 may need to read the 'kvm_vgic_global_state'
variable from the host, for example when saving and restoring the state
of the virtual GIC.
Explicitly map 'kvm_vgic_global_state' in the stage-1 page-table of the
pKVM hypervisor rather than relying on mapping all of the host '.rodata'
section.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-24-will@kernel.org
Sharing 'kvm_arm_vmid_bits' between EL1 and EL2 allows the host to
modify the variable arbitrarily, potentially leading to all sorts of
shenanigans as this is used to configure the VTTBR register for the
guest stage-2.
In preparation for unmapping host sections entirely from EL2, maintain
a copy of 'kvm_arm_vmid_bits' in the pKVM hypervisor and initialise it
from the host value while it is still trusted.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-23-will@kernel.org
When pKVM is enabled, the hypervisor at EL2 does not trust the host at
EL1 and must therefore prevent it from having unrestricted access to
internal hypervisor state.
The 'kvm_arm_hyp_percpu_base' array holds the offsets for hypervisor
per-cpu allocations, so move this into the nVHE code where it
cannot be modified by the untrusted host at EL1.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-22-will@kernel.org
Rather than relying on the host to free the previously-donated pKVM
hypervisor VM pages explicitly on teardown, introduce a dedicated
teardown memcache which allows the host to reclaim guest memory
resources without having to keep track of all of the allocations made by
the pKVM hypervisor at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
[maz: dropped __maybe_unused from unmap_donated_memory_noclear()]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-21-will@kernel.org
Extend the initialisation of guest data structures within the pKVM
hypervisor at EL2 so that we instantiate a memory pool and a full
'struct kvm_s2_mmu' structure for each VM, with a stage-2 page-table
entirely independent from the one managed by the host at EL1.
The 'struct kvm_pgtable_mm_ops' used by the page-table code is populated
with a set of callbacks that can manage guest pages in the hypervisor
without any direct intervention from the host, allocating page-table
pages from the provided pool and returning these to the host on VM
teardown. To keep things simple, the stage-2 MMU for the guest is
configured identically to the host stage-2 in the VTCR register and so
the IPA size of the guest must match the PA size of the host.
For now, the new page-table is unused as there is no way for the host
to map anything into it. Yet.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-20-will@kernel.org
The initialisation of guest stage-2 page-tables is currently split
across two functions: kvm_init_stage2_mmu() and kvm_arm_setup_stage2().
That is presumably for historical reasons as kvm_arm_setup_stage2()
originates from the (now defunct) KVM port for 32-bit Arm.
Simplify this code path by merging both functions into one, taking care
to map the 'struct kvm' into the hypervisor stage-1 early on in order to
simplify the failure path.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-19-will@kernel.org
The host at EL1 and the pKVM hypervisor at EL2 will soon need to
exchange memory pages dynamically for creating and destroying VM state.
Indeed, the hypervisor will rely on the host to donate memory pages it
can use to create guest stage-2 page-tables and to store VM and vCPU
metadata. In order to ease this process, introduce a
'struct hyp_memcache' which is essentially a linked list of available
pages, indexed by physical addresses so that it can be passed
meaningfully between the different virtual address spaces configured at
EL1 and EL2.
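The structure itself can stay tiny, as each free page stores the
physical address of the next one (a sketch):
```
/* Sketch: PA-indexed page list usable from both EL1 and EL2 */
struct hyp_memcache {
	phys_addr_t head;	/* PA of the first page in the cache */
	unsigned long nr_pages;
};
```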
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-18-will@kernel.org
In preparation for handling cache maintenance of guest pages from within
the pKVM hypervisor at EL2, introduce an EL2 copy of icache_inval_pou()
which will later be plumbed into the stage-2 page-table cache
maintenance callbacks, ensuring that the initial contents of pages
mapped as executable into the guest stage-2 page-table are visible to the
instruction fetcher.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-17-will@kernel.org
The nVHE object at EL2 maintains its own copies of some host variables
so that, when pKVM is enabled, the host cannot directly modify the
hypervisor state. When running in normal nVHE mode, however, these
variables are still mirrored at EL2 but are not initialised.
Initialise the hypervisor symbols from the host copies regardless of
pKVM, ensuring that any reference to this data at EL2 with normal nVHE
will return a sensibly initialised value.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-16-will@kernel.org
Mapping pages in a guest page-table from within the pKVM hypervisor at
EL2 may require cache maintenance to ensure that the initialised page
contents are visible even to non-cacheable (e.g. MMU-off) accesses from
the guest.
In preparation for performing this maintenance at EL2, introduce a
per-vCPU fixmap which allows the pKVM hypervisor to map guest pages
temporarily into its stage-1 page-table for the purposes of cache
maintenance and, in future, poisoning on the reclaim path. The use of a
fixmap avoids the need for memory allocation or locking on the map()
path.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Co-developed-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-15-will@kernel.org
With the pKVM hypervisor at EL2 now offering hypercalls to the host for
creating and destroying VM and vCPU structures, plumb these in to the
existing arm64 KVM backend to ensure that the hypervisor data structures
are allocated and initialised on first vCPU run for a pKVM guest.
In the host, 'struct kvm_protected_vm' is introduced to hold the handle
of the pKVM VM instance as well as to track references to the memory
donated to the hypervisor so that it can be freed back to the host
allocator following VM teardown. The stage-2 page-table, hypervisor VM
and vCPU structures are allocated separately so as to avoid the need for
a large physically-contiguous allocation in the host at run-time.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-14-will@kernel.org
Introduce a global table (and lock) to track pKVM instances at EL2, and
provide hypercalls that can be used by the untrusted host to create and
destroy pKVM VMs and their vCPUs. pKVM VM/vCPU state is directly
accessible only by the trusted hypervisor (EL2).
Each pKVM VM is directly associated with an untrusted host KVM instance,
and is referenced by the host using an opaque handle. Future patches
will provide hypercalls to allow the host to initialize/set/get pKVM
VM/vCPU state using the opaque handle.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
[maz: silence warning on unmap_donated_memory_noclear()]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-13-will@kernel.org
In preparation for introducing VM and vCPU state at EL2, rename the
existing 'struct host_kvm' and its singleton 'host_kvm' instance to
'host_mmu' so as to avoid confusion between the structure tracking the
host stage-2 MMU state and the host instance of a 'struct kvm' for a
protected guest.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-12-will@kernel.org
Introduce a static initializer macro for 'hyp_spinlock_t' so that it is
straightforward to instantiate global locks at EL2. This will be later
utilised for locking the VM table in the hypervisor.
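A sketch of such an initializer, assuming the lock is backed by a
single word:
```
#define __HYP_SPIN_LOCK_INITIALIZER \
	{ .__val = 0 }

#define __HYP_SPIN_LOCK_UNLOCKED \
	((hyp_spinlock_t) __HYP_SPIN_LOCK_INITIALIZER)

/* Sketch: declare a global hyp lock with no runtime initialisation */
#define DEFINE_HYP_SPINLOCK(x)	hyp_spinlock_t x = __HYP_SPIN_LOCK_UNLOCKED
```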
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-11-will@kernel.org
nvhe/mem_protect.h refers to __load_stage2() in the definition of
__load_host_stage2() but doesn't include the relevant header.
Include asm/kvm_mmu.h in nvhe/mem_protect.h so that users of the latter
don't have to do this themselves.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-10-will@kernel.org
Add helpers allowing the hypervisor to check whether a range of pages
are currently shared by the host, and 'pin' them if so by blocking host
unshare operations until the memory has been unpinned.
This will allow the hypervisor to take references on host-provided
data-structures (e.g. 'struct kvm') with the guarantee that these pages
will remain in a stable state until the hypervisor decides to release
them, for example during guest teardown.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-9-will@kernel.org
Memory regions marked as "no-map" in the host device-tree routinely
include TrustZone carve-outs and DMA pools. Although donating such pages
to the hypervisor may not breach confidentiality, it could be used to
corrupt its state in uncontrollable ways. To prevent this, let's block
host-initiated memory transitions targeting "no-map" pages altogether in
nVHE protected mode as there should be no valid reason to do this in
current operation.
Thankfully, the pKVM EL2 hypervisor has a full copy of the host's list
of memblock regions, so we can easily check for the presence of the
MEMBLOCK_NOMAP flag on a region containing pages being donated from the
host.
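A sketch of the check, with the range-lookup helper names being
assumptions:
```
/* Sketch: only allow transitions of pages in mapped memblock regions */
static bool range_is_allowed_memory(u64 start, u64 end)
{
	struct kvm_mem_range r;
	struct memblock_region *reg;

	reg = find_mem_range(start, &r);
	if (!reg || reg->flags & MEMBLOCK_NOMAP)
		return false;	/* refuse "no-map" pages */

	return is_in_mem_range(end - 1, &r);
}
```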
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-8-will@kernel.org
Transferring ownership information of a memory region from one component
to another can be achieved using a "donate" operation, which results
in the previous owner losing access to the underlying pages entirely
and the new owner having exclusive access to the page.
Implement a do_donate() helper, along the same lines as do_{un,}share,
and provide this functionality for the host-{to,from}-hyp cases as this
will later be used to donate/reclaim memory pages to store VM metadata
at EL2.
In a similar manner to the sharing transitions, permission checks are
performed by the hypervisor to ensure that the component initiating the
transition really is the owner of the page and also that the completer
does not currently have a page mapped at the target address.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-7-will@kernel.org
The 'pkvm_component_id' enum type provides constants to refer to the
host and the hypervisor, yet this information is duplicated by the
'pkvm_hyp_id' constant.
Remove the definition of 'pkvm_hyp_id' and move the 'pkvm_component_id'
type definition to 'mem_protect.h' so that it can be used outside of
the memory protection code, for example when initialising the owner for
hypervisor-owned pages.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-6-will@kernel.org
In order to allow unmapping arbitrary memory pages from the hypervisor
stage-1 page-table, fix up the initial refcount for pages that have been
mapped before the 'vmemmap' array was up and running so that it
accurately accounts for all existing hypervisor mappings.
This is achieved by traversing the entire hypervisor stage-1 page-table
during initialisation of EL2 and updating the corresponding
'struct hyp_page' for each valid mapping.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-5-will@kernel.org
The EL2 'vmemmap' array in nVHE Protected mode is currently very sparse:
only memory pages owned by the hypervisor itself have a matching 'struct
hyp_page'. However, as the size of this struct has been reduced
significantly since its introduction, it appears that we can now afford
to back the vmemmap for all of memory.
Having an easily accessible 'struct hyp_page' for every physical page in
memory provides the hypervisor with a simple mechanism to store metadata
(e.g. a refcount) that wouldn't otherwise fit in the very limited number
of software bits available in the host stage-2 page-table entries. This
will be used in subsequent patches when pinning host memory pages for
use by the hypervisor at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-4-will@kernel.org
All the contiguous pages used to initialize a 'struct hyp_pool' are
considered coalescable, which means that the hyp page allocator will
actively try to merge them with their buddies on the hyp_put_page() path.
However, using hyp_put_page() on a page that is not part of the initial
memory range given to a hyp_pool is currently unsupported.
In order to allow dynamically extending hyp pools at run-time, add a
check to __hyp_attach_page() to allow inserting 'external' pages into
the free-list of order 0. This will be necessary to allow lazy donation
of pages from the host to the hypervisor when allocating guest stage-2
page-table pages at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-3-will@kernel.org
We will soon need to manipulate 'struct hyp_page' refcounts from outside
page_alloc.c, so move the helpers to a common header file to allow them
to be reused easily.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-2-will@kernel.org
For nvhe and protected modes, the hyp stage 1 page-tables were previously
configured to have the same number of VA bits as the kernel's idmap.
However, for kernel configs with VA_BITS=52 and where the kernel is
loaded in physical memory below 48 bits, the idmap VA bits are actually
smaller than the kernel's normal stage 1 VA bits. This can lead to
kernel addresses that can't be mapped into the hypervisor, leading to
kvm initialization failure during boot:
kvm [1]: IPA Size Limit: 48 bits
kvm [1]: Cannot map world-switch code
kvm [1]: error initializing Hyp mode: -34
Fix this by ensuring that the hyp stage 1 VA size is the maximum of
what's used for the idmap and the regular kernel stage 1. At the same
time, refactor the code so that the hyp VA bits are only calculated in
one place.
Prior to 7ba8f2b2d6, the idmap was always 52 bits for a 52-bit VA
kernel and therefore the hyp stage1 was also always 52 bits.
Fixes: 7ba8f2b2d6 ("arm64: mm: use a 48-bit ID map when possible on 52-bit VA builds")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
[maz: commit message fixes]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221103150507.32948-2-ryan.roberts@arm.com
The stage-2 map walker has been made parallel-aware, and as such can be
called while only holding the read side of the MMU lock. Rip out the
conditional locking in user_mem_abort() and instead grab the read lock.
Continue to take the write lock from other callsites to
kvm_pgtable_stage2_map().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107220033.1895655-1-oliver.upton@linux.dev
stage2_map_walker_try_leaf() and friends now handle stage-2 PTEs
generically, and perform the correct flush when a table PTE is removed.
Additionally, they've been made parallel-aware, using an atomic break
to take ownership of the PTE.
Stop clearing the PTE in the pre-order callback and instead let
stage2_map_walker_try_leaf() deal with it.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107220006.1895572-1-oliver.upton@linux.dev
Convert stage2_map_walker_try_leaf() to use the new break-before-make
helpers, thereby making the handler parallel-aware. As before, avoid the
break-before-make if recreating the existing mapping. Additionally,
retry execution if another vCPU thread is modifying the same PTE.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215934.1895478-1-oliver.upton@linux.dev
In order to service stage-2 faults in parallel, stage-2 table walkers
must take exclusive ownership of the PTE being worked on. An additional
requirement of the architecture is that software must perform a
'break-before-make' operation when changing the block size used for
mapping memory.
Roll these two concepts together into helpers for performing a
'break-before-make' sequence. Use a special PTE value to indicate a PTE
has been locked by a software walker. Additionally, use an atomic
compare-exchange to 'break' the PTE when the stage-2 page tables are
possibly shared with another software walker. Elide the DSB + TLBI if
the evicted PTE was invalid (and thus not subject to break-before-make).
All of the atomics do nothing for now, as the stage-2 walker isn't fully
ready to perform parallel walks.
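A sketch of the 'break' half of the sequence (context field names
assumed):
```
/* Sketch: take exclusive ownership of a PTE with a magic locked value */
static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
				 struct kvm_s2_mmu *mmu)
{
	/* Atomic break: fails if another walker owns or changed the PTE */
	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
		return false;

	/* Elide DSB + TLBI if the evicted PTE was invalid */
	if (kvm_pte_valid(ctx->old)) {
		dsb(ishst);
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr,
			     ctx->level);
	}

	return true;
}
```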
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215855.1895367-1-oliver.upton@linux.dev
Create a helper to initialize a table and directly call
smp_store_release() to install it (for now). Prepare for a subsequent
change that generalizes PTE writes with a helper.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-11-oliver.upton@linux.dev
The stage2 attr walker is already used for parallel walks. Since commit
f783ef1c0e ("KVM: arm64: Add fast path to handle permission relaxation
during dirty logging"), KVM acquires the read lock when
write-unprotecting a PTE. However, the walker only uses a simple store
to update the PTE. This is safe as the only possible race is with
hardware updates to the access flag, which is benign.
However, a subsequent change to KVM will allow more changes to the stage
2 page tables to be done in parallel. Prepare the stage 2 attribute
walker by performing atomic updates to the PTE when walking in parallel.
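In sketch form, the PTE update helper degrades to a plain store for
exclusive walks:
```
/* Sketch: cmpxchg only when the walk may race with other walkers */
static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx,
			       kvm_pte_t new)
{
	if (!kvm_pgtable_walk_shared(ctx)) {
		WRITE_ONCE(*ctx->ptep, new);
		return true;
	}

	return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
}
```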
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-10-oliver.upton@linux.dev
Use RCU to safely walk the stage-2 page tables in parallel. Acquire and
release the RCU read lock when traversing the page tables. Defer the
freeing of table memory to an RCU callback. Indirect the calls into RCU
and provide stubs for hypervisor code, as RCU is not available in such a
context.
The RCU protection doesn't amount to much at the moment, as readers are
already protected by the read-write lock (all walkers that free table
memory take the write lock). Nonetheless, a subsequent change will
further relax the locking requirements around the stage-2 MMU, thereby
depending on RCU.
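The indirection can be as simple as the sketch below, with empty stubs
for the hypervisor:
```
#if defined(__KVM_NVHE_HYPERVISOR__)
/* Sketch: no RCU at EL2, so the walk markers are no-ops there */
static inline void kvm_pgtable_walk_begin(void) {}
static inline void kvm_pgtable_walk_end(void) {}
#else
static inline void kvm_pgtable_walk_begin(void)
{
	rcu_read_lock();
}

static inline void kvm_pgtable_walk_end(void)
{
	rcu_read_unlock();
}
#endif
```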
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-9-oliver.upton@linux.dev
The break-before-make sequence is a bit annoying as it opens a window
wherein memory is unmapped from the guest. KVM should replace the PTE
as quickly as possible and avoid unnecessary work in between.
Presently, the stage-2 map walker tears down a removed table before
installing a block mapping when coalescing a table into a block. As the
removed table is no longer visible to hardware walkers after the
DSB+TLBI, it is possible to move the remaining cleanup to happen after
installing the new PTE.
Reshuffle the stage-2 map walker to install the new block entry in
the pre-order callback. Unwire all of the teardown logic and replace
it with a call to kvm_pgtable_stage2_free_removed() after fixing
the PTE. The post-order visitor is now completely unnecessary, so drop
it. Finally, touch up the comments to better represent the now
simplified map walker.
Note that the call to tear down the unlinked stage-2 is indirected
as a subsequent change will use an RCU callback to trigger tear down.
RCU is not available to pKVM, so there is a need to use different
implementations on pKVM and non-pKVM VMs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-8-oliver.upton@linux.dev
Use an opaque type for pteps and require visitors to explicitly
dereference the pointer before using it. Protecting page table memory
with RCU requires that KVM dereference RCU-annotated pointers before
using them. However, RCU
is not available for use in the nVHE hypervisor and the opaque type can
be conditionally annotated with RCU for the stage-2 MMU.
Call the type a 'pteref' to avoid a naming collision with raw pteps. No
functional change intended.
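A sketch of the opaque type and its checked dereference:
```
/* Sketch: RCU-annotated for the stage-2 MMU, plain pointer otherwise */
typedef kvm_pte_t __rcu *kvm_pteref_t;

static inline kvm_pte_t *kvm_dereference_pteref(kvm_pteref_t pteref, bool shared)
{
	/* Exclusive walkers may dereference outside an RCU read section */
	return rcu_dereference_check(pteref, !shared);
}
```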
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-7-oliver.upton@linux.dev
A subsequent change to KVM will move the tear down of an unlinked
stage-2 subtree out of the critical path of the break-before-make
sequence.
Introduce a new helper for tearing down unlinked stage-2 subtrees.
Leverage the existing stage-2 free walkers to do so, with a deep call
into __kvm_pgtable_walk() as the subtree is no longer reachable from the
root.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-6-oliver.upton@linux.dev
In order to tear down page tables from outside the context of
kvm_pgtable (such as an RCU callback), stop passing a pointer through
kvm_pgtable_walk_data.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-5-oliver.upton@linux.dev
As a prerequisite for getting visitors off of struct kvm_pgtable, pass
mm_ops through the visitor context.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-4-oliver.upton@linux.dev
Rather than reading the ptep all over the shop, read the ptep once from
__kvm_pgtable_visit() and stick it in the visitor context. Reread the
ptep after visiting a leaf in case the callback installed a new table
underneath.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-3-oliver.upton@linux.dev
Passing new arguments by value to the visitor callbacks is extremely
inflexible for stuffing new parameters used by only some of the
visitors. Use a context structure instead and pass the pointer through
to the visitor callback.
While at it, redefine the 'flags' parameter to the visitor to contain
the bit indicating the phase of the walk. Pass the entire set of flags
through the context structure such that the walker can communicate
additional state to the visitor callback.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221107215644.1895162-2-oliver.upton@linux.dev
Enable ring-based dirty memory tracking on ARM64:
- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL.
- Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
offset.
- Add ARM64 specific kvm_arch_allow_write_without_running_vcpu() to
keep the site of saving vgic/its tables out of the no-running-vcpu
radar.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-5-gshan@redhat.com
Enable asynchronous unwind table generation for both the core kernel as
well as modules, and emit the resulting .eh_frame sections as init code
so we can use the unwind directives for code patching at boot or module
load time.
This will be used by dynamic shadow call stack support, which will rely
on code patching rather than compiler codegen to emit the shadow call
stack push and pop instructions.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20221027155908.1940624-2-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
directory (i.e. not from arch/*/include) for the sole purpose of retrieving
irqchip_in_kernel.
Making the function inline in a header that is already included,
such as asm/kvm_host.h, is not possible because it needs to look at
struct kvm which is defined after asm/kvm_host.h is included. So add a
kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
only performance critical on arm64 and x86, and the non-inline function
is enough on all other architectures.
irq.h can then be deleted from all architectures except x86.
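A sketch of the non-inline wrapper (each architecture keeps its own
irqchip_in_kernel()):
```
bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
{
	return irqchip_in_kernel(kvm);
}
```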
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.
This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even while e.g. handling a userfaultfd page fault.
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The trapping of SMPRI_EL1 and TPIDR2_EL0 currently only really
works on nVHE, as only this mode uses the fine-grained trapping
that controls these two registers.
Move the trapping enable/disable code into
__{de,}activate_traps_common(), allowing it to be called when it
actually matters on VHE, and remove the flipping of EL2 control
for TPIDR2_EL0, which only affects the host access of this
register.
Fixes: 861262ab86 ("KVM: arm64: Handle SME host state when running guests")
Reported-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/86bkpqer4z.wl-maz@kernel.org
enter_exception64() performs an MTE check, which involves dereferencing
vcpu->kvm. While vcpu has already been fixed up to be a HYP VA pointer,
kvm is still a pointer in the kernel VA space.
This only affects nVHE configurations with MTE enabled, as in other
cases, the pointer is either valid (VHE) or not dereferenced (!MTE).
Fix this by first converting kvm to a HYP VA pointer.
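Sketched as a fragment of enter_exception64(), the fix is a single
address-space conversion (surrounding logic elided):
```
	/* Sketch: vcpu->kvm is a kernel VA; convert before dereferencing */
	if (kvm_has_mte(kern_hyp_va(vcpu->kvm)))
		new |= PSR_TCO_BIT;
```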
Fixes: ea7fc1bb1c ("KVM: arm64: Introduce MTE VM feature")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20221027120945.29679-1-ryan.roberts@arm.com
hyp_get_page_state() is used with pKVM to retrieve metadata about a page
by parsing a hypervisor stage-1 PTE. However, it incorrectly uses a
helper which parses *stage-2* mappings. Ouch.
Luckily, pkvm_getstate() only looks at the software bits, which happen
to be in the same place for stage-1 and stage-2 PTEs, and this all ends
up working correctly by accident. But clearly, we should do better.
Fix hyp_get_page_state() to use the correct helper.
Fixes: e82edcc75c ("KVM: arm64: Implement do_share() helper for sharing memory")
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221025145156.855308-1-qperret@google.com
- Fix a bug preventing restoring an ITS containing mappings
for very large and very sparse device topology
- Work around a relocation handling error when compiling
the nVHE object with profile optimisation
Merge tag 'kvmarm-fixes-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #2
- Fix a bug preventing restoring an ITS containing mappings
for very large and very sparse device topology
- Work around a relocation handling error when compiling
the nVHE object with profile optimisation
- Fix for stage-2 invalidation holding the VM MMU lock
for too long by limiting the walk to the largest
block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
Merge tag 'kvmarm-fixes-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #1
- Fix for stage-2 invalidation holding the VM MMU lock
for too long by limiting the walk to the largest
block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
With some PCIe topologies, restoring a guest fails while
parsing the ITS device tables.
Reproducer hints:
1. Create ARM virt VM with pxb-pcie bus which adds
extra host bridges, with qemu command like:
```
-device pxb-pcie,bus_nr=8,id=pci.x,numa_node=0,bus=pcie.0 \
-device pcie-root-port,..,bus=pci.x \
...
-device pxb-pcie,bus_nr=37,id=pci.y,numa_node=1,bus=pcie.0 \
-device pcie-root-port,..,bus=pci.y \
...
```
2. Ensure the guest uses 2-level device table
3. Perform VM migration which calls save/restore device tables
In that setup, we get a big "offset" between 2 device_ids,
which makes the unsigned "len" wrap around to a big positive number,
causing the scan loop to continue with a bad GPA. For example:
1. L1 table has 2 entries;
2. and we are now scanning at L2 table entry index 2075 (pointed
to by L1 first entry)
3. if next device id is 9472, we will get a big offset: 7397;
4. with an unsigned 'len', 'len -= offset * esz' underflows to a big
positive number, mistakenly continuing into the next iteration with a bad GPA;
(It should break out of the current L2 table scanning, and jump
into the next L1 table entry)
5. that bad GPA fails the guest read.
Fix it by stopping the L2 table scan when the next device id is
outside of the current table, allowing the scan to continue from
the next L1 table entry.
Thanks to Eric Auger for the fix suggestion.
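A sketch of the scan-loop guard (local variable names assumed from the
description above):
```
	/* Sketch: stop once the next device id falls outside this table,
	 * instead of letting the unsigned 'len' underflow. */
	byte_offset = next_offset * esz;
	if (byte_offset >= len)
		break;	/* resume from the next L1 table entry */

	id += next_offset;
	gpa += byte_offset;
	len -= byte_offset;
```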
Fixes: 920a7a8fa9 ("KVM: arm64: vgic-its: Add infrastructure for table lookup")
Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Ren <renzhengeek@gmail.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d9c3a564af9e2c5bf63f48a7dcbf08cd593c5c0b.1665802985.git.renzhengeek@gmail.com
Kernel build with clang and KCFLAGS=-fprofile-sample-use=<profile> fails with:
error: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.tmp.o: Unexpected SHT_REL
section ".rel.llvm.call-graph-profile"
Starting from 13.0.0 llvm can generate SHT_REL section, see
https://reviews.llvm.org/rGca3bdb57fa1ac98b711a735de048c12b5fdd8086.
gen-hyprel does not support SHT_REL relocation section.
Filter out profile use flags to fix the build with profile optimization.
Signed-off-by: Denis Nikitin <denik@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221014184532.3153551-1-denik@chromium.org
* Fixes for single-stepping in the presence of an async
exception as well as the preservation of PSTATE.SS
* Better handling of AArch32 ID registers on AArch64-only
systems
* Fixes for the dirty-ring API, allowing it to work on
architectures with relaxed memory ordering
* Advertise the new kvmarm mailing list
* Various minor cleanups and spelling fixes
RISC-V:
* Improved instruction encoding infrastructure for
instructions not yet supported by binutils
* Svinval support for both KVM Host and KVM Guest
* Zihintpause support for KVM Guest
* Zicbom support for KVM Guest
* Record number of signal exits as a VCPU stat
* Use generic guest entry infrastructure
x86:
* Misc PMU fixes and cleanups.
* selftests: fixes for Hyper-V hypercall
* selftests: fix nx_huge_pages_test on TDP-disabled hosts
* selftests: cleanups for fix_hypercall_test
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
"The main batch of ARM + RISC-V changes, and a few fixes and cleanups
for x86 (PMU virtualization and selftests).
ARM:
- Fixes for single-stepping in the presence of an async exception as
well as the preservation of PSTATE.SS
- Better handling of AArch32 ID registers on AArch64-only systems
- Fixes for the dirty-ring API, allowing it to work on architectures
with relaxed memory ordering
- Advertise the new kvmarm mailing list
- Various minor cleanups and spelling fixes
RISC-V:
- Improved instruction encoding infrastructure for instructions not
yet supported by binutils
- Svinval support for both KVM Host and KVM Guest
- Zihintpause support for KVM Guest
- Zicbom support for KVM Guest
- Record number of signal exits as a VCPU stat
- Use generic guest entry infrastructure
x86:
- Misc PMU fixes and cleanups.
- selftests: fixes for Hyper-V hypercall
- selftests: fix nx_huge_pages_test on TDP-disabled hosts
- selftests: cleanups for fix_hypercall_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (57 commits)
riscv: select HAVE_POSIX_CPU_TIMERS_TASK_WORK
RISC-V: KVM: Use generic guest entry infrastructure
RISC-V: KVM: Record number of signal exits as a vCPU stat
RISC-V: KVM: add __init annotation to riscv_kvm_init()
RISC-V: KVM: Expose Zicbom to the guest
RISC-V: KVM: Provide UAPI for Zicbom block size
RISC-V: KVM: Make ISA ext mappings explicit
RISC-V: KVM: Allow Guest use Zihintpause extension
RISC-V: KVM: Allow Guest use Svinval extension
RISC-V: KVM: Use Svinval for local TLB maintenance when available
RISC-V: Probe Svinval extension from ISA string
RISC-V: KVM: Change the SBI specification version to v1.0
riscv: KVM: Apply insn-def to hlv encodings
riscv: KVM: Apply insn-def to hfence encodings
riscv: Introduce support for defining instructions
riscv: Add X register names to gpr-nums
KVM: arm64: Advertise new kvmarm mailing list
kvm: vmx: keep constant definition format consistent
kvm: mmu: fix typos in struct kvm_arch
KVM: selftests: Fix nx_huge_pages_test on TDP-disabled hosts
...
am sending out early due to me travelling next week. There is a
lone mm patch for which Andrew gave an informal ack at
https://lore.kernel.org/linux-mm/20220817102500.440c6d0a3fce296fdf91bea6@linux-foundation.org.
I will send the bulk of ARM work, as well as other
architectures, at the end of next week.
ARM:
* Account stage2 page table allocations in memory stats.
x86:
* Account EPT/NPT arm64 page table allocations in memory stats.
* Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses.
* Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of
Hyper-V now support eVMCS fields associated with features that are
enumerated to the guest.
* Use KVM's sanitized VMCS config as the basis for the values of nested VMX
capabilities MSRs.
* A myriad event/exception fixes and cleanups. Most notably, pending
exceptions morph into VM-Exits earlier, as soon as the exception is
queued, instead of waiting until the next vmentry. This fixed
a longstanding issue where the exceptions would incorrectly become
double-faults instead of triggering a vmexit; the common case of
page-fault vmexits had a special workaround, but now it's fixed
for good.
* A handful of fixes for memory leaks in error paths.
* Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow.
* Never write to memory from non-sleepable kvm_vcpu_check_block()
* Selftests refinements and cleanups.
* Misc typo cleanups.
Generic:
* remove KVM_REQ_UNHALT
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"The first batch of KVM patches, mostly covering x86.
ARM:
- Account stage2 page table allocations in memory stats
x86:
- Account EPT/NPT arm64 page table allocations in memory stats
- Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR
accesses
- Drop eVMCS controls filtering for KVM on Hyper-V, all known
versions of Hyper-V now support eVMCS fields associated with
features that are enumerated to the guest
- Use KVM's sanitized VMCS config as the basis for the values of
nested VMX capabilities MSRs
- A myriad event/exception fixes and cleanups. Most notably, pending
exceptions morph into VM-Exits earlier, as soon as the exception is
queued, instead of waiting until the next vmentry. This fixed a
longstanding issue where the exceptions would incorrectly become
double-faults instead of triggering a vmexit; the common case of
page-fault vmexits had a special workaround, but now it's fixed for
good
- A handful of fixes for memory leaks in error paths
- Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow
- Never write to memory from non-sleepable kvm_vcpu_check_block()
- Selftests refinements and cleanups
- Misc typo cleanups
Generic:
- remove KVM_REQ_UNHALT"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
KVM: remove KVM_REQ_UNHALT
KVM: mips, x86: do not rely on KVM_REQ_UNHALT
KVM: x86: never write to memory from kvm_vcpu_check_block()
KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events
KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending
KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter
KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set
KVM: x86: lapic does not have to process INIT if it is blocked
KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
KVM: x86: make vendor code check for all nested events
mailmap: Update Oliver's email address
KVM: x86: Allow force_emulation_prefix to be written without a reload
KVM: selftests: Add an x86-only test to verify nested exception queueing
KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()
KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
...
For historical reasons, the VHE code inherited the build configuration from
nVHE. Now that those two parts have their own folder and makefile, we can
enable stack protection and branch profiling for VHE.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221004154216.2833636-1-vdonnefort@google.com
Presently stage2_apply_range() works on a batch of memory addressed by a
stage 2 root table entry for the VM. Depending on the IPA limit of the
VM and PAGE_SIZE of the host, this could address a massive range of
memory. Some examples:
4 level, 4K paging -> 512 GB batch size
3 level, 64K paging -> 4TB batch size
Unsurprisingly, working on such a large range of memory can lead to soft
lockups. When running dirty_log_perf_test:
./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48
watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703]
Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd sha3_generic gq(O)
CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G O 6.0.0-smp-DEV #1
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dcache_clean_inval_poc+0x24/0x38
lr : clean_dcache_guest_page+0x28/0x4c
sp : ffff800021763990
pmr_save: 000000e0
x29: ffff800021763990 x28: 0000000000000005 x27: 0000000000000de0
x26: 0000000000000001 x25: 00400830b13bc77f x24: ffffad4f91ead9c0
x23: 0000000000000000 x22: ffff8000082ad9c8 x21: 0000fffafa7bc000
x20: ffffad4f9066ce50 x19: 0000000000000003 x18: ffffad4f92402000
x17: 000000000000011b x16: 000000000000011b x15: 0000000000000124
x14: ffff07ff8301d280 x13: 0000000000000000 x12: 00000000ffffffff
x11: 0000000000010001 x10: fffffc0000000000 x9 : ffffad4f9069e580
x8 : 000000000000000c x7 : 0000000000000000 x6 : 000000000000003f
x5 : ffff07ffa2076980 x4 : 0000000000000001 x3 : 000000000000003f
x2 : 0000000000000040 x1 : ffff0830313bd000 x0 : ffff0830313bcc40
Call trace:
dcache_clean_inval_poc+0x24/0x38
stage2_unmap_walker+0x138/0x1ec
__kvm_pgtable_walk+0x130/0x1d4
__kvm_pgtable_walk+0x170/0x1d4
__kvm_pgtable_walk+0x170/0x1d4
__kvm_pgtable_walk+0x170/0x1d4
kvm_pgtable_stage2_unmap+0xc4/0xf8
kvm_arch_flush_shadow_memslot+0xa4/0x10c
kvm_set_memslot+0xb8/0x454
__kvm_set_memory_region+0x194/0x244
kvm_vm_ioctl_set_memory_region+0x58/0x7c
kvm_vm_ioctl+0x49c/0x560
__arm64_sys_ioctl+0x9c/0xd4
invoke_syscall+0x4c/0x124
el0_svc_common+0xc8/0x194
do_el0_svc+0x38/0xc0
el0_svc+0x2c/0xa4
el0t_64_sync_handler+0x84/0xf0
el0t_64_sync+0x1a0/0x1a4
Use the largest supported block mapping for the configured page size as
the batch granularity. In so doing the walker is guaranteed to visit a
leaf only once.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221007234151.461779-3-oliver.upton@linux.dev
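To make the batching concrete, here is a minimal C sketch of the idea (not
the upstream diff): the walk is chunked at the boundary of the largest block
mapping the configured page size supports, with a resched point between
chunks. largest_block_shift() and the fn callback are illustrative stand-ins.

    /* Illustrative only: cap each batch at the largest supported block
     * mapping so one call never walks more than a single top-level block.
     */
    static u64 stage2_range_addr_end(u64 addr, u64 end)
    {
            u64 batch = 1ULL << largest_block_shift(); /* hypothetical, e.g. 30 for 1GB */
            u64 boundary = (addr + batch) & ~(batch - 1);

            /* overflow-safe comparison, as in pgd_addr_end() */
            return (boundary - 1 < end - 1) ? boundary : end;
    }

    static int stage2_apply_range(u64 addr, u64 end,
                                  int (*fn)(u64 start, u64 end))
    {
            int ret = 0;

            while (addr < end) {
                    u64 next = stage2_range_addr_end(addr, end);

                    ret = fn(addr, next);      /* unmap/flush one batch */
                    if (ret)
                            break;

                    if (next != end)
                            cond_resched();    /* keep the watchdog happy */

                    addr = next;
            }

            return ret;
    }

With 4-level, 4K paging this caps a batch at 1GB rather than 512GB, keeping
each critical section short enough to avoid the soft lockup shown above.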
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- arm64 perf: DDR PMU driver for Alibaba's T-Head Yitian 710 SoC, SVE
vector granule register added to the user regs together with SVE perf
extensions documentation.
- SVE updates: add HWCAP for SVE EBF16, update the SVE ABI
documentation to match the actual kernel behaviour (zeroing the
registers on syscall rather than "zeroed or preserved" previously).
- More conversions to automatic system registers generation.
- vDSO: use self-synchronising virtual counter access in gettimeofday()
if the architecture supports it.
- arm64 stacktrace cleanups and improvements.
- arm64 atomics improvements: always inline assembly, remove LL/SC
trampolines.
- Improve the reporting of EL1 exceptions: rework BTI and FPAC
exception handling, better EL1 undefs reporting.
- Cortex-A510 erratum 2658417: remove BF16 support due to incorrect
result.
- arm64 defconfig updates: build CoreSight as a module, enable options
necessary for docker, memory hotplug/hotremove, enable all PMUs
provided by Arm.
- arm64 ptrace() support for TPIDR2_EL0 (register provided with the SME
extensions).
- arm64 ftrace updates/fixes: fix module PLTs with mcount, remove
unused function.
- kselftest updates for arm64: simple HWCAP validation, FP stress test
improvements, validation of ZA regs in signal handlers, include
larger SVE and SME vector lengths in signal tests, various cleanups.
- arm64 alternatives (code patching) improvements to robustness and
consistency: replace cpucap static branches with equivalent
alternatives, associate callback alternatives with a cpucap.
- Miscellaneous updates: optimise kprobe performance of patching
single-step slots, simplify uaccess_mask_ptr(), move MTE registers
initialisation to C, support huge vmalloc() mappings, run softirqs on
the per-CPU IRQ stack, compat (arm32) misalignment fixups for
multiword accesses.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (126 commits)
arm64: alternatives: Use vdso/bits.h instead of linux/bits.h
arm64/kprobe: Optimize the performance of patching single-step slot
arm64: defconfig: Add Coresight as module
kselftest/arm64: Handle EINTR while reading data from children
kselftest/arm64: Flag fp-stress as exiting when we begin finishing up
kselftest/arm64: Don't repeat termination handler for fp-stress
ARM64: reloc_test: add __init/__exit annotations to module init/exit funcs
arm64/mm: fold check for KFENCE into can_set_direct_map()
arm64: ftrace: fix module PLTs with mcount
arm64: module: Remove unused plt_entry_is_initialized()
arm64: module: Make plt_equals_entry() static
arm64: fix the build with binutils 2.27
kselftest/arm64: Don't enable v8.5 for MTE selftest builds
arm64: uaccess: simplify uaccess_mask_ptr()
arm64: asm/perf_regs.h: Avoid C++-style comment in UAPI header
kselftest/arm64: Fix typo in hwcap check
arm64: mte: move register initialization to C
arm64: mm: handle ARM64_KERNEL_USES_PMD_MAPS in vmemmap_populate()
arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
arm64/sve: Add Perf extensions documentation
...
Merge tag 'kvmarm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for v6.1
- Fixes for single-stepping in the presence of an async
exception as well as the preservation of PSTATE.SS
- Better handling of AArch32 ID registers on AArch64-only
systems
- Fixes for the dirty-ring API, allowing it to work on
architectures with relaxed memory ordering
- Advertise the new kvmarm mailing list
- Various minor cleanups and spelling fixes
* kvm-arm64/misc-6.1:
: .
: Misc KVM/arm64 fixes and improvement for v6.1
:
: - Simplify the affinity check when moving a GICv3 collection
:
: - Tone down the shouting when kvm-arm.mode=protected is passed
: to a guest
:
: - Fix various comments
:
: - Advertise the new kvmarm@lists.linux.dev and deprecate the
: old Columbia list
: .
KVM: arm64: Advertise new kvmarm mailing list
KVM: arm64: Fix comment typo in nvhe/switch.c
KVM: selftests: Update top-of-file comment in psci_test
KVM: arm64: Ignore kvm-arm.mode if !is_hyp_mode_available()
KVM: arm64: vgic: Remove duplicate check in update_affinity_collection()
Signed-off-by: Marc Zyngier <maz@kernel.org>
* for-next/alternatives:
: Alternatives (code patching) improvements
arm64: fix the build with binutils 2.27
arm64: avoid BUILD_BUG_ON() in alternative-macros
arm64: alternatives: add shared NOP callback
arm64: alternatives: add alternative_has_feature_*()
arm64: alternatives: have callbacks take a cap
arm64: alternatives: make alt_region const
arm64: alternatives: hoist print out of __apply_alternatives()
arm64: alternatives: proton-pack: prepare for cap changes
arm64: alternatives: kvm: prepare for cap changes
arm64: cpufeature: make cpus_have_cap() noinstr-safe
Fix the comment of __hyp_vgic_restore_state() to say VHE instead of
VEH, and change the underscore to a dash to match the comment above
__hyp_vgic_save_state().
Signed-off-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220929042839.24277-1-r09922117@csie.ntu.edu.tw
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return
value of kvm_vcpu_block/kvm_vcpu_halt. Remove it.
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20220921003201.1441511-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
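As a hedged illustration of what this buys call sites (the boolean's exact
polarity is simplified here), the pattern changes roughly like this:

    /* Before: block, then poll a dedicated request bit. */
    kvm_vcpu_halt(vcpu);
    if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
            /* the vCPU became runnable: leave the halted state */
    }

    /* After: the same information arrives as the return value,
     * so no request bit can be lost or left stale.
     */
    if (kvm_vcpu_halt(vcpu)) {
            /* the vCPU became runnable: leave the halted state */
    }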
Ignore kvm-arm.mode if !is_hyp_mode_available(). Specifically, we want
to avoid switching kvm_mode to KVM_MODE_PROTECTED if hypervisor mode is
not available. This prevents the "Protected KVM" CPU capability from
being reported when Linux is booting in EL1 and would not have KVM
enabled. We should, however, warn if the command line requests any KVM
mode when KVM isn't actually available. Allow "kvm-arm.mode=none" to
skip the warning, since that would disable KVM anyway.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220920190658.2880184-1-quic_eberman@quicinc.com
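A hedged sketch of the resulting early-param logic (the real parser in
arch/arm64/kvm/arm.c also handles other modes and VHE-specific checks):

    static int __init early_kvm_mode_cfg(char *arg)
    {
            if (!arg)
                    return -EINVAL;

            if (strcmp(arg, "none") == 0) {
                    kvm_mode = KVM_MODE_NONE;  /* disabling KVM: no warning */
                    return 0;
            }

            if (!is_hyp_mode_available()) {
                    pr_warn_once("KVM is not available. Ignoring kvm-arm.mode\n");
                    return 0;                  /* leave kvm_mode untouched */
            }

            if (strcmp(arg, "protected") == 0) {
                    kvm_mode = KVM_MODE_PROTECTED;
                    return 0;
            }

            return -EINVAL;
    }
    early_param("kvm-arm.mode", early_kvm_mode_cfg);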
The 'coll' parameter to update_affinity_collection() is never NULL,
so comparing it with 'ite->collection' is enough to cover both
the NULL case and the "another collection" case.
Remove the duplicate check in update_affinity_collection().
Signed-off-by: Gavin Shan <gshan@redhat.com>
[maz: repainted commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220923065447.323445-1-gshan@redhat.com
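Sketched from the description above, the simplification has this shape:

    /* Before: a redundant NULL test on ite->collection. */
    if (!ite->collection || coll != ite->collection)
            continue;

    /* After: 'coll' is never NULL, so a NULL ite->collection already
     * fails the inequality; one comparison covers both cases.
     */
    if (coll != ite->collection)
            continue;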
With commit 0c24e06119 ("mm: kmemleak: add rbtree and store physical
address for objects allocated with PA"), kmemleak started to put
objects allocated with a physical address onto the object_phys_tree_root
tree. kmemleak_free_part() therefore no longer works as expected on
physically allocated objects (hyp_mem_base in this case), as it
attempts to search and remove things in the object_tree_root tree.
Fix it by using kmemleak_free_part_phys() to unregister hyp_mem_base. This
fixes an immediate crash when booting a KVM host in protected mode with
kmemleak enabled.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220908130659.2021-1-yuzenghui@huawei.com
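Sketched, the fix amounts to switching to the phys-aware unregistration
API (hyp_mem_base/hyp_mem_size as named in the commit text):

    /* Before: looks the object up in object_tree_root by virtual
     * address, which no longer finds physically registered objects.
     *
     * kmemleak_free_part(__va(hyp_mem_base), hyp_mem_size);
     */

    /* After: searches object_phys_tree_root instead. */
    kmemleak_free_part_phys(hyp_mem_base, hyp_mem_size);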
* kvm-arm64/single-step-async-exception:
: .
: Single-step fixes from Reiji Watanabe:
:
: "This series fixes two bugs of single-step execution enabled by
: userspace, and add a test case for KVM_GUESTDBG_SINGLESTEP to
: the debug-exception test to verify the single-step behavior."
: .
KVM: arm64: selftests: Add a test case for KVM_GUESTDBG_SINGLESTEP
KVM: arm64: selftests: Refactor debug-exceptions to make it amenable to new test cases
KVM: arm64: Clear PSTATE.SS when the Software Step state was Active-pending
KVM: arm64: Preserve PSTATE.SS for the guest while single-step is enabled
Signed-off-by: Marc Zyngier <maz@kernel.org>
While userspace enables single-step, if the Software Step state at the
last guest exit was "Active-pending", clear PSTATE.SS on guest entry
to restore the state.
Currently, KVM sets PSTATE.SS to 1 on every guest entry while userspace
enables single-step for the vCPU (with KVM_GUESTDBG_SINGLESTEP).
This means KVM always makes the vCPU's Software Step state
"Active-not-pending" on guest entry, which lets the vCPU perform a
single-step (after which the Software Step exception is taken). This
could cause an extra single-step (without returning to userspace) if
the Software Step state at the last guest exit was "Active-pending",
i.e. the last exit was triggered by an asynchronous exception after the
single-step was performed but before the Software Step exception was
taken. See "Figure D2-3 Software step state machine" and "D2.12.7
Behavior in the active-pending state" in ARM DDI 0487I.a for more
information about this behavior.
Fix this by clearing PSTATE.SS on guest entry if the Software Step
state at the last exit was "Active-pending", so that KVM restores the
state (and the exception is taken before any further single-step is
performed).
Fixes: 337b99bf7e ("KVM: arm64: guest debug, add support for single-step")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220917010600.532642-3-reijiw@google.com
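A hedged sketch of the entry-time decision (the flag tracking the
Active-pending state is an invented name; the real patch plumbs this
through KVM's debug setup):

    /* With KVM_GUESTDBG_SINGLESTEP active, only set PSTATE.SS when the
     * last exit left the step state machine Active-not-pending.
     */
    if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
            if (vcpu->arch.ss_active_pending)         /* hypothetical flag */
                    *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS; /* let the pending step exception fire */
            else
                    *vcpu_cpsr(vcpu) |= DBG_SPSR_SS;  /* allow exactly one more instruction */
    }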
Preserve the PSTATE.SS value for the guest while userspace enables
single-step (i.e. while KVM manipulates the PSTATE.SS) for the vCPU.
Currently, while userspace enables single-step for the vCPU
(with KVM_GUESTDBG_SINGLESTEP), KVM sets PSTATE.SS to 1 on every
guest entry, not saving its original value.
When userspace disables single-step, KVM doesn't restore the original
value for the subsequent guest entry (it uses the current value
instead). Exception return instructions copy PSTATE.SS from SPSR_ELx.SS
only in certain cases when single-step is enabled (and set it to 0 in
other cases). In practice, then, the value only matters when the guest
itself enables single-step (and when the guest's Software Step state
isn't affected by the single-step enabled by userspace).
Fix this by preserving the original PSTATE.SS value while userspace
enables single-step, and restoring the value once it is disabled.
This fix modifies the behavior of GET_ONE_REG/SET_ONE_REG for the
PSTATE.SS while single-step is enabled by userspace.
Presently, GET_ONE_REG/SET_ONE_REG gets/sets the current PSTATE.SS
value, which KVM will override on the next guest entry (i.e. the
value userspace gets/sets is not used for the next guest entry).
With this patch, GET_ONE_REG/SET_ONE_REG will get/set the guest's
preserved value, which KVM will restore once single-step is disabled.
Fixes: 337b99bf7e ("KVM: arm64: guest debug, add support for single-step")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220917010600.532642-2-reijiw@google.com
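A hedged sketch of the save/restore dance (the saved field is an invented
name; the real patch hooks into KVM's debug setup and ONE_REG handling):

    /* When userspace turns single-step on, stash the guest's own
     * PSTATE.SS before KVM starts overriding it.
     */
    if (enabling_singlestep)                           /* illustrative condition */
            vcpu->arch.guest_spsr_ss = *vcpu_cpsr(vcpu) & DBG_SPSR_SS;

    /* When single-step is turned off, put the stashed value back. */
    if (disabling_singlestep) {                        /* illustrative condition */
            *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
            *vcpu_cpsr(vcpu) |= vcpu->arch.guest_spsr_ss;
    }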
Merge arm64/for-next/sysreg in order to avoid upstream conflicts
due to the never-ending sysreg repainting...
Signed-off-by: Marc Zyngier <maz@kernel.org>