Commit Graph

2084 Commits

Author SHA1 Message Date
Will Deacon 634d90cf0a KVM: arm64: Handle FFA_MEM_LEND calls from the host
Handle FFA_MEM_LEND calls from the host by treating them identically to
FFA_MEM_SHARE calls for the purposes of the host stage-2 page-table, but
forwarding on the original request to EL3.
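
As a hedged sketch, assuming a common do_ffa_mem_xfer() helper shared
with the FFA_MEM_SHARE path (the helper name is an assumption, not
necessarily the actual pKVM symbol), the handler boils down to:

  static void do_ffa_mem_lend(struct arm_smccc_res *res,
                              struct kvm_cpu_context *ctxt, u64 func_id)
  {
          /*
           * Same OWNED -> SHARED_OWNED transition in the host stage-2
           * as FFA_MEM_SHARE; only the (LEND) function ID forwarded on
           * to EL3 differs.
           */
          do_ffa_mem_xfer(func_id, res, ctxt);
  }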

Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-9-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:51 +00:00
Will Deacon 0e3bcb49c1 KVM: arm64: Handle FFA_MEM_RECLAIM calls from the host
Intercept FFA_MEM_RECLAIM calls from the host and transition the host
stage-2 page-table entries from the SHARED_OWNED state back to the OWNED
state once EL3 has confirmed that the secure mapping has been reclaimed.
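
The ordering is the important part: EL3 must confirm the reclaim before
the host entries are transitioned back. A minimal sketch, assuming an
__pkvm_host_unshare_ffa() helper along the lines of the earlier patches
in this series (the caller recovers pfn/nr_pages from the tracked
transaction; that lookup is elided here):

  static void do_ffa_mem_reclaim(struct arm_smccc_res *res, u64 handle_lo,
                                 u64 handle_hi, u64 flags,
                                 u64 pfn, u64 nr_pages)
  {
          /* Ask EL3 to tear down the secure mapping first... */
          arm_smccc_1_1_smc(FFA_MEM_RECLAIM, handle_lo, handle_hi, flags,
                            0, 0, 0, 0, res);
          if (res->a0 != FFA_SUCCESS)
                  return;

          /* ...then move the host entries SHARED_OWNED -> OWNED. */
          WARN_ON(__pkvm_host_unshare_ffa(pfn, nr_pages));
  }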

Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-8-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:51 +00:00
Will Deacon 4360900017 KVM: arm64: Handle FFA_MEM_SHARE calls from the host
Intercept FFA_MEM_SHARE/FFA_FN64_MEM_SHARE calls from the host and
transition the host stage-2 page-table entries from the OWNED state to
the SHARED_OWNED state prior to forwarding the call onto EL3.
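
A hedged sketch of the idea, assuming a host-transition helper along
the lines of __pkvm_host_share_ffa() from this series (the forwarded
SMC arguments, such as total and fragment length, are elided):

  static void do_ffa_mem_share(struct arm_smccc_res *res, u64 func_id,
                               u64 pfn, u64 nr_pages)
  {
          /*
           * OWNED -> SHARED_OWNED for the pages named in the transmit
           * buffer's memory descriptor.
           */
          if (__pkvm_host_share_ffa(pfn, nr_pages)) {
                  res->a0 = FFA_ERROR;
                  res->a2 = FFA_RET_INVALID_PARAMETERS;
                  return;
          }

          /* Only now may the original request reach the SPMD at EL3. */
          arm_smccc_1_1_smc(func_id, 0, 0, 0, 0, 0, 0, 0, res);
  }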

Co-developed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-7-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:51 +00:00
Will Deacon f9112eade7 KVM: arm64: Add FF-A helpers to share/unshare memory with secure world
Extend pKVM's memory protection code so that we can update the host's
stage-2 page-table to track pages shared with secure world by the host
using FF-A and prevent those pages from being mapped into a guest.

Co-developed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-6-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:51 +00:00
Will Deacon 9d0c6a9af9 KVM: arm64: Handle FFA_RXTX_MAP and FFA_RXTX_UNMAP calls from the host
Handle FFA_RXTX_MAP and FFA_RXTX_UNMAP calls from the host by sharing
the host's mailbox memory with the hypervisor and establishing a
separate pair of mailboxes between the hypervisor and the SPMD at EL3.
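
Roughly, and with all the descriptor validation elided, the map path
looks like the sketch below (hyp_buffers stands for whatever structure
holds the hypervisor's own pair; treat the names as assumptions):

  static void do_ffa_rxtx_map(struct arm_smccc_res *res,
                              phys_addr_t tx, phys_addr_t rx)
  {
          /* Share the host's mailbox pages with the hypervisor. */
          if (__pkvm_host_share_hyp(tx >> PAGE_SHIFT) ||
              __pkvm_host_share_hyp(rx >> PAGE_SHIFT)) {
                  res->a0 = FFA_ERROR;
                  return;
          }

          /*
           * Register the hypervisor's *own* buffer pair with the SPMD
           * at EL3, never the host's.
           */
          arm_smccc_1_1_smc(FFA_FN64_RXTX_MAP,
                            __hyp_pa(hyp_buffers.tx),
                            __hyp_pa(hyp_buffers.rx),
                            hyp_buffers.pages, 0, 0, 0, 0, res);
  }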

Co-developed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-5-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:51 +00:00
Will Deacon bc3888a0f4 KVM: arm64: Allocate pages for hypervisor FF-A mailboxes
The FF-A proxy code needs to allocate its own buffer pair for
communication with EL3 and for forwarding calls from the host at EL1.

Reserve a couple of pages for this purpose and use them to initialise
the hypervisor's FF-A buffer structure.

Co-developed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-4-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:50 +00:00
Will Deacon 12bdce4f41 KVM: arm64: Probe FF-A version and host/hyp partition ID during init
Probe FF-A during pKVM initialisation so that we can detect any
inconsistencies in the version or partition ID early on.
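
A hedged sketch of what the init-time probe amounts to (return-value
plumbing simplified; FFA_VERSION and FFA_ID_GET are the standard FF-A
ABI calls, the rest is illustrative):

  static u16 hyp_ffa_id;  /* our partition ID, as reported by EL3 */

  int hyp_ffa_init(void)
  {
          struct arm_smccc_res res;

          arm_smccc_1_1_smc(FFA_VERSION, FFA_VERSION_1_0, 0, 0, 0, 0, 0, 0,
                            &res);
          if (res.a0 == FFA_RET_NOT_SUPPORTED)
                  return 0;               /* no FF-A: nothing to proxy */

          if (res.a0 != FFA_VERSION_1_0)
                  return -EOPNOTSUPP;     /* version we don't understand */

          arm_smccc_1_1_smc(FFA_ID_GET, 0, 0, 0, 0, 0, 0, 0, &res);
          if (res.a0 != FFA_SUCCESS)
                  return -EOPNOTSUPP;

          hyp_ffa_id = res.a2;
          return 0;
  }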

Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-3-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:50 +00:00
Will Deacon 048be5fea4 KVM: arm64: Block unsafe FF-A calls from the host
When KVM is initialised in protected mode, we must take care to filter
certain FF-A calls from the host kernel so that the integrity of guest
and hypervisor memory is maintained and their contents are not made
available to the secure world.

As a first step, intercept and block all memory-related FF-A SMC calls
from the host to EL3 and don't advertise any FF-A features. This puts
the framework in place for handling them properly.
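
A minimal sketch of the filter, using the FF-A function IDs from
<linux/arm_ffa.h> (the handler and helper names are illustrative):

  static bool is_blocked_ffa_call(u64 func_id)
  {
          switch (func_id) {
          case FFA_MEM_SHARE:
          case FFA_FN64_MEM_SHARE:
          case FFA_MEM_LEND:
          case FFA_FN64_MEM_LEND:
          case FFA_MEM_RECLAIM:
          case FFA_FEATURES:      /* advertise nothing, for now */
                  return true;
          default:
                  return false;
          }
  }

  bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt)
  {
          u64 func_id = host_ctxt->regs.regs[0];

          if (!is_blocked_ffa_call(func_id))
                  return false;   /* let the host's SMC through */

          /* FF-A error convention: FFA_ERROR in w0, reason in w2. */
          host_ctxt->regs.regs[0] = FFA_ERROR;
          host_ctxt->regs.regs[2] = FFA_RET_NOT_SUPPORTED;
          return true;
  }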

Co-developed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230523101828.7328-2-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01 21:34:50 +00:00
Oliver Upton 40e54cad45 KVM: arm64: Document default vPMU behavior on heterogeneous systems
KVM maintains a mask of supported CPUs when a vPMU type is explicitly
selected by userspace, which is used to reject any attempt to run the
vCPU on an unsupported CPU. This is great, but we're still beholden to the
default behavior where vCPUs can be scheduled anywhere and guest
counters may silently stop working.

Avoid confusing the next poor sod to look at this code and document the
intended behavior.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-3-oliver.upton@linux.dev
2023-05-31 10:29:56 +01:00
Oliver Upton 1c913a1c35 KVM: arm64: Iterate arm_pmus list to probe for default PMU
To date KVM has relied on using a perf event to probe the core PMU at
the time of vPMU initialization. Behind the scenes perf_event_init()
would iteratively walk the PMUs of the system and return the PMU that
could handle the event. However, an upcoming change in perf core will
drop the iterative walk, thereby breaking the fragile dance we do on the
KVM side.

Avoid the problem altogether by iterating over the list of supported
PMUs maintained in KVM, returning the core PMU that matches the CPU
we were called on.
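
The replacement probe is roughly the following sketch (it follows the
shape of the patch; arm_pmus is KVM's list of registered host PMUs and
arm_pmus_lock its mutex):

  static struct arm_pmu *kvm_pmu_probe_armpmu(void)
  {
          struct arm_pmu *tmp, *pmu = NULL;
          struct arm_pmu_entry *entry;
          int cpu;

          mutex_lock(&arm_pmus_lock);

          /* Pick the core PMU that supports the CPU we're running on. */
          cpu = smp_processor_id();
          list_for_each_entry(entry, &arm_pmus, entry) {
                  tmp = entry->arm_pmu;

                  if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) {
                          pmu = tmp;
                          break;
                  }
          }

          mutex_unlock(&arm_pmus_lock);

          return pmu;
  }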

Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-2-oliver.upton@linux.dev
2023-05-31 10:29:56 +01:00
Oliver Upton f6a27d6dc5 KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()
The reference count on page table allocations is increased for every
'counted' PTE (valid or donated) in the table in addition to the initial
reference from ->zalloc_page(). kvm_pgtable_stage2_free_removed() fails
to drop the last reference on the root of the table walk, meaning we
leak memory.

Fix it by dropping the last reference after the free walker returns,
at which point all references for 'counted' PTEs have been released.
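
In sketch form (walk setup elided; page_count() and put_page() are the
kvm_pgtable_mm_ops callbacks, stage2_free_walk() is illustrative):

  void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops,
                                       void *pgtable, u32 level)
  {
          /* Existing free walk: drops one reference per 'counted' PTE. */
          stage2_free_walk(mm_ops, pgtable, level);

          /*
           * The fix: the initial ->zalloc_page() reference on the root
           * is still outstanding once the walker returns; drop it too.
           */
          WARN_ON(mm_ops->page_count(pgtable) != 1);
          mm_ops->put_page(pgtable);
  }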

Cc: stable@vger.kernel.org
Fixes: 5c359cca1f ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530193213.1663411-1-oliver.upton@linux.dev
2023-05-31 08:02:21 +01:00
Mostafa Saleh b53d4a2723 KVM: arm64: Use BTI for nvhe
CONFIG_ARM64_BTI_KERNEL compiles the kernel to support ARMv8.5-BTI.
However, the nvhe code doesn't make use of it, as it doesn't map any
pages with the Guarded Page (GP) bit.

Modify the KVM page-table code to map executable pages with the GP bit
if BTI is enabled for the kernel.

At hyp init, SCTLR_EL2.BT is set to 1 to match EL1 configuration
(SCTLR_EL1.BT1) set in bti_enable().

One difference between the kernel and the nvhe code is that the kernel
maps only .text with GP while nvhe maps all executable pages; this
means the nvhe code needs to deal with special initialization code
coming from other executable sections (.idmap.text). For this we add a
bti instruction at the beginning of __kvm_handle_stub_hvc, as it can be
reached from __host_hvc through a branch instruction (br); unlike
SYM_FUNC_START, SYM_CODE_START doesn't add a bti instruction at the
beginning, and it can't be modified to add one, as it is also used for
vector tables. Another, more intrusive solution would be to convert
__kvm_handle_stub_hvc to a function and inject "bti jc" instead of
"bti c" in SYM_FUNC_START.
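
On the page-table side, the change amounts to something like the sketch
below (attribute handling trimmed; GP is bit 50 of the stage-1
descriptor, and the helper name is illustrative):

  #define KVM_PTE_LEAF_ATTR_HI_S1_GP      BIT(50)

  /* Inside the hyp stage-1 attribute computation: */
  static kvm_pte_t hyp_exec_attrs(enum kvm_pgtable_prot prot)
  {
          kvm_pte_t attr = 0;

          if (!(prot & KVM_PGTABLE_PROT_X))
                  attr |= KVM_PTE_LEAF_ATTR_HI_S1_XN;
          else if (system_supports_bti())
                  attr |= KVM_PTE_LEAF_ATTR_HI_S1_GP;     /* guarded page */

          return attr;
  }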

Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20230530150845.2856828-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-30 16:24:28 +00:00
Akihiko Odaki 811154e234 KVM: arm64: Populate fault info for watchpoint
When handling ESR_ELx_EC_WATCHPT_LOW, the far_el2 member of struct
kvm_vcpu_fault_info is copied to the far member of struct
kvm_debug_exit_arch and exposed to userspace. If the fault info does
not get populated, userspace will see stale values from older faults.

Fixes: 8fb2046180 ("KVM: arm64: Move early handlers to per-EC handlers")
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530024651.10014-1-akihiko.odaki@daynix.com
Cc: stable@vger.kernel.org
2023-05-30 08:39:07 +01:00
Fuad Tabba a9f0e3d5a0 KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
The preorder callback on the kvm_pgtable_stage2_map() path can replace
a table with a block, then recursively free the detached table. The
higher-level walking logic stashes the old page table entry and
then walks the freed table, invoking the leaf callback and
potentially freeing pgtable pages prematurely.

In normal operation, the call to tear down the detached stage-2
is indirected and uses an RCU callback to trigger the freeing.
RCU is not available to pKVM, which is where this bug is
triggered.

Change the behavior of the walker to reload the page table entry
after invoking the walker callback on preorder traversal, as it
does for leaf entries.
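
The core of the change in the walker looks roughly like this (sketch;
surrounding context trimmed):

  if (table && (ctx.flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
          ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_PRE);

          /*
           * Reload the PTE: the callback may have replaced the table
           * with a block and freed the old subtree, in which case we
           * must not descend into stale memory.
           */
          ctx.old = READ_ONCE(*ptep);
          table = kvm_pte_table(ctx.old, level);
  }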

Tested on Pixel 6.

Fixes: 5c359cca1f ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230522103258.402272-1-tabba@google.com
2023-05-24 13:47:12 +01:00
Marc Zyngier d282fa3c5c KVM: arm64: Handle trap of tagged Set/Way CMOs
We appear to have missed the Set/Way CMOs when adding MTE support.
Not that we really expect anyone to use them, but you never know
what stupidity some people can come up with...

Treat these mostly like we deal with the classic S/W CMOs, only
with an additional check that MTE really is enabled.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20230515204601.1270428-3-maz@kernel.org
2023-05-24 13:45:18 +01:00
Will Deacon 09cce60bdd KVM: arm64: Prevent unconditional donation of unmapped regions from the host
Since host stage-2 mappings are created lazily, we cannot rely solely on
the pte in order to recover the target physical address when checking a
host-initiated memory transition as this permits donation of unmapped
regions corresponding to MMIO or "no-map" memory.

Instead of inspecting the pte, move the addr_is_allowed_memory() check
into the host callback function where it is passed the physical address
directly from the walker.

Cc: Quentin Perret <qperret@google.com>
Fixes: e82edcc75c ("KVM: arm64: Implement do_share() helper for sharing memory")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518095844.1178-1-will@kernel.org
2023-05-19 10:20:30 +01:00
Jean-Philippe Brucker 6254873226 KVM: arm64: vgic: Fix a comment
It is host userspace, not the guest, that issues
KVM_DEV_ARM_VGIC_GRP_CTRL.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-5-jean-philippe@linaro.org
2023-05-19 10:20:00 +01:00
Jean-Philippe Brucker c38b8400ae KVM: arm64: vgic: Fix locking comment
It is now config_lock that must be held, not kvm lock. Replace the
comment with a lockdep annotation.
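
The annotation is a one-liner; unlike a comment, it trips lockdep when
violated:

  lockdep_assert_held(&kvm->arch.config_lock);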

Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-4-jean-philippe@linaro.org
2023-05-19 10:20:00 +01:00
Jean-Philippe Brucker 9cf2f840c4 KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
vgic_its_create() changes the vgic state without holding the
config_lock, which triggers a lockdep warning in vgic_v4_init():

[  358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8
...
[  358.707410]  vgic_v4_init+0x15c/0x7a8
[  358.708550]  vgic_its_create+0x37c/0x4a4
[  358.709640]  kvm_vm_ioctl+0x1518/0x2d80
[  358.710688]  __arm64_sys_ioctl+0x7ac/0x1ba8
[  358.711960]  invoke_syscall.constprop.0+0x70/0x1e0
[  358.713245]  do_el0_svc+0xe4/0x2d4
[  358.714289]  el0_svc+0x44/0x8c
[  358.715329]  el0t_64_sync_handler+0xf4/0x120
[  358.716615]  el0t_64_sync+0x190/0x194

Wrap the whole of vgic_its_create() with config_lock since, in addition
to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic
state.
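
In outline, the fix widens the critical section to cover the whole
function; a sketch, with allocation and error handling elided and the
locked helper name purely illustrative:

  static int vgic_its_create(struct kvm_device *dev, u32 type)
  {
          int ret;

          if (type != KVM_DEV_TYPE_ARM_VGIC_ITS)
                  return -ENODEV;

          mutex_lock(&dev->kvm->arch.config_lock);

          /*
           * Allocate the ITS and update kvm->arch.vgic, including the
           * vgic_v4_init() call that used to trip lockdep.
           */
          ret = vgic_its_create_locked(dev);

          mutex_unlock(&dev->kvm->arch.config_lock);
          return ret;
  }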

Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
2023-05-19 10:20:00 +01:00
Jean-Philippe Brucker 59112e9c39 KVM: arm64: vgic: Fix a circular locking issue
Lockdep reports a circular lock dependency between the srcu and the
config_lock:

[  262.179917] -> #1 (&kvm->srcu){.+.+}-{0:0}:
[  262.182010]        __synchronize_srcu+0xb0/0x224
[  262.183422]        synchronize_srcu_expedited+0x24/0x34
[  262.184554]        kvm_io_bus_register_dev+0x324/0x50c
[  262.185650]        vgic_register_redist_iodev+0x254/0x398
[  262.186740]        vgic_v3_set_redist_base+0x3b0/0x724
[  262.188087]        kvm_vgic_addr+0x364/0x600
[  262.189189]        vgic_set_common_attr+0x90/0x544
[  262.190278]        vgic_v3_set_attr+0x74/0x9c
[  262.191432]        kvm_device_ioctl+0x2a0/0x4e4
[  262.192515]        __arm64_sys_ioctl+0x7ac/0x1ba8
[  262.193612]        invoke_syscall.constprop.0+0x70/0x1e0
[  262.195006]        do_el0_svc+0xe4/0x2d4
[  262.195929]        el0_svc+0x44/0x8c
[  262.196917]        el0t_64_sync_handler+0xf4/0x120
[  262.198238]        el0t_64_sync+0x190/0x194
[  262.199224]
[  262.199224] -> #0 (&kvm->arch.config_lock){+.+.}-{3:3}:
[  262.201094]        __lock_acquire+0x2b70/0x626c
[  262.202245]        lock_acquire+0x454/0x778
[  262.203132]        __mutex_lock+0x190/0x8b4
[  262.204023]        mutex_lock_nested+0x24/0x30
[  262.205100]        vgic_mmio_write_v3_misc+0x5c/0x2a0
[  262.206178]        dispatch_mmio_write+0xd8/0x258
[  262.207498]        __kvm_io_bus_write+0x1e0/0x350
[  262.208582]        kvm_io_bus_write+0xe0/0x1cc
[  262.209653]        io_mem_abort+0x2ac/0x6d8
[  262.210569]        kvm_handle_guest_abort+0x9b8/0x1f88
[  262.211937]        handle_exit+0xc4/0x39c
[  262.212971]        kvm_arch_vcpu_ioctl_run+0x90c/0x1c04
[  262.214154]        kvm_vcpu_ioctl+0x450/0x12f8
[  262.215233]        __arm64_sys_ioctl+0x7ac/0x1ba8
[  262.216402]        invoke_syscall.constprop.0+0x70/0x1e0
[  262.217774]        do_el0_svc+0xe4/0x2d4
[  262.218758]        el0_svc+0x44/0x8c
[  262.219941]        el0t_64_sync_handler+0xf4/0x120
[  262.221110]        el0t_64_sync+0x190/0x194

Note that the current report, which can be triggered by the vgic_irq
kselftest, is a triple chain that includes slots_lock, but after
inverting the slots_lock/config_lock dependency, the actual problem
reported above remains.

In several places, the vgic code calls kvm_io_bus_register_dev(), which
synchronizes the srcu, while holding config_lock (#1). And the MMIO
handler takes the config_lock while holding the srcu read lock (#0).

Break dependency #1, by registering the distributor and redistributors
without holding config_lock. The ITS also uses kvm_io_bus_register_dev()
but already relies on slots_lock to serialize calls.

The distributor iodev is created on the first KVM_RUN call. Multiple
threads will race for vgic initialization, and only the first one will
see !vgic_ready() under the lock. To serialize those threads, rely on
slots_lock rather than config_lock.

Redistributors are created earlier, through KVM_DEV_ARM_VGIC_GRP_ADDR
ioctls and vCPU creation. Similarly, serialize the iodev creation with
slots_lock, and the rest with config_lock.

Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-2-jean-philippe@linaro.org
2023-05-19 10:20:00 +01:00
Arnd Bergmann 14c3555f05 arm64: kvm: avoid overflow in integer division
The newly added kvm_mmu_split_nr_page_tables() function uses DIV_ROUND_DOWN_ULL()
to divide 64-bit addresses, but this requires a 32-bit divisor, and PUD_SIZE
may exceed that when 64KB pages are used:

arch/arm64/kvm/mmu.c: In function 'kvm_mmu_split_nr_page_tables':
include/linux/math.h:42:64: error: conversion from 'long unsigned int' to 'u32' {aka 'unsigned int'} changes value from '68719476736' to '0' [-Werror=overflow]
   42 |         DIV_ROUND_DOWN_ULL((unsigned long long)(ll) + (d) - 1, (d))
      |                                                                ^~~
include/linux/math.h:39:47: note: in definition of macro 'DIV_ROUND_DOWN_ULL'
   39 | #define DIV_ROUND_DOWN_ULL(ll, d) div_u64(ll, d)
      |                                               ^
arch/arm64/kvm/mmu.c:95:22: note: in expansion of macro 'DIV_ROUND_UP_ULL'
   95 |                 n += DIV_ROUND_UP_ULL(range, PUD_SIZE);
      |                      ^~~~~~~~~~~~~~~~

Since this code is only used on 64-bit targets, DIV_ROUND_UP() can deal with this
more easily, as it already takes 64-bit arguments.
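
The fix is a one-line macro swap, sketched against the function the
error above points at (since range is u64, DIV_ROUND_UP performs a
64-by-64 division, which is fine on 64-bit targets):

  static int kvm_mmu_split_nr_page_tables(u64 range)
  {
          int n = 0;

          if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2)
                  n += DIV_ROUND_UP(range, PUD_SIZE);     /* was DIV_ROUND_UP_ULL */
          n += DIV_ROUND_UP(range, PMD_SIZE);

          return n;
  }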

Fixes: e7bf7a490c ("KVM: arm64: Split huge pages when dirty logging is enabled")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230517202352.793673-1-arnd@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-18 17:39:22 +00:00
Marc Zyngier a12ab1378a KVM: arm64: Use local TLBI on permission relaxation
Broadcast TLB invalidations (TLBIs) targeting the Inner Shareable
Domain are usually less performant than their non-shareable variant.
In particular, we observed some implementations that take
milliseconds to complete parallel broadcast TLBIs.

It's safe to use non-shareable TLBIs when relaxing permissions on a
PTE in the KVM case.  According to the ARM ARM (0487I.a) section
D8.13.1 "Using break-before-make when updating translation table
entries", permission relaxation does not need break-before-make.
Specifically, R_WHZWS states that these are the only changes that
require a break-before-make sequence: changes of memory type
(Shareability or Cacheability), address changes, or changing the block
size.
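
In code terms, the relax-permissions path can use the non-shareable
flush variant this series introduces, something like the sketch below
(the hyp-call wrapper names follow the series, but treat the plumbing
as illustrative):

  /* Permission relaxation: no break-before-make is required, so a
   * local, non-shareable invalidation is sufficient. */
  kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, mmu, addr, level);

  /* Everything else (type/address/block-size changes) keeps the
   * Inner Shareable broadcast variant. */
  kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);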

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-13-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:19 +00:00
Ricardo Koller 6acf51666d KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
This is the arm64 counterpart of commit cb00a70bd4 ("KVM: x86/mmu:
Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"),
which has the benefit of splitting the cost of splitting a memslot
across multiple ioctls.

Split huge pages on the range specified using KVM_CLEAR_DIRTY_LOG.
And do not split when enabling dirty logging if
KVM_DIRTY_LOG_INITIALLY_SET is set.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-12-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:19 +00:00
Ricardo Koller 3005f6f294 KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
Move the functionality of kvm_mmu_write_protect_pt_masked() into its
caller, kvm_arch_mmu_enable_log_dirty_pt_masked().  This will be used
in a subsequent commit in order to share some of the code in
kvm_arch_mmu_enable_log_dirty_pt_masked().

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-11-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller e7bf7a490c KVM: arm64: Split huge pages when dirty logging is enabled
Split huge pages eagerly when enabling dirty logging. The goal is to
avoid doing it while faulting on write-protected pages, which
negatively impacts guest performance.

A memslot marked for dirty logging is split in 1GB pieces at a time.
This is in order to release the mmu_lock and give other kernel threads
the opportunity to run, and also in order to allocate enough pages to
split a 1GB range worth of huge pages (or a single 1GB huge page).
Note that these page allocations can fail, so eager page splitting is
best-effort.  This is not a correctness issue though, as huge pages
can still be split on write-faults.

Eager page splitting only takes effect when the huge-page mapping
already exists in the stage-2 page table. Otherwise, the huge page will
be mapped as multiple non-huge pages at page-fault time.

The benefits of eager page splitting are the same as in x86, added
with commit a3fe5dbda0 ("KVM: x86/mmu: Split huge pages mapped by
the TDP MMU when dirty logging is enabled"). For example, when running
dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
all of their memory after dirty logging is enabled decreased by 44%
from 2.58s to 1.42s.
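
A hedged sketch of the chunked split loop (field names such as
split_page_cache follow this series but should be treated as
illustrative; error handling trimmed):

  static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
                                      phys_addr_t end)
  {
          struct kvm_mmu_memory_cache *cache = &kvm->arch.mmu.split_page_cache;
          int cache_capacity = kvm_mmu_split_nr_page_tables(SZ_1G);
          phys_addr_t next;
          int ret;

          lockdep_assert_held_write(&kvm->mmu_lock);

          do {
                  /* Best-effort: a failed allocation just means the
                   * pages will be split lazily on write faults. */
                  ret = kvm_mmu_topup_memory_cache(cache, cache_capacity);
                  if (ret)
                          return ret;

                  next = min(end, ALIGN(addr + 1, SZ_1G));
                  ret = kvm_pgtable_stage2_split(kvm->arch.mmu.pgt, addr,
                                                 next - addr, cache);
                  if (ret)
                          return ret;

                  /* Let mmu_lock waiters in between 1GB chunks. */
                  if (next != end)
                          cond_resched_rwlock_write(&kvm->mmu_lock);
          } while (addr = next, addr != end);

          return 0;
  }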

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-10-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller ce2b602238 KVM: arm64: Add kvm_uninit_stage2_mmu()
Add kvm_uninit_stage2_mmu() and move kvm_free_stage2_pgd() into it. A
future commit will add some more things to do inside of
kvm_uninit_stage2_mmu().

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-9-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller 6bd92b9d8b KVM: arm64: Refactor kvm_arch_commit_memory_region()
Refactor kvm_arch_commit_memory_region() in preparation for a future
commit, making it cleaner and more understandable. It also now looks
more like its x86 counterpart (kvm_mmu_slot_apply_flags()).

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-8-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller 8f5a3eb751 KVM: arm64: Add kvm_pgtable_stage2_split()
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
range of huge pages. This will be used for eager-splitting huge pages
into PAGE_SIZE pages. The goal is to avoid having to split huge pages
on write-protection faults, and instead use this function to do it
ahead of time for large ranges (e.g., all guest memory in 1G chunks at
a time).

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-7-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller 2f440b72e8 KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
Add a capability for userspace to specify the eager split chunk size.
The chunk size specifies how many pages to break at a time, using a
single allocation. The bigger the chunk size, the more pages need to
be allocated ahead of time.
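
From userspace, the capability is enabled on the VM fd with
KVM_ENABLE_CAP; a hedged example (the accepted chunk sizes are
constrained to supported block sizes, see
Documentation/virt/kvm/api.rst):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Enable eager splitting with a 2MiB chunk, before dirty logging
   * is turned on for any memslot. */
  int enable_eager_split(int vm_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap  = KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE,
                  .args = { 2UL << 20 },  /* chunk size in bytes */
          };

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }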

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-6-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller e7c05540c6 KVM: arm64: Add helper for creating unlinked stage2 subtrees
Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for
creating unlinked tables (which is the opposite of
kvm_pgtable_stage2_free_unlinked()).  Creating an unlinked table is
useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE
PTEs.  For example, a level 1 entry can be split into PAGE_SIZE PTEs
by first creating a fully populated tree, and then using it to replace
the level 1 entry in a single step.  This will be used in a subsequent
commit for eager huge-page splitting (a dirty-logging optimization).

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-4-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:18 +00:00
Ricardo Koller 02f10845f4 KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI
and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache
maintenance operations (CMO). This will be used by a future commit to
create unlinked tables not accessible to the HW page-table walker.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-3-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:17 +00:00
Ricardo Koller c14d08c5ad KVM: arm64: Rename free_removed to free_unlinked
Normalize on referring to tables outside of an active paging structure
as 'unlinked'.

A subsequent change to KVM will add support for building page tables
that are not part of an active paging structure. The existing
'removed_table' terminology is quite clunky when applied in this
context.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-2-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16 17:39:17 +00:00
Marc Zyngier c3a62df457 Merge branch kvm-arm64/pgtable-fixes-6.4 into kvmarm-master/fixes
* kvm-arm64/pgtable-fixes-6.4:
  : .
  : Fixes for concurrent S2 mapping race from Oliver:
  :
  : "So it appears that there is a race between two parallel stage-2 map
  : walkers that could lead to mapping the incorrect PA for a given IPA, as
  : the IPA -> PA relationship picks up an unintended offset. This series
  : eliminates the problem by using the current IPA of the walk as the
  : source-of-truth regarding where we are in a map operation."
  : .
  KVM: arm64: Constify start/end/phys fields of the pgtable walker data
  KVM: arm64: Infer PA offset from VA in hyp map walker
  KVM: arm64: Infer the PA offset from IPA in stage-2 map walker

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-05-11 15:26:01 +01:00
Marc Zyngier 9a48c597d6 Merge branch kvm-arm64/misc-6.4 into kvmarm-master/fixes
* kvm-arm64/misc-6.4:
  : .
  : Minor changes for 6.4:
  :
  : - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
  :
  : - FP/SVE/SME documentation update, in the hope that this field
  :   becomes clearer...
  :
  : - Add workaround for the usual Apple SEIS brokenness
  :
  : - Random comment fixes
  : .
  KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
  KVM: arm64: Clarify host SME state management
  KVM: arm64: Restructure check for SVE support in FP trap handler
  KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
  KVM: arm64: Fix repeated words in comments
  KVM: arm64: Use the bitmap API to allocate bitmaps
  KVM: arm64: Slightly optimize flush_context()

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-05-11 15:25:58 +01:00
Marc Zyngier e910baa9c1 KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
Unsurprisingly, the M2 PRO is also affected by the SEIS bug, so add it
to the naughty list. And since M2 MAX is likely to be of the same ilk,
flag it as well.

Tested on a M2 PRO mini machine.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230501182141.39770-1-maz@kernel.org
2023-05-11 15:17:02 +01:00
Linus Torvalds c8c655c34e s390:
* More phys_to_virt conversions
 
 * Improvement of AP management for VSIE (nested virtualization)
 
 ARM64:
 
 * Numerous fixes for the pathological lock inversion issue that
   plagued KVM/arm64 since... forever.
 
 * New framework allowing SMCCC-compliant hypercalls to be forwarded
   to userspace, hopefully paving the way for some more features
   being moved to VMMs rather than be implemented in the kernel.
 
 * Large rework of the timer code to allow a VM-wide offset to be
   applied to both virtual and physical counters as well as a
   per-timer, per-vcpu offset that complements the global one.
   This last part allows the NV timer code to be implemented on
   top.
 
 * A small set of fixes to make sure that we don't change anything
   affecting the EL1&0 translation regime just after having
   taken an exception to EL2 until we have executed a DSB. This
   ensures that speculative walks started in EL1&0 have completed.
 
 * The usual selftest fixes and improvements.
 
 KVM x86 changes for 6.4:
 
 * Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)
 
 * Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool
 
 * Move AMD_PSFD to cpufeatures.h and purge KVM's definition
 
 * Avoid unnecessary writes+flushes when the guest is only adding new PTEs
 
 * Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
   when emulating invalidations
 
 * Clean up the range-based flushing APIs
 
 * Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry
 
 * Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()
 
 * Disallow virtualizing legacy LBRs if architectural LBRs are available,
   the two are mutually exclusive in hardware
 
 * Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
   after KVM_RUN, similar to CPUID features
 
 * Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
 
 * Apply PMU filters to emulated events and add test coverage to the
   pmu_event_filter selftest
 
 x86 AMD:
 
 * Add support for virtual NMIs
 
 * Fixes for edge cases related to virtual interrupts
 
 x86 Intel:
 
 * Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
   not being reported due to userspace not opting in via prctl()
 
 * Fix a bug in emulation of ENCLS in compatibility mode
 
 * Allow emulation of NOP and PAUSE for L2
 
 * AMX selftests improvements
 
 * Misc cleanups
 
 MIPS:
 
 * Constify MIPS's internal callbacks (a leftover from the hardware enabling
   rework that landed in 6.3)
 
 Generic:
 
 * Drop unnecessary casts from "void *" throughout kvm_main.c
 
 * Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
   size by 8 bytes on 64-bit kernels by utilizing a padding hole
 
 Documentation:
 
 * Fix goof introduced by the conversion to rST
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
 Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
 WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
 huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
 XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
 j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
 =S2Hq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "s390:

   - More phys_to_virt conversions

   - Improvement of AP management for VSIE (nested virtualization)

  ARM64:

   - Numerous fixes for the pathological lock inversion issue that
     plagued KVM/arm64 since... forever.

   - New framework allowing SMCCC-compliant hypercalls to be forwarded
     to userspace, hopefully paving the way for some more features being
     moved to VMMs rather than be implemented in the kernel.

   - Large rework of the timer code to allow a VM-wide offset to be
     applied to both virtual and physical counters as well as a
     per-timer, per-vcpu offset that complements the global one. This
     last part allows the NV timer code to be implemented on top.

   - A small set of fixes to make sure that we don't change anything
     affecting the EL1&0 translation regime just after having
     taken an exception to EL2 until we have executed a DSB. This
     ensures that speculative walks started in EL1&0 have completed.

   - The usual selftest fixes and improvements.

  x86:

   - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
     enabled, and by giving the guest control of CR0.WP when EPT is
     enabled on VMX (VMX-only because SVM doesn't support per-bit
     controls)

   - Add CR0/CR4 helpers to query single bits, and clean up related code
     where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
     return as a bool

   - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

   - Avoid unnecessary writes+flushes when the guest is only adding new
     PTEs

   - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
     optimizations when emulating invalidations

   - Clean up the range-based flushing APIs

   - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
     single A/D bit using a LOCK AND instead of XCHG, and skip all of
     the "handle changed SPTE" overhead associated with writing the
     entire entry

   - Track the number of "tail" entries in a pte_list_desc to avoid
     having to walk (potentially) all descriptors during insertion and
     deletion, which gets quite expensive if the guest is spamming
     fork()

   - Disallow virtualizing legacy LBRs if architectural LBRs are
     available, the two are mutually exclusive in hardware

   - Disallow writes to immutable feature MSRs (notably
     PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features

   - Overhaul the vmx_pmu_caps selftest to better validate
     PERF_CAPABILITIES

   - Apply PMU filters to emulated events and add test coverage to the
     pmu_event_filter selftest

   - AMD SVM:
       - Add support for virtual NMIs
       - Fixes for edge cases related to virtual interrupts

   - Intel AMX:
       - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
         XTILE_DATA is not being reported due to userspace not opting in
         via prctl()
       - Fix a bug in emulation of ENCLS in compatibility mode
       - Allow emulation of NOP and PAUSE for L2
       - AMX selftests improvements
       - Misc cleanups

  MIPS:

   - Constify MIPS's internal callbacks (a leftover from the hardware
     enabling rework that landed in 6.3)

  Generic:

   - Drop unnecessary casts from "void *" throughout kvm_main.c

   - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
     struct size by 8 bytes on 64-bit kernels by utilizing a padding
     hole

  Documentation:

   - Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
2023-05-01 12:06:20 -07:00
Linus Torvalds 7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also converted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page reclaim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also converted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page reclaim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Paolo Bonzini 4f382a79a6 KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that
   plagued KVM/arm64 since... forever.
 
 - New framework allowing SMCCC-compliant hypercalls to be forwarded
   to userspace, hopefully paving the way for some more features
   being moved to VMMs rather than be implemented in the kernel.
 
 - Large rework of the timer code to allow a VM-wide offset to be
   applied to both virtual and physical counters as well as a
   per-timer, per-vcpu offset that complements the global one.
   This last part allows the NV timer code to be implemented on
   top.
 
 - A small set of fixes to make sure that we don't change anything
   affecting the EL1&0 translation regime just after having
   taken an exception to EL2 until we have executed a DSB. This
   ensures that speculative walks started in EL1&0 have completed.
 
 - The usual selftest fixes and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2
 rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1
 HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL
 vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE
 61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf
 RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je
 +5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e
 0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB
 bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1
 1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT
 8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK
 DHT7RBE0
 =Bhta
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.4

- Numerous fixes for the pathological lock inversion issue that
  plagued KVM/arm64 since... forever.

- New framework allowing SMCCC-compliant hypercalls to be forwarded
  to userspace, hopefully paving the way for some more features
  being moved to VMMs rather than be implemented in the kernel.

- Large rework of the timer code to allow a VM-wide offset to be
  applied to both virtual and physical counters as well as a
  per-timer, per-vcpu offset that complements the global one.
  This last part allows the NV timer code to be implemented on
  top.

- A small set of fixes to make sure that we don't change anything
  affecting the EL1&0 translation regime just after having
  taken an exception to EL2 until we have executed a DSB. This
  ensures that speculative walks started in EL1&0 have completed.

- The usual selftest fixes and improvements.
2023-04-26 15:46:52 -04:00
Linus Torvalds df45da57cb arm64 updates for 6.4
ACPI:
 	* Improve error reporting when failing to manage SDEI on AGDI device
 	  removal
 
 Assembly routines:
 	* Improve register constraints so that the compiler can make use of
 	  the zero register instead of moving an immediate #0 into a GPR
 
 	* Allow the compiler to allocate the registers used for CAS
 	  instructions
 
 CPU features and system registers:
 	* Cleanups to the way in which CPU features are identified from the
 	  ID register fields
 
 	* Extend system register definition generation to handle Enum types
 	  when defining shared register fields
 
 	* Generate definitions for new _EL2 registers and add new fields
 	  for ID_AA64PFR1_EL1
 
 	* Allow SVE to be disabled separately from SME on the kernel
 	  command-line
 
 Tracing:
 	* Support for "direct calls" in ftrace, which enables BPF tracing
 	  for arm64
 
 Kdump:
 	* Don't bother unmapping the crashkernel from the linear mapping,
 	  which then allows us to use huge (block) mappings and reduce
 	  TLB pressure when a crashkernel is loaded.
 
 Memory management:
 	* Try again to remove data cache invalidation from the coherent DMA
 	  allocation path
 
 	* Simplify the fixmap code by mapping at page granularity
 
 	* Allow the kfence pool to be allocated early, preventing the rest
 	  of the linear mapping from being forced to page granularity
 
 Perf and PMU:
 	* Move CPU PMU code out to drivers/perf/ where it can be reused
 	  by the 32-bit ARM architecture when running on ARMv8 CPUs
 
 	* Fix race between CPU PMU probing and pKVM host de-privilege
 
 	* Add support for Apple M2 CPU PMU
 
 	* Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
 	  dynamically, depending on what the CPU actually supports
 
 	* Minor fixes and cleanups to system PMU drivers
 
 Stack tracing:
 	* Use the XPACLRI instruction to strip PAC from pointers, rather
 	  than rolling our own function in C
 
 	* Remove redundant PAC removal for toolchains that handle this in
 	  their builtins
 
 	* Make backtracing more resilient in the face of instrumentation
 
 Miscellaneous:
 	* Fix single-step with KGDB
 
 	* Remove harmless warning when 'nokaslr' is passed on the kernel
 	  command-line
 
 	* Minor fixes and cleanups across the board
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U
 F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2
 AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7
 n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR
 POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4
 +/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ
 =78L9
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "ACPI:

   - Improve error reporting when failing to manage SDEI on AGDI device
     removal

  Assembly routines:

   - Improve register constraints so that the compiler can make use of
     the zero register instead of moving an immediate #0 into a GPR

   - Allow the compiler to allocate the registers used for CAS
     instructions

  CPU features and system registers:

   - Cleanups to the way in which CPU features are identified from the
     ID register fields

   - Extend system register definition generation to handle Enum types
     when defining shared register fields

   - Generate definitions for new _EL2 registers and add new fields for
     ID_AA64PFR1_EL1

   - Allow SVE to be disabled separately from SME on the kernel
     command-line

  Tracing:

   - Support for "direct calls" in ftrace, which enables BPF tracing for
     arm64

  Kdump:

   - Don't bother unmapping the crashkernel from the linear mapping,
     which then allows us to use huge (block) mappings and reduce TLB
     pressure when a crashkernel is loaded.

  Memory management:

   - Try again to remove data cache invalidation from the coherent DMA
     allocation path

   - Simplify the fixmap code by mapping at page granularity

   - Allow the kfence pool to be allocated early, preventing the rest of
     the linear mapping from being forced to page granularity

  Perf and PMU:

   - Move CPU PMU code out to drivers/perf/ where it can be reused by
     the 32-bit ARM architecture when running on ARMv8 CPUs

   - Fix race between CPU PMU probing and pKVM host de-privilege

   - Add support for Apple M2 CPU PMU

   - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
     dynamically, depending on what the CPU actually supports

   - Minor fixes and cleanups to system PMU drivers

  Stack tracing:

   - Use the XPACLRI instruction to strip PAC from pointers, rather than
     rolling our own function in C

   - Remove redundant PAC removal for toolchains that handle this in
     their builtins

   - Make backtracing more resilient in the face of instrumentation

  Miscellaneous:

   - Fix single-step with KGDB

   - Remove harmless warning when 'nokaslr' is passed on the kernel
     command-line

   - Minor fixes and cleanups across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
  KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
  arm64: kexec: include reboot.h
  arm64: delete dead code in this_cpu_set_vectors()
  arm64/cpufeature: Use helper macro to specify ID register for capabilites
  drivers/perf: hisi: add NULL check for name
  drivers/perf: hisi: Remove redundant initialized of pmu->name
  arm64/cpufeature: Consistently use symbolic constants for min_field_value
  arm64/cpufeature: Pull out helper for CPUID register definitions
  arm64/sysreg: Convert HFGITR_EL2 to automatic generation
  ACPI: AGDI: Improve error reporting for problems during .remove()
  arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
  perf/arm-cmn: Fix port detection for CMN-700
  arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
  arm64: move PAC masks to <asm/pointer_auth.h>
  arm64: use XPACLRI to strip PAC
  arm64: avoid redundant PAC stripping in __builtin_return_address()
  arm64/sme: Fix some comments of ARM SME
  arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
  arm64/signal: Use system_supports_tpidr2() to check TPIDR2
  arm64/idreg: Don't disable SME when disabling SVE
  ...
2023-04-25 12:39:01 -07:00
Linus Torvalds 5dfb75e842 RCU Changes for 6.4:
o  MAINTAINERS files additions and changes.
  o  Fix hotplug warning in nohz code.
  o  Tick dependency changes by Zqiang.
  o  Lazy-RCU shrinker fixes by Zqiang.
  o  rcu-tasks stall reporting improvements by Neeraj.
  o  Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
     name for robustness.
  o  Documentation Updates:
  o  Significant changes to srcu_struct size.
  o  Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
  o  rcutorture and RCU-related tools, which are targeted for v6.4 from Boqun's tree.
  o  Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
 eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
 yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
 XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
 +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
 cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
 CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
 ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
 =Rgbj
 -----END PGP SIGNATURE-----

Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

 - Updates and additions to MAINTAINERS files, with Boqun being added to
   the RCU entry and Zqiang being added as an RCU reviewer.

   I have also transitioned from reviewer to maintainer; however, Paul
   will be taking over sending RCU pull-requests for the next merge
   window.

 - Resolution of hotplug warning in nohz code, achieved by fixing
   cpu_is_hotpluggable() through interaction with the nohz subsystem.

   Tick dependency modifications by Zqiang, focusing on fixing usage of
   the TICK_DEP_BIT_RCU_EXP bitmask.

 - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
   kernels, fixed by Zqiang.

 - Improvements to rcu-tasks stall reporting by Neeraj.

 - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
   increased robustness, affecting several components like mac802154,
   drbd, vmw_vmci, tracing, and more.

   A report by Eric Dumazet showed that the API could be unknowingly
   used in an atomic context, so we'd rather make sure they know what
   they're asking for by being explicit:

      https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

 - Documentation updates, including corrections to spelling,
   clarifications in comments, and improvements to the srcu_size_state
   comments.

 - Better srcu_struct cache locality for readers, by adjusting the size
   of srcu_struct in support of SRCU usage by Christoph Hellwig.

 - Teach lockdep to detect deadlocks between srcu_read_lock() vs
   synchronize_srcu() contributed by Boqun.

   Previously lockdep could not detect such deadlocks, now it can.

 - Integration of rcutorture and rcu-related tools, targeted for v6.4
   from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
   module parameter, and more.

 - Miscellaneous changes, various code cleanups and comment improvements.

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ...
2023-04-24 12:16:14 -07:00
Paolo Bonzini 265b97cbc2 KVM/arm64 fixes for 6.3, part #4
- Plug a buffer overflow due to the use of the user-provided register
    width for firmware regs. Outright reject accesses where the
    user register width does not match the kernel representation.
 
  - Protect non-atomic RMW operations on vCPU flags against preemption,
    as an update to the flags by an intervening preemption could be lost.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZEALwAAKCRCivnWIJHzd
 Fq2nAQDPwWpuMOrQYZCM5RF5/8yai5OPKB0NcvK00UDM4f6E1AEAxDud/KT9i15c
 dmBfGn1zDzW4XAENrOkxjQWuBqi/tQg=
 =Lxvf
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.3, part #4

 - Plug a buffer overflow due to the use of the user-provided register
   width for firmware regs. Outright reject accesses where the
   user register width does not match the kernel representation.

 - Protect non-atomic RMW operations on vCPU flags against preemption,
   as an update to the flags by an intervening preemption could be lost.
2023-04-21 19:19:02 -04:00
Mark Brown aaa2f14e6f KVM: arm64: Clarify host SME state management
Normally when running a guest we do not touch the floating point
register state until first use of floating point by the guest, saving
the current state and loading the guest state at that point. This has
been found to offer a performance benefit in common cases. However,
currently, if SME is active when switching to a guest then we exit
streaming mode, disable ZA and invalidate the floating point register
state prior to starting the guest.

The exit from streaming mode is required for correct guest operation: if
we leave streaming mode enabled then many non-SME operations can
generate SME traps (e.g. SVE operations will become streaming SVE
operations). If EL1 leaves CPACR_EL1.SMEN disabled then the host is
unable to intercept these traps, meaning that an SME-unaware guest
will see SME exceptions which will confuse it. Disabling streaming mode
also avoids creating spurious indications of usage of the SME hardware
which could impact system performance, especially with shared SME
implementations. Document the requirement to exit streaming mode
clearly.

PSTATE.ZA causes no issue for guest operation, so we can defer handling
it until first floating point usage; do so if the register state is not
that of the current task and hence has already been saved. We could also
do this for the case where the register state is that of the current
task, but this is very unlikely to happen and would require
disproportionate effort, so continue to save the state eagerly in that
case.

Saving this state on first use would require that we map and unmap
storage for the host version of these registers for use by the
hypervisor, taking care to deal with protected KVM and the fact that the
host can free or reallocate the backing storage. Given that the strong
recommendation is that applications should only keep PSTATE.ZA enabled
when the state it enables is in active use, it is difficult to see a
case where a VMM would wish to do this: it would need to not only be using
SME but also running the guest in the middle of SME usage. This can be
revisited in the future if a use case does arise; in the interim such
tasks will work but experience a performance overhead.

This brings our handling of SME more into line with our handling of
other floating point state and documents more clearly the constraints we
have, especially around streaming mode.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-3-57ba0082e9ff@kernel.org
2023-04-21 13:46:20 +01:00
Mark Brown d071cefdcc KVM: arm64: Restructure check for SVE support in FP trap handler
We share the same handler for general floating point and SVE traps with a
check to make sure we don't handle any SVE traps if the system doesn't
have SVE support. Since we will be adding SME support and wish to handle
it along with other FP-related traps, rewrite the check to be more
scalable and a bit clearer, ensuring we don't misidentify SME traps as
SVE ones.
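
A minimal sketch of the restructured check (the helper and constants are
the generic arm64 ones; the merged code differs in detail):

  /* Only handle the trap if we actually own this trap class. */
  switch (kvm_vcpu_trap_get_class(vcpu)) {
  case ESR_ELx_EC_FP_ASIMD:
          break;
  case ESR_ELx_EC_SVE:
          if (!system_supports_sve())
                  return false;
          break;
  default:
          return false;   /* e.g. an SME trap: not ours to handle */
  }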

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-2-57ba0082e9ff@kernel.org
2023-04-21 13:46:20 +01:00
Mark Brown 4c181e3d35 KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
In kvm_arch_vcpu_load_fp() we unconditionally set the current FP state
to FP_STATE_HOST_OWNED. This will be overridden to FP_STATE_NONE if
TIF_FOREIGN_FPSTATE is set, but the check is deferred until
kvm_arch_vcpu_ctxflush_fp(), where we are no longer preemptible. Add a
comment to this effect to stop people worrying about the apparent lack
of a check and to point at where the check is actually done.
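
The gist of the comment being added, in sketch form (state names as used
in the message above):

  /*
   * FP_STATE_HOST_OWNED set here may be overridden to FP_STATE_NONE in
   * kvm_arch_vcpu_ctxflush_fp() if TIF_FOREIGN_FPSTATE is set; that
   * check has to wait until preemption is disabled.
   */
  vcpu->arch.fp_state = FP_STATE_HOST_OWNED;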

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-1-57ba0082e9ff@kernel.org
2023-04-21 13:46:20 +01:00
Jingyu Wang 3d17935628 KVM: arm64: Fix repeated words in comments
Delete the redundant word 'to'.

Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230309075919.169518-1-jingyuwang_vip@163.com
2023-04-21 13:46:04 +01:00
Marc Zyngier 1ea244158a KVM: arm64: Constify start/end/phys fields of the pgtable walker data
As we are revamping the way the pgtable walker evaluates some of the
data, make it clear that we rely on some of the fields to be constant
across the lifetime of a walk.

For this, flag the start, end and phys fields of the walk data as
'const', which will generate an error if we were to accidentally
update these fields again.
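
A sketch of the resulting shape (layout abbreviated; the header is
authoritative):

  struct kvm_pgtable_walk_data {
          struct kvm_pgtable_walker       *walker;

          const u64                       start;  /* fixed for the walk */
          u64                             addr;   /* the only moving part */
          const u64                       end;    /* fixed for the walk */
  };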

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 10:51:40 +01:00
Oliver Upton 39bc95be37 KVM: arm64: Infer PA offset from VA in hyp map walker
Similar to the recently fixed stage-2 walker, the hyp map walker
increments the PA and VA of a walk separately. Unlike stage-2, there is
no bug here as the map walker has exclusive access to the stage-1 page
tables.

Nonetheless, in the interest of continuity throughout the page table
code, tweak the hyp map walker to avoid incrementing the PA and instead
use the VA as the authoritative source of how far along a table walk has
gotten. Calculate the PA to use for a leaf PTE by adding the offset of
the VA from the start of the walk to the starting PA.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-3-oliver.upton@linux.dev
2023-04-21 10:51:40 +01:00
Oliver Upton 1f0f4a2ef7 KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Until now, the page table walker counted increments to the PA and IPA
of a walk in two separate places. While the PA is incremented as soon as
a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is
actually bumped in the generic table walker context. Critically,
__kvm_pgtable_visit() rereads the PTE after the LEAF callback returns
to work out if a table or leaf was installed, and only bumps the IPA for
a leaf PTE.

This arrangement worked fine when we handled faults behind the write lock,
as the walker had exclusive access to the stage-2 page tables. However,
commit 1577cb5823 ("KVM: arm64: Handle stage-2 faults in parallel")
started handling all stage-2 faults behind the read lock, opening up a
race where a walker could increment the PA but not the IPA of a walk.
Nothing good ensues, as the walker starts mapping with the incorrect
IPA -> PA relationship.

For example, assume that two vCPUs took a data abort on the same IPA.
One observes that dirty logging is disabled, and the other observes that
it is enabled:

  vCPU attempting PMD mapping		  vCPU attempting PTE mapping
  ======================================  =====================================
  /* install PMD */
  stage2_make_pte(ctx, leaf);
  data->phys += granule;
  					  /* replace PMD with a table */
  					  stage2_try_break_pte(ctx, data->mmu);
					  stage2_make_pte(ctx, table);
  /* table is observed */
  ctx.old = READ_ONCE(*ptep);
  table = kvm_pte_table(ctx.old, level);

  /*
   * map walk continues w/o incrementing
   * IPA.
   */
   __kvm_pgtable_walk(..., level + 1);

Bring an end to the whole mess by using the IPA as the single source of
truth for how far along a walk has gotten. Work out the correct PA to
map by calculating the IPA offset from the beginning of the walk and add
that to the starting physical address.
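
A minimal sketch of that calculation (close to, but not necessarily
identical to, the merged helper):

  static u64 stage2_map_walker_phys_addr(const struct kvm_pgtable_visit_ctx *ctx,
                                         const struct stage2_map_data *data)
  {
          /* The IPA covered so far dictates how far the PA has advanced. */
          return data->phys + (ctx->addr - ctx->start);
  }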

Cc: stable@vger.kernel.org
Fixes: 1577cb5823 ("KVM: arm64: Handle stage-2 faults in parallel")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
2023-04-21 10:51:40 +01:00
Marc Zyngier 36fe1b29b3 Merge branch kvm-arm64/spec-ptw into kvmarm-master/next
* kvm-arm64/spec-ptw:
  : .
  : On taking an exception from EL1&0 to EL2(&0), the page table walker is
  : allowed to carry on with speculative walks started from EL1&0 while
  : running at EL2 (see R_LFHQG). Given that the PTW may be actively using
  : the EL1&0 system registers, the only safe way to deal with it is to
  : issue a DSB before changing any of it.
  :
  : We already did the right thing for SPE and TRBE, but ignored the PTW
  : for unknown reasons (probably because the architecture wasn't crystal
  : clear at the time).
  :
  : This requires a bit of surgery in the nvhe code, though most of these
  : patches are comments so that my future self can understand the purpose
  : of these barriers. The VHE code is largely unaffected, thanks to the
  : DSB in the context switch.
  : .
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:44:58 +01:00
Marc Zyngier 6dcf7316e0 Merge branch kvm-arm64/smccc-filtering into kvmarm-master/next
* kvm-arm64/smccc-filtering:
  : .
  : SMCCC call filtering and forwarding to userspace, courtesy of
  : Oliver Upton. From the cover letter:
  :
  : "The Arm SMCCC is rather prescriptive in regards to the allocation of
  : SMCCC function ID ranges. Many of the hypercall ranges have an
  : associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
  : room for vendor-specific implementations.
  :
  : The ever-expanding SMCCC surface leaves a lot of work within KVM for
  : providing new features. Furthermore, KVM implements its own
  : vendor-specific ABI, with little room for other implementations (like
  : Hyper-V, for example). Rather than cramming it all into the kernel we
  : should provide a way for userspace to handle hypercalls."
  : .
  KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
  KVM: arm64: Test that SMC64 arch calls are reserved
  KVM: arm64: Prevent userspace from handling SMC64 arch range
  KVM: arm64: Expose SMC/HVC width to userspace
  KVM: selftests: Add test for SMCCC filter
  KVM: selftests: Add a helper for SMCCC calls with SMC instruction
  KVM: arm64: Let errors from SMCCC emulation reach userspace
  KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
  KVM: arm64: Introduce support for userspace SMCCC filtering
  KVM: arm64: Add support for KVM_EXIT_HYPERCALL
  KVM: arm64: Use a maple tree to represent the SMCCC filter
  KVM: arm64: Refactor hvc filtering to support different actions
  KVM: arm64: Start handling SMCs from EL1
  KVM: arm64: Rename SMC/HVC call handler to reflect reality
  KVM: arm64: Add vm fd device attribute accessors
  KVM: arm64: Add a helper to check if a VM has run once
  KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:44:32 +01:00
Marc Zyngier b22498c484 Merge branch kvm-arm64/timer-vm-offsets into kvmarm-master/next
* kvm-arm64/timer-vm-offsets: (21 commits)
  : .
  : This series aims at satisfying multiple goals:
  :
  : - allow a VMM to atomically restore a timer offset for a whole VM
  :   instead of updating the offset each time a vcpu gets its counter
  :   written
  :
  : - allow a VMM to save/restore the physical timer context, something
  :   that we cannot do at the moment due to the lack of offsetting
  :
  : - provide a framework that is suitable for NV support, where we get
  :   both global and per timer, per vcpu offsetting, and manage
  :   interrupts in a less braindead way.
  :
  : Conflict resolution involves using the new per-vcpu config lock instead
  : of the home-grown timer lock.
  : .
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: selftests: Augment existing timer test to handle variable offset
  KVM: arm64: selftests: Deal with spurious timer interrupts
  KVM: arm64: selftests: Add physical timer registers to the sysreg list
  KVM: arm64: nv: timers: Support hyp timer emulation
  KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
  KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
  KVM: arm64: timers: Abstract the number of valid timers per vcpu
  KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
  KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
  KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
  KVM: arm64: timers: Abstract per-timer IRQ access
  KVM: arm64: timers: Rationalise per-vcpu timer init
  KVM: arm64: timers: Allow save/restoring of the physical timer
  KVM: arm64: timers: Allow userspace to set the global counter offset
  KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
  KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
  KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
  arm64: Add HAS_ECV_CNTPOFF capability
  arm64: Add CNTPOFF_EL2 register definition
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:36:40 +01:00
Will Deacon 87727ba2bb KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
Although pKVM has supported CPU PMU emulation for non-protected guests since
722625c6f4 ("KVM: arm64: Reenable pmu in Protected Mode"), this relies
on the PMU driver probing before the host has de-privileged so that the
'kvm_arm_pmu_available' static key can still be enabled by patching the
hypervisor text.

As it happens, both of these events hang off device_initcall() but the
PMU consistently won the race until 7755cec63a ("arm64: perf: Move
PMUv3 driver to drivers/perf"). Since then, the host will fail to boot
when pKVM is enabled:

  | hw perfevents: enabled with armv8_pmuv3_0 PMU driver, 7 counters available
  | kvm [1]: nVHE hyp BUG at: [<ffff8000090366e0>] __kvm_nvhe_handle_host_mem_abort+0x270/0x284!
  | kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
  | kvm [1]: Hyp Offset: 0xfffea41fbdf70000
  | Kernel panic - not syncing: HYP panic:
  | PS:a00003c9 PC:0000dbe04b0c66e0 ESR:00000000f2000800
  | FAR:fffffbfffddfcf00 HPFAR:00000000010b0bf0 PAR:0000000000000000
  | VCPU:0000000000000000
  | CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc7-00083-g0bce6746d154 #1
  | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
  | Call trace:
  |  dump_backtrace+0xec/0x108
  |  show_stack+0x18/0x2c
  |  dump_stack_lvl+0x50/0x68
  |  dump_stack+0x18/0x24
  |  panic+0x13c/0x33c
  |  nvhe_hyp_panic_handler+0x10c/0x190
  |  aarch64_insn_patch_text_nosync+0x64/0xc8
  |  arch_jump_label_transform+0x4c/0x5c
  |  __jump_label_update+0x84/0xfc
  |  jump_label_update+0x100/0x134
  |  static_key_enable_cpuslocked+0x68/0xac
  |  static_key_enable+0x20/0x34
  |  kvm_host_pmu_init+0x88/0xa4
  |  armpmu_register+0xf0/0xf4
  |  arm_pmu_acpi_probe+0x2ec/0x368
  |  armv8_pmu_driver_init+0x38/0x44
  |  do_one_initcall+0xcc/0x240

Fix the race properly by deferring the de-privilege step to
device_initcall_sync(). This will also be needed in future when probing
IOMMU devices and allows us to separate the pKVM de-privilege logic from
the core hypervisor initialisation path.
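
The shape of the fix, as a sketch (finalize_pkvm() being the helper the
de-privilege step hangs off):

  /* Runs after every plain device_initcall(), hence after the PMU probe. */
  device_initcall_sync(finalize_pkvm);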

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Fixes: 7755cec63a ("arm64: perf: Move PMUv3 driver to drivers/perf")
Tested-by: Fuad Tabba <tabba@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230420123356.2708-1-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2023-04-20 16:57:53 +01:00
Christophe JAILLET a00e9e4319 KVM: arm64: Use the bitmap API to allocate bitmaps
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.

It is less verbose and improves the semantics.
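
A before/after sketch (illustrative, not the exact diff):

  /* Hand-rolled allocation and free: */
  unsigned long *map = kcalloc(BITS_TO_LONGS(nbits), sizeof(long),
                               GFP_KERNEL);
  kfree(map);

  /* Bitmap API equivalent: */
  unsigned long *map = bitmap_zalloc(nbits, GFP_KERNEL);
  bitmap_free(map);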

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/3c5043731db4d3635383e9326bc7e98e25de3288.1681854412.git.christophe.jaillet@wanadoo.fr
2023-04-20 09:10:28 +01:00
Christophe JAILLET 4be8ddb48b KVM: arm64: Slightly optimize flush_context()
bitmap_zero() is faster than bitmap_clear(), so use it to save a few
cycles.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/97bf2743f3a302b3066aced02218b9da60690dd3.1681854412.git.christophe.jaillet@wanadoo.fr
2023-04-20 09:10:28 +01:00
Reiji Watanabe a189884bdc KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
All accesses to kvm_vcpu_arch::mp_state should use {READ,WRITE}_ONCE(),
since readers of the mp_state don't acquire the mp_state_lock.
Nonetheless, kvm_psci_vcpu_on() updates the mp_state without using
WRITE_ONCE(). Fix the code to update the mp_state using WRITE_ONCE().
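
The pattern, as a sketch (field names per kvm_vcpu_arch):

  /* Writer: */
  WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_RUNNABLE);

  /* Lockless readers pair with: */
  u32 state = READ_ONCE(vcpu->arch.mp_state.mp_state);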

Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-3-reijiw@google.com
2023-04-20 09:06:20 +01:00
Reiji Watanabe 4ff910be01 KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
kvm_arch_vcpu_ioctl_vcpu_init() doesn't acquire mp_state_lock
when setting the mp_state to KVM_MP_STATE_RUNNABLE. Fix the
code to acquire the lock.

Signed-off-by: Reiji Watanabe <reijiw@google.com>
[maz: minor refactor]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-2-reijiw@google.com
2023-04-20 09:06:02 +01:00
Dan Carpenter a25bc8486f KVM: arm64: Fix buffer overflow in kvm_arm_set_fw_reg()
The KVM_REG_SIZE() value comes from the ioctl and can be a power of two
in the 0-32768 range, but if it is larger than sizeof(long) this will
corrupt memory.
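
A sketch of the guard this calls for (the merged fix rejects the access
with -ENOENT):

  u64 val;

  if (KVM_REG_SIZE(reg->id) != sizeof(val))
          return -ENOENT; /* width doesn't match the kernel representation */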

Fixes: 99adb56763 ("KVM: arm/arm64: Add save/restore support for firmware workaround state")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/4efbab8c-640f-43b2-8ac6-6d68e08280fe@kili.mountain
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-19 15:22:37 +00:00
Marc Zyngier bcf3e7da3a KVM: arm64: vhe: Drop extra isb() on guest exit
__kvm_vcpu_run_vhe() ends on VHE with an isb(). However, this
function is only reachable via kvm_call_hyp_ret(), which already
contains an isb() in order to mimic the behaviour of nVHE and
provide a context synchronisation event.

We thus have two isb()s back to back, which is one too many.
Drop the first one and solely rely on the one in the helper.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier 1ff2755d68 KVM: arm64: vhe: Synchronise with page table walker on MMU update
Contrary to nVHE, VHE is a lot easier when it comes to dealing
with speculative page table walks started at EL1. As we only change
EL1&0 translation regime when context-switching, we already benefit
from the effect of the DSB that sits in the context switch code.

We only need to take care of it in the NV case, where we can
flip between two EL1 contexts (one of them being the virtual
EL2) without a context switch.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier 8442d65373 KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
We rely on the presence of a DSB at the end of kvm_flush_dcache_to_poc()
that, on top of ensuring completion of the cache clean, also covers
the speculative page table walk started from EL1.

Document this dependency.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier 7e1b2329c2 KVM: arm64: nvhe: Synchronise with page table walker on TLBI
A TLBI from EL2 impacting EL1 involves messing with the EL1&0
translation regime, and the page table walker may still be
performing speculative walks.

Piggyback on the existing DSBs to always have a DSB ISH that
will synchronise all load/store operations that the PTW may
still have in flight.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-14 08:23:29 +01:00
Marc Zyngier a6610435ac KVM: arm64: Handle 32bit CNTPCTSS traps
When CNTPOFF isn't implemented and we have a non-zero counter
offset, CNTPCT and CNTPCTSS are trapped. We properly handle the
former, but not the latter, as it is not present in the sysreg
table (despite being actually handled in the code). Bummer.

Just populate the cp15_64 table with the missing register.

Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-13 14:23:42 +01:00
Marc Zyngier 55b5bac159 KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
When taking an exception between the EL1&0 translation regime and
the EL2 translation regime, the page table walker is allowed to
complete the walks started from EL0 or EL1 while running at EL2.

It means that altering the system registers that define the EL1&0
translation regime is fraught with danger *unless* we wait for
the completion of such walk with a DSB (R_LFHQG and subsequent
statements in the ARM ARM). We already did the right thing for
other external agents (SPE, TRBE), but not the PTW.

Rework the existing SPE/TRBE synchronisation to include the PTW,
and add the missing DSB on guest exit.
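
An illustrative shape of the synchronisation on the exit path (not the
exact diff):

  /* About to repaint the EL1&0 translation regime for the host... */
  dsb(nsh);       /* ...so wait out speculative PTWs started by the guest */
  __sysreg_restore_state_nvhe(host_ctxt);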

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-13 08:38:53 +01:00
Oliver Upton 49e5d16b6f KVM: arm64: vgic: Don't acquire its_lock before config_lock
commit f003277311 ("KVM: arm64: Use config_lock to protect vgic
state") was meant to rectify a longstanding lock ordering issue in KVM
where the kvm->lock is taken while holding vcpu->mutex. As it so
happens, the aforementioned commit introduced yet another locking issue
by acquiring the its_lock before acquiring the config lock.

This is obviously wrong, especially considering that the lock ordering
is well documented in vgic.c. Reshuffle the locks once more to take the
config_lock before the its_lock. While at it, sprinkle in the lockdep
hinting that has become popular as of late to keep lockdep apprised of
our ordering.

Cc: stable@vger.kernel.org
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230412062733.988229-1-oliver.upton@linux.dev
2023-04-12 13:50:18 +01:00
Oliver Upton 5a23ad6510 KVM: arm64: Prevent userspace from handling SMC64 arch range
Though presently unused, there is an SMC64 view of the Arm architecture
calls defined by the SMCCC. The documentation of the SMCCC filter states
that the SMC64 range is reserved, but nothing actually prevents
userspace from applying a filter to the range.

Insert a range with the HANDLE action for the SMC64 arch range, thereby
preventing userspace from imposing filtering/forwarding on it.

Fixes: fb88707dd3 ("KVM: arm64: Use a maple tree to represent the SMCCC filter")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-2-oliver.upton@linux.dev
2023-04-08 15:22:55 +01:00
Paolo Bonzini 0bf9601f8e KVM/arm64 fixes for 6.3, part #3
- Ensure the guest PMU context is restored before the first KVM_RUN,
    fixing an issue where EL0 event counting is broken after vCPU
    save/restore
 
  - Actually initialize ID_AA64PFR0_EL1.{CSV2,CSV3} based on the
    sanitized, system-wide values for protected VMs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZC2ZbwAKCRCivnWIJHzd
 FvovAP9aooqUBBs8w2myh8SXCv7dJOg88r6fKS5vCqMdkY7OhwD9GXA7hWU2dXdy
 X1L8qq6C7R+GtIY/kDm9E2HkNKg7rQc=
 =nfXd
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.3, part #3

 - Ensure the guest PMU context is restored before the first KVM_RUN,
   fixing an issue where EL0 event counting is broken after vCPU
   save/restore

 - Actually initialize ID_AA64PFR0_EL1.{CSV2,CSV3} based on the
   sanitized, system-wide values for protected VMs
2023-04-06 13:34:19 -04:00
Kirill A. Shutemov 23baf831a3 mm, treewide: redefine MAX_ORDER sanely
MAX_ORDER is currently defined as the number of orders the page
allocator supports: the user can ask the buddy allocator for a page
order between 0 and MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders
the user can ask of the buddy allocator is now 0..MAX_ORDER.
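
An illustrative loop, before and after (do_something() standing in for
whatever the loop body does):

  /* Old, exclusive definition: valid orders are 0 .. MAX_ORDER-1. */
  for (order = 0; order < MAX_ORDER; order++)
          do_something(order);

  /* New, inclusive definition: MAX_ORDER itself is a valid order. */
  for (order = 0; order <= MAX_ORDER; order++)
          do_something(order);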

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:46 -07:00
Marc Zyngier 0e5c9a9d65 KVM: arm64: Expose SMC/HVC width to userspace
When returning to userspace to handle a SMCCC call, we consistently
set PC to point to the instruction immediately after the HVC/SMC.

However, should userspace need to know the exact address of the
trapping instruction, it needs to know about the *size* of that
instruction. For AArch64, this is pretty easy. For AArch32, this
is a bit more funky, as Thumb has 16bit encodings for both HVC
and SMC.

Expose this to userspace with a new flag that directly derives
from ESR_EL2.IL. Also update the documentation to reflect the PC
state at the point of exit.

Finally, this fixes a small buglet where the hypercall.{args,ret}
fields would not be cleared on exit, and could contain some
random junk.
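
A sketch of how the flag is derived (flag name per the UAPI this adds;
the trap-width helper is the generic one):

  /* ESR_EL2.IL == 0 means the trapping instruction was 16-bit Thumb. */
  if (!kvm_vcpu_trap_il_is32bit(vcpu))
          run->hypercall.flags |= KVM_HYPERCALL_EXIT_16BIT;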

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86pm8iv8tj.wl-maz@kernel.org
2023-04-05 19:20:23 +01:00
Paul E. McKenney 79cf833be6 kvm: Remove "select SRCU"
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it.  Therefore, remove the "select SRCU"
Kconfig statements from the various KVM Kconfig files.

Acked-by: Sean Christopherson <seanjc@google.com> (x86)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <kvm@vger.kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org> (arm64)
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Anup Patel <anup@brainfault.org> (riscv)
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:42 +00:00
Oliver Upton 37c8e49479 KVM: arm64: Let errors from SMCCC emulation reach userspace
Typically a negative return from an exit handler is used to request a
return to userspace with the specified error. KVM's handling of SMCCC
emulation (i.e. both HVCs and SMCs) deviates from the trend and resumes
the guest instead.

Stop handling negative returns this way and instead let the error
percolate to userspace.

Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-12-oliver.upton@linux.dev
2023-04-05 12:07:42 +01:00
Oliver Upton 7e484d2785 KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
A subsequent change to KVM will allow negative returns from SMCCC
handlers to exit to userspace. Make way for this change by explicitly
returning SMCCC_RET_NOT_SUPPORTED to the guest if the VM is configured
to use an unknown PSCI version. Add a WARN since this is undoubtedly a
KVM bug.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-11-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton 821d935c87 KVM: arm64: Introduce support for userspace SMCCC filtering
As the SMCCC (and related specifications) march towards an 'everything
and the kitchen sink' interface for interacting with a system it becomes
less likely that KVM will support every related feature. We could do
better by letting userspace have a crack at it instead.

Allow userspace to define an 'SMCCC filter' that applies to both HVCs
and SMCs initiated by the guest. Supporting both conduits with this
interface is important for a couple of reasons. Guest SMC usage is table
stakes for a nested guest, as HVCs are always taken to the virtual EL2.
Additionally, guests may want to interact with a service on the secure
side which can now be proxied by userspace.
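
The UAPI ends up looking roughly like this (a sketch; the merged header
is authoritative):

  struct kvm_smccc_filter {
          __u32 base;             /* first function ID in the range */
          __u32 nr_functions;     /* size of the range */
          __u8  action;           /* KVM_SMCCC_FILTER_{HANDLE,DENY,FWD_TO_USER} */
          __u8  pad[15];
  };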

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-10-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton d824dff191 KVM: arm64: Add support for KVM_EXIT_HYPERCALL
In anticipation of user hypercall filters, add the necessary plumbing to
get SMCCC calls out to userspace. Even though the exit structure has
space for KVM to pass register arguments, let's just avoid it altogether
and let userspace poke at the registers via KVM_GET_ONE_REG.

This deliberately stretches the definition of a 'hypercall' to cover
SMCs from EL1 in addition to the HVCs we know and love. KVM doesn't
support EL1 calls into secure services, but now we can paint that as a
userspace problem and be done with it.

Finally, we need a flag to let userspace know what conduit instruction
was used (i.e. SMC vs. HVC).

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-9-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton fb88707dd3 KVM: arm64: Use a maple tree to represent the SMCCC filter
Maple tree is an efficient B-tree implementation that is intended for
storing non-overlapping intervals. Such a data structure is a good fit
for the SMCCC filter as it is desirable to sparsely allocate the 32 bit
function ID space.

To that end, add a maple tree to kvm_arch and correctly init/teardown
along with the VM. Wire in a test against the hypercall filter for HVCs
which does nothing until the controls are exposed to userspace.
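
A sketch of the lookup this enables (using the maple tree API; an absent
entry means "let KVM handle the call"):

  entry = mtree_load(&kvm->arch.smccc_filter, func_id);
  action = entry ? xa_to_value(entry) : KVM_SMCCC_FILTER_HANDLE;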

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-8-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton a8308b3fc9 KVM: arm64: Refactor hvc filtering to support different actions
KVM presently allows userspace to filter guest hypercalls with bitmaps
expressed via pseudo-firmware registers. These bitmaps have a narrow
scope and, of course, can only allow/deny a particular call. A
subsequent change to KVM will introduce a generalized UAPI for filtering
hypercalls, allowing functions to be forwarded to userspace.

Refactor the existing hypercall filtering logic to make room for more
than two actions. While at it, generalize the function names around
SMCCC as it is the basis for the upcoming UAPI.

No functional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-7-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton c2d2e9b3d8 KVM: arm64: Start handling SMCs from EL1
Whelp, the architecture gods have spoken and confirmed that the function
ID space is common between SMCs and HVCs. Not only that, the expectation
is that hypervisors handle calls to both SMC and HVC conduits. KVM
recently picked up support for SMCCCs in commit bd36b1a9eb ("KVM:
arm64: nv: Handle SMCs taken from virtual EL2") but scoped it only to a
nested hypervisor.

Let's just open the floodgates and let EL1 access our SMCCC
implementation with the SMC instruction as well.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-6-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton aac9496812 KVM: arm64: Rename SMC/HVC call handler to reflect reality
KVM handles SMCCC calls from virtual EL2 that use the SMC instruction
since commit bd36b1a9eb ("KVM: arm64: nv: Handle SMCs taken from
virtual EL2"). Thus, the function name of the handler no longer reflects
reality.

Normalize the name on SMCCC, since that's the only hypercall interface
KVM supports in the first place. No functional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-5-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton e0fc6b2161 KVM: arm64: Add vm fd device attribute accessors
A subsequent change will allow userspace to convey a filter for
hypercalls through a vm device attribute. Add the requisite boilerplate
for vm attribute accessors.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-4-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton de40bb8abb KVM: arm64: Add a helper to check if a VM has run once
The test_bit(...) pattern is quite a lot of keystrokes. Replace
existing callsites with a helper.
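
A sketch of such a helper (shape illustrative; the flag bit already
exists in kvm_arch::flags):

  static inline bool kvm_vm_has_ran_once(struct kvm *kvm)
  {
          return test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags);
  }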

No functional change intended.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-3-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Fuad Tabba e81625218b KVM: arm64: Advertise ID_AA64PFR0_EL1.CSV2/3 to protected VMs
The existing pKVM code attempts to advertise CSV2/3 using values
initialized to 0, but never set. To advertise CSV2/3 to protected
guests, pass the CSV2/3 values to hyp when initializing hyp's
view of guests' ID_AA64PFR0_EL1.

Similar to non-protected KVM, these are system-wide, rather than
per cpu, for simplicity.

Fixes: 6c30bfb18d ("KVM: arm64: Add handlers for protected VM System Registers")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20230404152321.413064-1-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-04 15:52:06 +00:00
Alexey Kardashevskiy 52882b9c7a KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-31 11:19:05 -04:00
Marc Zyngier 81dc9504a7 KVM: arm64: nv: timers: Support hyp timer emulation
Emulating EL2 also means emulating the EL2 timers. To do so, we expand
our timer framework to deal with at most 4 timers. At any given time,
two timers are using the HW timers, and the two others are purely
emulated.

The role of deciding which is which at any given time is left to a
mapping function which is called every time we need to make such a
decision.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-18-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 1e0eec09d4 KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
Being able to set a global offset isn't enough.

With NV, we also need a per-vcpu, per-timer offset (for example,
CNTVCT_EL0 being offset by CNTVOFF_EL2).

Use a similar method as the VM-wide offset to have a timer point
to the shadow register that contains the offset value.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-17-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 476fcd4b7b KVM: arm64: timers: Abstract the number of valid timers per vcpu
We so far have a pretty fixed number of timers to take care of.
This is about to change as NV brings another two into the
picture, and we must be careful not to try and emulate non-valid
timers in a given VM.

For this, abstract the number of timers for a given vcpu behind
an accessor, which helpfully returns a constant for now.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-15-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier e9adde432b KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
Now that it is likely that CNTPCT_EL0 accesses will trap,
fast-track the emulation of the counter read which doesn't
need more than a simple offsetting.

One day, we'll have CNTPOFF everywhere. One day.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-14-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 8a5eb2d210 KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
Having the timer IRQs duplicated into each vcpu isn't great, and
becomes absolutely awful with NV. So let's move these into
the per-VM arch_timer_vm_data structure.

This simplifies a lot of code, but requires us to introduce a
mutex so that we can reason about userspace trying to change
an interrupt number while another vcpu is running, something
that wasn't really well handled so far.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-12-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 33c549460e KVM: arm64: timers: Abstract per-timer IRQ access
As we are about to move the location of the per-timer IRQ into
the VM structure, abstract the location of the IRQ behind an
accessor. This will make the repainting slightly less painful.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-11-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 5591805d2c KVM: arm64: timers: Rationalise per-vcpu timer init
The way we initialise our timer contexts may be satisfactory
for two timers, but will be getting pretty annoying with four.

Clean up the whole thing by removing the code duplication and
getting rid of unused IRQ configuration elements.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-10-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 680232a94c KVM: arm64: timers: Allow save/restoring of the physical timer
Nothing like being 10 years late to a party!

Now that userspace can set counter offsets, we can save/restore
the physical timer as well!

Nobody really cared so far, but you're welcome anyway.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-9-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 30ec7997d1 KVM: arm64: timers: Allow userspace to set the global counter offset
And this is the moment you have all been waiting for: setting the
counter offset from userspace.

We expose a brand new capability that reports the ability to set
the offset for both the virtual and physical sides.

In keeping with the architecture, the offset is expressed as
a delta that is substracted from the physical counter value.

Once this new API is used, there is no going back, and the counters
can no longer be written to set the offsets implicitly (the writes
are instead ignored).
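
In other words, as a one-line sketch:

  /* What the guest observes once an offset is set: */
  guest_counter = physical_counter - offset;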

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-8-maz@kernel.org
2023-03-30 19:01:10 +01:00
Marc Zyngier 96906a9150 KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
Being able to lock/unlock all vcpus in one go is a feature that
only the vgic has enjoyed so far. Let's be brave and expose it
to the world.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-7-maz@kernel.org
2023-03-30 19:01:09 +01:00
Marc Zyngier c605ee2450 KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
CNTPOFF_EL2 is awesome, but it is mostly vapourware, and no publicly
available implementation has it. So for the common mortals, let's
implement the emulated version of this thing.

It means trapping accesses to the physical counter and timer, and
emulate some of it as necessary.

As for CNTPOFF_EL2, nobody sets the offset yet.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-6-maz@kernel.org
2023-03-30 19:01:09 +01:00
Marc Zyngier 2b4825a869 KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
With ECV and CNTPOFF_EL2, it is very easy to offer an offset for
the physical timer. So let's do just that.

Nothing can set the offset yet, so this should have no effect
whatsoever (famous last words...).

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-5-maz@kernel.org
2023-03-30 19:01:09 +01:00
Marc Zyngier 0d0ae656b7 KVM: arm64: timers: Use a per-vcpu, per-timer accumulator for fractional ns
Instead of accumulating in a global variable the fractional ns value
generated every time we compute a ns delta, use a per-vcpu, per-timer
variable. This keeps the fractional ns local to the timer instead of
contributing to any odd, unrelated timer.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-2-maz@kernel.org
2023-03-30 19:01:09 +01:00
Reiji Watanabe f9ea835e99 KVM: arm64: PMU: Restore the guest's EL0 event counting after migration
Currently, with VHE, KVM enables EL0 event counting for the
guest on vcpu_load(), or as part of the PMU register emulation
process when needed.  However, in the migration
case (with VHE), the same handling is lacking, as vPMU register
values that were restored by userspace haven't been propagated yet
(the PMU events haven't been created) at the vcpu load-time on the
first KVM_RUN (kvm_vcpu_pmu_restore_guest() called from vcpu_load()
on the first KVM_RUN won't do anything as events_{guest,host} of
kvm_pmu_events are still zero).

So, with VHE, enable the guest's EL0 event counting on the first
KVM_RUN (after the migration) when needed.  More specifically,
have kvm_pmu_handle_pmcr() call kvm_vcpu_pmu_restore_guest()
so that kvm_pmu_handle_pmcr() on the first KVM_RUN can take
care of it.

Fixes: d0c94c4979 ("KVM: arm64: Restore PMU configuration on first run")
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230329023944.2488484-1-reijiw@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-03-30 06:14:43 +00:00
Oliver Upton f003277311 KVM: arm64: Use config_lock to protect vgic state
Almost all of the vgic state is VM-scoped but accessed from the context
of a vCPU. These accesses were serialized on the kvm->lock which cannot
be nested within a vcpu->mutex critical section.

Move over the vgic state to using the config_lock. Tweak the lock
ordering where necessary to ensure that the config_lock is acquired
after the vcpu->mutex. Acquire the config_lock in kvm_vgic_create() to
avoid a race between the converted flows and GIC creation. Where
necessary, continue to acquire kvm->lock to avoid a race with vCPU
creation (i.e. flows that use lock_all_vcpus()).

Finally, promote the locking expectations in comments to lockdep
assertions and update the locking documentation for the config_lock as
well as vcpu->mutex.

Cc: stable@vger.kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-5-oliver.upton@linux.dev
2023-03-29 14:08:31 +01:00
Oliver Upton 4bba7f7def KVM: arm64: Use config_lock to protect data ordered against KVM_RUN
There are various bits of VM-scoped data that can only be configured
before the first call to KVM_RUN, such as the hypercall bitmaps and
the PMU. As these fields are protected by the kvm->lock and accessed
while holding vcpu->mutex, this is yet another example of lock
inversion.

Change out the kvm->lock for kvm->arch.config_lock in all of these
instances. Opportunistically simplify the locking mechanics of the
PMU configuration by holding the config_lock for the entirety of
kvm_arm_pmu_v3_set_attr().

Note that this also addresses a couple of bugs. There is an unguarded
read of the PMU version in KVM_ARM_VCPU_PMU_V3_FILTER which could race
with KVM_ARM_VCPU_PMU_V3_SET_PMU. Additionally, until now writes to the
per-vCPU vPMU irq were not serialized VM-wide, meaning concurrent calls
to KVM_ARM_VCPU_PMU_V3_IRQ could lead to a false positive in
pmu_irq_is_valid().

Cc: stable@vger.kernel.org
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-4-oliver.upton@linux.dev
2023-03-29 14:08:31 +01:00
Oliver Upton c43120afb5 KVM: arm64: Avoid lock inversion when setting the VM register width
kvm->lock must be taken outside of the vcpu->mutex. Of course, the
locking documentation for KVM makes this abundantly clear. Nonetheless,
the locking order in KVM/arm64 has been wrong for quite a while; we
acquire the kvm->lock while holding the vcpu->mutex all over the shop.

All was seemingly fine until commit 42a90008f8 ("KVM: Ensure lockdep
knows about kvm->lock vs. vcpu->mutex ordering rule") caught us with our
pants down, leading to lockdep barfing:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.2.0-rc7+ #19 Not tainted
 ------------------------------------------------------
 qemu-system-aar/859 is trying to acquire lock:
 ffff5aa69269eba0 (&host_kvm->lock){+.+.}-{3:3}, at: kvm_reset_vcpu+0x34/0x274

 but task is already holding lock:
 ffff5aa68768c0b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8c/0xba0

 which lock already depends on the new lock.

Add a dedicated lock to serialize writes to VM-scoped configuration from
the context of a vCPU. Protect the register width flags with the new
lock, thus avoiding the need to grab the kvm->lock while holding
vcpu->mutex in kvm_reset_vcpu().
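
The resulting lock ordering (outermost first), as a sketch:

  /* kvm->lock  -->  vcpu->mutex  -->  kvm->arch.config_lock */
  mutex_lock(&kvm->arch.config_lock);
  /* ... write VM-scoped configuration ... */
  mutex_unlock(&kvm->arch.config_lock);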

Cc: stable@vger.kernel.org
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Link: https://lore.kernel.org/kvmarm/f6452cdd-65ff-34b8-bab0-5c06416da5f6@arm.com/
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-3-oliver.upton@linux.dev
2023-03-29 14:08:31 +01:00
Oliver Upton 0acc7239c2 KVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON
KVM/arm64 had the lock ordering backwards on vcpu->mutex and kvm->lock
from the very beginning. One such example is the way vCPU resets are
handled: the kvm->lock is acquired while handling a guest CPU_ON PSCI
call.

Add a dedicated lock to serialize writes to kvm_vcpu_arch::{mp_state,
reset_state}. Promote all accessors of mp_state to {READ,WRITE}_ONCE()
as readers do not acquire the mp_state_lock. While at it, plug yet
another race by taking the mp_state_lock in the KVM_SET_MP_STATE ioctl
handler.

As changes to MP state are now guarded with a dedicated lock, drop the
kvm->lock acquisition from the PSCI CPU_ON path. Similarly, move the
reader of reset_state outside of the kvm->lock and instead protect it
with the mp_state_lock. Note that writes to reset_state::reset have been
demoted to regular stores as both readers and writers acquire the
mp_state_lock.

While the kvm->lock inversion still exists in kvm_reset_vcpu(), at least
now PSCI CPU_ON no longer depends on it for serializing vCPU reset.

Cc: stable@vger.kernel.org
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-2-oliver.upton@linux.dev
2023-03-29 14:08:31 +01:00
Marc Zyngier 8c2e8ac8ad KVM: arm64: Check for kvm_vma_mte_allowed in the critical section
On page fault, we find out about the VMA that backs the page fault
early on, and quickly release the mmap_read_lock. However, using
the VMA pointer after the critical section is pretty dangerous,
as a teardown may happen in the meantime and the VMA be long gone.

Move the sampling of the MTE permission early, and NULL-ify the
VMA pointer after that, just to be on the safe side.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230316174546.3777507-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-03-16 23:42:56 +00:00