Commit Graph

39991 Commits

Author SHA1 Message Date
Linus Torvalds 55a677b256 EFI fix for v5.16
Ensure that the EFI memory map resides in encrypted memory even after it
 has been reallocated.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmGt9nUACgkQw08iOZLZ
 jyT6ZgwAt1jkqt/ILev5MPE2rEQ8BrYhR7pMBcNW7su4BECwRfmQBGjYFeeXjlqp
 7tK2L18DsHJxgt699gumDD7IyjbLNUx9zqsd7nJzlvnpwjGSiUhAGLPwoAeRSEgD
 sxK/HIt6J5ZX/zMsEbtptP9M1eitbZQhtt+0aCLcJvt1NCZc/+FvUaEEPI+6BrMt
 TMWdXd6MOZ8FV2iq+FUpTeG9SGzze6QaIiL7H5Z/otvXbBG1iWmlWR0bR17CXGUC
 thIuqCjEKSM7P45kFF+byCq5ajo55ULy3ZrPVQUYLt/FdMdI+pLxK8TTg46xq8KS
 mXqexjtCPLiJP728lWTETIPB7x9aQW/i6cwesh0O6cmhPznJybW/+uR9zBsyiSGs
 8+9ua+JT9+B3bEn1rsbWYKEcy3G4KrxwtJ5aqRNVJM96pel/yxA4EYkmfyf4TlrS
 pxbY2Agzlci18qCHb5lycc1yM6WHdirzY1l0m8uwqk7o8GHtRMMMU/7e/kvmtxUA
 j9V3vqm6
 =mDRi
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fix from Ard Biesheuvel:
 "Ensure that the EFI memory map resides in encrypted memory even after
  it has been reallocated"

* tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  x86/sme: Explicitly map new EFI memmap table as encrypted
2021-12-06 10:09:00 -08:00
Linus Torvalds f5d54a42d3 - Fix a couple of SWAPGS fencing issues in the x86 entry code
- Use the proper operand types in __{get,put}_user() to prevent
 truncation in SEV-ES string io
 
 - Make sure the kernel mappings are present in trampoline_pgd in order
 to prevent any potential accesses to unmapped memory after switching to
 it
 
 - Fix a trivial list corruption in objtool's pv_ops validation
 
 - Disable the clocksource watchdog for TSC on platforms which claim
 that the TSC is constant, doesn't stop in sleep states, CPU has TSC
 adjust and the number of sockets of the platform are max 2, to prevent
 erroneous markings of the TSC as unstable.
 
 - Make sure TSC adjust is always checked not only when going idle
 
 - Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
 the FPU code
 
 - Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsnWcACgkQEsHwGGHe
 VUoR+g/9FcOP0/XLH+LKHumYc9JHXsp5BvYGihyypMFgU0fXQBORtGqdls8jZtiJ
 kEdbW6iL0MRlyN8aHJCqr7dJqs7KJlpWes6hky7BY+U+7uewtjL5y3eSyZnA34T3
 M/Raecx27Hh0L0kHQlHXTUN73v1cgDvq3dCXWsP7Jqgjf5cEmCcV/tPEateqhq/f
 8TkLVIm55rJlbJ0LBO/cT0V3Q8QH9JPKm7nviOZuKCh9gcttFEPaM9MkaJyKUhoy
 O13jlenDoVkVWRXIQec1EZp2pTLxVAm/3Y0plge1yEVsejzh07gsQnMpoNeF+yFC
 8mDgSv8ZAED/vbsnB+BcgoRVj6ajG0+ilpLzcfPwUquiqS9pZrBSTddlvYDPjRMC
 MEXO548xiYgxmipu3r62H89nqmLEYQPk914rJu6bDnDeJ1gaabh8RXbNtQcRqqj3
 RETgVOp78iWn+aT33RLLD1EyodZb2IkMy087a3+TZICIXG81aDj9VgHvrVRnWnfY
 yKuldyrEKzi60yMQkV6h1oc8KSWQhspUSLtOVS9zrulCinYphFOfYFrzFmcKUWIq
 GdVb9eaP2oNBGfPybXP+TBLGZ4Zv9iXZmaEUk7ZGCjgv3ZmGMWJ18Hs/ufs2cwWK
 RNNUo3sz/y3OsreHowkWIk1eSxI16MabB7G/PDMnBSHlioVT390=
 =d6nS
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix a couple of SWAPGS fencing issues in the x86 entry code

 - Use the proper operand types in __{get,put}_user() to prevent
   truncation in SEV-ES string io

 - Make sure the kernel mappings are present in trampoline_pgd in order
   to prevent any potential accesses to unmapped memory after switching
   to it

 - Fix a trivial list corruption in objtool's pv_ops validation

 - Disable the clocksource watchdog for TSC on platforms which claim
   that the TSC is constant, doesn't stop in sleep states, CPU has TSC
   adjust and the number of sockets of the platform are max 2, to
   prevent erroneous markings of the TSC as unstable.

 - Make sure TSC adjust is always checked not only when going idle

 - Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
   the FPU code

 - Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention

* tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
  x86/entry: Use the correct fence macro after swapgs in kernel CR3
  x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
  x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
  x86/64/mm: Map all kernel memory into trampoline_pgd
  objtool: Fix pv_ops noinstr validation
  x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
  x86/tsc: Add a timer to make sure TSC_adjust is always checked
  x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()
  x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
2021-12-05 08:43:35 -08:00
Linus Torvalds 90bf8d98b4 * Static analysis fix
* New SEV-ES protocol for communicating invalid VMGEXIT requests
 * Ensure APICv is considered inactive if there is no APIC
 * Fix reserved bits for AMD PerfEvtSeln register
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGscuAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOnzAgAg/0tifAOC8F31xu+gbbFa2hJsfH1
 sr2QXqVI6XUWeAHw5GEEcSFHGcFc0BrmfPEL4T3tEhkTGcijL2uTK8dwgq3ue4yg
 5LzaZzh03AWi4x84rV3XNVHHatzF69tgbUG49rSlK2T6BkGWzh4gI5LZV7XqNqLh
 acW+92YcCGo/O9RAUYYakofX4bp0rsQaZornQiD/R5X6AlrtMUyhAYHH5Wnv69n+
 MHf4K9MzrtixXbTvkOXflN5yz6TkIGvCpCK+gmppeuJ3JyZshjn+y95XcDDZtB6o
 +4Ypap3SmIkjTg4VwS8lRwIFkkY31hxhW2ohRnorfP2Q5Ckcz2/kVcIEew==
 =uJMy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm fixes from Paolo Bonzini:

 - Static analysis fix

 - New SEV-ES protocol for communicating invalid VMGEXIT requests

 - Ensure APICv is considered inactive if there is no APIC

 - Fix reserved bits for AMD PerfEvtSeln register

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
  KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
  KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
  KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
  KVM: VMX: Set failure code in prepare_vmcs02()
  KVM: ensure APICv is considered inactive if there is no APIC
  KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
2021-12-05 08:25:33 -08:00
Tom Lendacky 1ff2fc0286 x86/sme: Explicitly map new EFI memmap table as encrypted
Reserving memory using efi_mem_reserve() calls into the x86
efi_arch_mem_reserve() function. This function will insert a new EFI
memory descriptor into the EFI memory map representing the area of
memory to be reserved and marking it as EFI runtime memory. As part
of adding this new entry, a new EFI memory map is allocated and mapped.
The mapping is where a problem can occur. This new memory map is mapped
using early_memremap() and generally mapped encrypted, unless the new
memory for the mapping happens to come from an area of memory that is
marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will
be mapped unencrypted. However, during replacement of the old memory map,
efi_mem_type() is disabled, so the new memory map will now be long-term
mapped encrypted (in efi.memmap), resulting in the map containing invalid
data and causing the kernel boot to crash.

Since it is known that the area will be mapped encrypted going forward,
explicitly map the new memory map as encrypted using early_memremap_prot().

Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 8f716c9b5f ("x86/mm: Add support to access boot related data in the clear")
Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ardb: incorporate Kconfig fix by Arnd]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2021-12-05 16:44:52 +01:00
Tom Lendacky ad5b353240 KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.

The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.

The new error codes are documented in the lasest update to the GHCB
specification.

Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:04 -05:00
Sean Christopherson a655276a59 KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
that KVM falls back to __vmalloc() if physically contiguous memory isn't
available.  The buffer is purely a KVM software construct, i.e. there's
no need for it to be physically contiguous.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:03 -05:00
Sean Christopherson 75236f5f22 KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
Return appropriate error codes if setting up the GHCB scratch area for an
SEV-ES guest fails.  In particular, returning -EINVAL instead of -ENOMEM
when allocating the kernel buffer could be confusing as userspace would
likely suspect a guest issue.

Fixes: 8f423a80d2 ("KVM: SVM: Support MMIO for an SEV-ES guest")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:03 -05:00
Lai Jiangshan 5c8f6a2e31 x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the
trampoline stack. But XEN pv doesn't use trampoline stack, so
PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack.

In that case, source and destination stacks are identical, which means
that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv
would cause %rsp to move up to the top of the kernel stack and leave the
IRET frame below %rsp.

This is dangerous as it can be corrupted if #NMI / #MC hit as either of
these events occurring in the middle of the stack pushing would clobber
data on the (original) stack.

And, with  XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing
the IRET frame on to the original address is useless and error-prone
when there is any future attempt to modify the code.

 [ bp: Massage commit message. ]

Fixes: 7f2590a110 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com
2021-12-03 19:21:15 +01:00
Lai Jiangshan 1367afaa2e x86/entry: Use the correct fence macro after swapgs in kernel CR3
The commit

  c758907004 ("x86/entry/64: Remove unneeded kernel CR3 switching")

removed a CR3 write in the faulting path of load_gs_index().

But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is
enabled, see spectre_v1_select_mitigation().

Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3
and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make
sure speculation is blocked.

 [ bp: Massage commit message and comment. ]

Fixes: c758907004 ("x86/entry/64: Remove unneeded kernel CR3 switching")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211126101209.8613-3-jiangshanlai@gmail.com
2021-12-03 19:13:53 +01:00
Lai Jiangshan c07e45553d x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
Commit

  18ec54fdd6 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")

added FENCE_SWAPGS_{KERNEL|USER}_ENTRY for conditional SWAPGS. In
paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both
branches. This is because the fence is required for both cases since the
CR3 write is conditional even when PTI is enabled.

But

  96b2371413 ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")

changed the order of SWAPGS and the CR3 write. And it missed the needed
FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case.

Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY
can cover both branches.

  [ bp: Massage, fix typos, remove obsolete comment while at it. ]

Fixes: 96b2371413 ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211126101209.8613-2-jiangshanlai@gmail.com
2021-12-03 18:55:47 +01:00
Michael Sterritt 1d5379d047 x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
Properly type the operands being passed to __put_user()/__get_user().
Otherwise, these routines truncate data for dependent instructions
(e.g., INSW) and only read/write one byte.

This has been tested by sending a string with REP OUTSW to a port and
then reading it back in with REP INSW on the same port.

Previous behavior was to only send and receive the first char of the
size. For example, word operations for "abcd" would only read/write
"ac". With change, the full string is now written and read back.

Fixes: f980f9c31a (x86/sev-es: Compile early handler code into kernel image)
Signed-off-by: Michael Sterritt <sterritt@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com
2021-12-03 18:09:30 +01:00
Joerg Roedel 51523ed1c2 x86/64/mm: Map all kernel memory into trampoline_pgd
The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff
range of kernel memory (with 4-level paging). This range contains the
kernel's text+data+bss mappings and the module mapping space but not the
direct mapping and the vmalloc area.

This is enough to get the application processors out of real-mode, but
for code that switches back to real-mode the trampoline_pgd is missing
important parts of the address space. For example, consider this code
from arch/x86/kernel/reboot.c, function machine_real_restart() for a
64-bit kernel:

  #ifdef CONFIG_X86_32
  	load_cr3(initial_page_table);
  #else
  	write_cr3(real_mode_header->trampoline_pgd);

  	/* Exiting long mode will fail if CR4.PCIDE is set. */
  	if (boot_cpu_has(X86_FEATURE_PCID))
  		cr4_clear_bits(X86_CR4_PCIDE);
  #endif

  	/* Jump to the identity-mapped low memory code */
  #ifdef CONFIG_X86_32
  	asm volatile("jmpl *%0" : :
  		     "rm" (real_mode_header->machine_real_restart_asm),
  		     "a" (type));
  #else
  	asm volatile("ljmpl *%0" : :
  		     "m" (real_mode_header->machine_real_restart_asm),
  		     "D" (type));
  #endif

The code switches to the trampoline_pgd, which unmaps the direct mapping
and also the kernel stack. The call to cr4_clear_bits() will find no
stack and crash the machine. The real_mode_header pointer below points
into the direct mapping, and dereferencing it also causes a crash.

The reason this does not crash always is only that kernel mappings are
global and the CR3 switch does not flush those mappings. But if theses
mappings are not in the TLB already, the above code will crash before it
can jump to the real-mode stub.

Extend the trampoline_pgd to contain all kernel mappings to prevent
these crashes and to make code which runs on this page-table more
robust.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org
2021-12-03 09:11:43 +01:00
Sean Christopherson a955cad84c KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update.  Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.

For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.

For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic.  Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest.  But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root.  Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete.  But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.

Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:12 -05:00
Dan Carpenter bfbb307c62 KVM: VMX: Set failure code in prepare_vmcs02()
The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not.  It leads to using an
uninitialized variable in the caller.

Fixes: 71f7347025 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:11 -05:00
Paolo Bonzini ef8b4b7203 KVM: ensure APICv is considered inactive if there is no APIC
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed.  This triggers a WARN:

   WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));

To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC.  This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.

Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:11 -05:00
Like Xu cb1d220da0 KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
If we run the following perf command in an AMD Milan guest:

  perf stat \
  -e cpu/event=0x1d0/ \
  -e cpu/event=0x1c7/ \
  -e cpu/umask=0x1f,event=0x18e/ \
  -e cpu/umask=0x7,event=0x18e/ \
  -e cpu/umask=0x18,event=0x18e/ \
  ./workload

dmesg will report a #GP warning from an unchecked MSR access
error on MSR_F15H_PERF_CTLx.

This is because according to APM (Revision: 4.03) Figure 13-7,
the bits [35:32] of AMD PerfEvtSeln register is a part of the
event select encoding, which extends the EVENT_SELECT field
from 8 bits to 12 bits.

Opportunistically update pmu->reserved_bits for reserved bit 19.

Reported-by: Jim Mattson <jmattson@google.com>
Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211118130320.95997-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:11:50 -05:00
Feng Tang b50db7095f x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
There are cases that the TSC clocksource is wrongly judged as unstable by
the clocksource watchdog mechanism which tries to validate the TSC against
HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to
check the validity of a watchdog, Thomas Gleixner proposed [1]:

"I'm inclined to lift that requirement when the CPU has:

    1) X86_FEATURE_CONSTANT_TSC
    2) X86_FEATURE_NONSTOP_TSC
    3) X86_FEATURE_NONSTOP_TSC_S3
    4) X86_FEATURE_TSC_ADJUST
    5) At max. 4 sockets

 After two decades of horrors we're finally at a point where TSC seems
 to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST
 was really key as we can now detect even small modifications reliably
 and the important point is that we can cure them as well (not pretty
 but better than all other options)."

As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations
of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC
and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive
to use maximal 2 sockets.

The check is done inside tsc_init() before registering 'tsc-early' and
'tsc' clocksources, as there were cases that both of them had been
wrongly judged as unreliable.

For more background of tsc/watchdog, there is a good summary in [2]

[tglx} Update vs. jiffies:

  On systems where the only remaining clocksource aside of TSC is jiffies
  there is no way to make this work because that creates a circular
  dependency. Jiffies accuracy depends on not missing a periodic timer
  interrupt, which is not guaranteed. That could be detected by TSC, but as
  TSC is not trusted this cannot be compensated. The consequence is a
  circulus vitiosus which results in shutting down TSC and falling back to
  the jiffies clocksource which is even more unreliable.

[1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/
[2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/

[ tglx: Refine comment and amend changelog ]

Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
2021-12-02 00:40:36 +01:00
Feng Tang c7719e7934 x86/tsc: Add a timer to make sure TSC_adjust is always checked
The TSC_ADJUST register is checked every time a CPU enters idle state, but
Thomas Gleixner mentioned there is still a caveat that a system won't enter
idle [1], either because it's too busy or configured purposely to not enter
idle.

Setup a periodic timer (every 10 minutes) to make sure the check is
happening on a regular base.

[1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/

Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
2021-12-02 00:40:35 +01:00
Marco Elver 52d0b8b187 x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()
save_sw_bytes() did not fully initialize sw_bytes, which caused KMSAN
to report an infoleak (see below).
Initialize sw_bytes explicitly to avoid this.

KMSAN report follows:

=====================================================
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user ./include/linux/instrumented.h:121
BUG: KMSAN: kernel-infoleak in __copy_to_user ./include/linux/uaccess.h:154
BUG: KMSAN: kernel-infoleak in save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127
 instrument_copy_to_user ./include/linux/instrumented.h:121
 __copy_to_user ./include/linux/uaccess.h:154
 save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127
 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245
 get_sigframe+0x656/0x7e0 arch/x86/kernel/signal.c:296
 __setup_rt_frame+0x14d/0x2a60 arch/x86/kernel/signal.c:471
 setup_rt_frame arch/x86/kernel/signal.c:781
 handle_signal arch/x86/kernel/signal.c:825
 arch_do_signal_or_restart+0x417/0xdd0 arch/x86/kernel/signal.c:870
 handle_signal_work kernel/entry/common.c:149
 exit_to_user_mode_loop+0x1f6/0x490 kernel/entry/common.c:173
 exit_to_user_mode_prepare kernel/entry/common.c:208
 __syscall_exit_to_user_mode_work kernel/entry/common.c:290
 syscall_exit_to_user_mode+0x7e/0xc0 kernel/entry/common.c:302
 do_syscall_64+0x60/0xd0 arch/x86/entry/common.c:88
 entry_SYSCALL_64_after_hwframe+0x44/0xae ??:?

Local variable sw_bytes created at:
 save_xstate_epilog+0x80/0x510 arch/x86/kernel/fpu/signal.c:121
 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245

Bytes 20-47 of 48 are uninitialized
Memory access of size 48 starts at ffff8880801d3a18
Data copied to user address 00007ffd90e2ef50
=====================================================

Link: https://lore.kernel.org/all/CAG_fn=V9T6OKPonSjsi9PmWB0hMHFC=yawozdft8i1-MSxrv=w@mail.gmail.com/
Fixes: 53599b4d54 ("x86/fpu/signal: Prepare for variable sigframe length")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20211126124746.761278-1-glider@google.com
2021-11-30 15:13:47 -08:00
Tony Luck 7d697f0d57 x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
Convention for all the other "lake" CPUs is all one word.

So s/RAPTOR_LAKE/RAPTORLAKE/

Fixes: fbdb5e8f29 ("x86/cpu: Add Raptor Lake to Intel family")
Reported-by: Rui Zhang <rui.zhang@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211119170832.1034220-1-tony.luck@intel.com
2021-11-30 14:05:48 -08:00
Linus Torvalds f080815fdb ARM64:
* Fix constant sign extension affecting TCR_EL2 and preventing
 running on ARMv8.7 models due to spurious bits being set
 
 * Fix use of helpers using PSTATE early on exit by always sampling
 it as soon as the exit takes place
 
 * Move pkvm's 32bit handling into a common helper
 
 RISC-V:
 
 * Fix incorrect KVM_MAX_VCPUS value
 
 * Unmap stage2 mapping when deleting/moving a memslot
 
 x86:
 
 * Fix and downgrade BUG_ON due to uninitialized cache
 
 * Many APICv and MOVE_ENC_CONTEXT_FROM fixes
 
 * Correctly emulate TLB flushes around nested vmentry/vmexit
 and when the nested hypervisor uses VPID
 
 * Prevent modifications to CPUID after the VM has run
 
 * Other smaller bugfixes
 
 Generic:
 
 * Memslot handling bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGmHBEUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkGgf/RBjt1d7H6Um7tD7oA5QiIHmNY4ko
 K/90OAa8h62rilxpqxkRgLNmphBc5AzcbufVXN4J1hVhw2M+u1ouDxKeHS1GEZTA
 /XdNb0dwK99TpOJkIcuV/NQVIZUxkM00VbIiCoLkX06VuIc1Gie1G4bqzLhWCP8Y
 ts9l/pkfafvfEmjmcjVd7gkDOnEPbT+JPDJcuo/RA7C7Z2L4+8DsFeyfWGqBP647
 J6omUUxD82QRm28OVOK4V7aNALWsAdlaqHrVFAPZywQl7QTWMO0UTcKTdCCB2B4Q
 QnHejFV6pFh55q3/fhe7epy9e2Sw+NOsmWKTEGPbU5nn94R8lyW1GV4ZUQ==
 =Nduu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix constant sign extension affecting TCR_EL2 and preventing
     running on ARMv8.7 models due to spurious bits being set

   - Fix use of helpers using PSTATE early on exit by always sampling it
     as soon as the exit takes place

   - Move pkvm's 32bit handling into a common helper

  RISC-V:

   - Fix incorrect KVM_MAX_VCPUS value

   - Unmap stage2 mapping when deleting/moving a memslot

  x86:

   - Fix and downgrade BUG_ON due to uninitialized cache

   - Many APICv and MOVE_ENC_CONTEXT_FROM fixes

   - Correctly emulate TLB flushes around nested vmentry/vmexit and when
     the nested hypervisor uses VPID

   - Prevent modifications to CPUID after the VM has run

   - Other smaller bugfixes

  Generic:

   - Memslot handling bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
  KVM: fix avic_set_running for preemptable kernels
  KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
  KVM: SEV: accept signals in sev_lock_two_vms
  KVM: SEV: do not take kvm->lock when destroying
  KVM: SEV: Prohibit migration of a VM that has mirrors
  KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
  selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
  KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
  KVM: SEV: initialize regions_list of a mirror VM
  KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
  KVM: SEV: do not use list_replace_init on an empty list
  KVM: x86: Use a stable condition around all VT-d PI paths
  KVM: x86: check PIR even for vCPUs with disabled APICv
  KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
  KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
  KVM: x86/mmu: Handle "default" period when selectively waking kthread
  KVM: MMU: shadow nested paging does not have PKU
  KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
  KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
  KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
  ...
2021-11-30 09:22:15 -08:00
Paolo Bonzini 7cfc5c653b KVM: fix avic_set_running for preemptable kernels
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id().  If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id.  Avoid this by blocking preemption over the
entire execution of avic_set_running().

Reported-by: Sean Christopherson <seanjc@google.com>
Fixes: 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 07:40:48 -05:00
Paolo Bonzini e90e51d5f0 KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
There is nothing to synchronize if APICv is disabled, since neither
other vCPUs nor assigned devices can set PIR.ON.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 07:40:47 -05:00
Paolo Bonzini c9d61dcb0b KVM: SEV: accept signals in sev_lock_two_vms
Generally, kvm->lock is not taken for a long time, but
sev_lock_two_vms is different: it takes vCPU locks
inside, so userspace can hold it back just by calling
a vCPU ioctl.  Play it safe and use mutex_lock_killable.

Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:15 -05:00
Paolo Bonzini 10a37929ef KVM: SEV: do not take kvm->lock when destroying
Taking the lock is useless since there are no other references,
and there are already accesses (e.g. to sev->enc_context_owner)
that do not take it.  So get rid of it.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:14 -05:00
Paolo Bonzini 17d44a96f0 KVM: SEV: Prohibit migration of a VM that has mirrors
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated.  Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:

1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)

The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:

1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M

So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.

Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:14 -05:00
Paolo Bonzini bf42b02b19 KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
Now that we have a facility to lock two VMs with deadlock
protection, use it for the creation of mirror VMs as well.  One of
COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst)
would always fail, so the combination is nonsensical and it is okay to
return -EBUSY if it is attempted.

This sidesteps the question of what happens if a VM is
MOVE_ENC_CONTEXT_FROM'd at the same time as it is
COPY_ENC_CONTEXT_FROM'd: the locking prevents that from
happening.

Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:13 -05:00
Paolo Bonzini 642525e3bd KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Allow intra-host migration of a mirror VM; the destination VM will be
a mirror of the same ASID as the source.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:12 -05:00
Paolo Bonzini 2b347a3878 KVM: SEV: initialize regions_list of a mirror VM
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
but technically harmless because the region list was unused for a mirror
VM.  However, it is untidy and it now causes a NULL pointer access when
attempting to move the encryption context of a mirror VM.

Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context")
Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:12 -05:00
Paolo Bonzini 501b580c02 KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Encapsulate the handling of the migration_in_progress flag for both VMs in
two functions sev_lock_two_vms and sev_unlock_two_vms.  It does not matter
if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit
later, and this change 1) keeps the cleanup chain of labels smaller 2)
makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic.

Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:11 -05:00
Paolo Bonzini 4674164f0a KVM: SEV: do not use list_replace_init on an empty list
list_replace_init cannot be used if the source is an empty list,
because "new->next->prev = new" will overwrite "old->next":

				new				old
				prev = new, next = new		prev = old, next = old
new->next = old->next		prev = new, next = old		prev = old, next = old
new->next->prev = new		prev = new, next = old		prev = old, next = new
new->prev = old->prev		prev = old, next = old		prev = old, next = old
new->next->prev = new		prev = old, next = old		prev = new, next = new

The desired outcome instead would be to leave both old and new the same
as they were (two empty circular lists).  Use list_cut_before, which
already has the necessary check and is documented to discard the
previous contents of the list that will hold the result.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:11 -05:00
Paolo Bonzini 53b7ca1a35 KVM: x86: Use a stable condition around all VT-d PI paths
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.

However, these checks do not attempt to synchronize with changes to
the IRTE.  In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.

To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:53:14 -05:00
Paolo Bonzini 37c4dbf337 KVM: x86: check PIR even for vCPUs with disabled APICv
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it.  In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.

Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.

This is not a problem for performance, or if anything it is an
improvement.  First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed.  Second, static_call_cond will
elide the function call if APICv is not present or disabled.  Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.

Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:52:39 -05:00
Paolo Bonzini 7e1901f6c8 KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
If APICv is disabled for this vCPU, assigned devices may still attempt to
post interrupts.  In that case, we need to cancel the vmentry and deliver
the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
injection of L1 interrupts into L2 to cover this case as well.

vmx_hwapic_irr_update is only called when APICv is active so it would be
confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
just use vmx_set_rvi directly in vmx_sync_pir_to_irr.

Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:51:58 -05:00
Sean Christopherson f47491d7f3 KVM: x86/mmu: Handle "default" period when selectively waking kthread
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period.  Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.

Fixes: 4dfe4f40d8 ("kvm: x86: mmu: Make NX huge page recovery period configurable")
Cc: stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:27 -05:00
Paolo Bonzini 28f091bc2f KVM: MMU: shadow nested paging does not have PKU
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding
incorrect interpretations of the nested hypervisor's page tables.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:26 -05:00
Sean Christopherson 4b85c921cd KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs.  Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.

Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.

Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:25 -05:00
Sean Christopherson 7533377215 KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
Use the yield-safe variant of the TDP MMU iterator when handling an
unmapping event from the MMU notifier, as most occurences of the event
allow yielding.

Fixes: e1eed5847b ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015008.3780032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:25 -05:00
Linus Torvalds 9557e60b8c A single fix for a missing __init annotation of prepare_command_line().
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjr4UTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXEXEAC79s58FkEV5LsKnmNe5pr/XG47Fgbz
 MMTX+P4IUwUrvRHPzEKPdO4lR8ZSFefhSHcPA06cWYyNyeh/UHq/sB4JGysuH7bw
 JAIJhJ3PsLv19TWvIKN3WHB7R3gwqKYzzzsKjKJHfepHv7kzidYz1fu380/bA2Zw
 LDjlMaQFSIA4rk7sBB/FFC3wsCvmU69iwG82x+QXX3ZWk0eB8bmw4Vsz6tsjo1D2
 EtEyxmjh2HjI+AbTORUsqucWvoKq3PvA9CX4SJdT/u+/ZX1zYR6iwsBkB+oiDNFx
 mxN15QKUA0t5NmF3dtoftj0lLM7iT7rd9PAdyPbWDm1o02z7Q780tsv7S8A2njMp
 ACwvCkupE9LSmEpFH9YyQeCBtyppXd0laN0mcL3h1MmSQbZs3KrtqyQDJ6McjPwL
 TiDFa8PKCBejGms7P6gPBv0DCsNNBQ0alAioi+zVsn8WbnMKGBpb0nplAdn1Cxt0
 GuJR63oEpX8Jtejb/018cEdDBiEmGh+7HXVXy4giAKnn92Pd7AhpWXwUYGuoD7j3
 1nRI9TOEoWS9ZcK7goaVp9D42YCEPsn66RPGyg25EaTlWJ0g9tmpbXhGzFPYg8FS
 tGIBw7CwlGMIKng+MWwuGu9hJYk0Tc1NcvrkxYTSVFNDYo4AQDpqwLlGuJsO893c
 8tFApEuQ+w5zkg==
 =QG5U
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 build fix from Thomas Gleixner:
 "A single fix for a missing __init annotation of prepare_command_line()"

* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Mark prepare_command_line() __init
2021-11-28 09:24:50 -08:00
Linus Torvalds 0757ca01d9 IOMMU Fixes for Linux v5.16-rc2:
Including:
 
   - Intel VT-d fixes:
     - Remove unused PASID_DISABLED
     - Fix RCU locking
     - Fix for the unmap_pages call-back
 
   - Rockchip RK3568 address mask fix
 
   - AMD IOMMUv2 log message clarification
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmGjhDgACgkQK/BELZcB
 GuMOiQ/+MPfnGSpKtEdID/p9d7yo97M/WKRhx5aT27hfW4gyyLtAO22vUTfZ/obK
 Mmwl/mthMm55BuzOes8/ka7zcrkaluJitFGWVLN6dzXZRTZc2nMdYQb1Y25IZVEP
 AwTDdfi8btPRYRrZ1vpXZtDzF5purGe0a5P0psUox7roSq+uWZcXOy3WZDbPNA9b
 pAwPTacnxbeSrElOF6rUyv5eXilvMDMG/1lF4/gFR89xAYcDIPpWNLuRWNYWxu2M
 qTbGBGXnSs/iIzRmBs5rhqbR5VQAb8tXQjjMBb5wWx9i6gaw217wHZPQrpS6ej3Z
 E8vpD/z/kxsMLBpiNM34an/3krgzm6alhA6dalZsvGdAvQHWj4LGma72lCBYsuM1
 +Lre4MvMJ1kvONH9aWoMJWZEITnL2pS3Afu72vhg+7ank4xI/5Ej3bO7mYfN2MV+
 XeEgIUv6HJioJ8ITwYyApJskDk0CLGvfmyHADwe+rj4A+YYzOgEv6kGQlZWzaNWa
 ISDibSMa4od4rw63CLS+rq4vK13MhfyerLZl9IMpvqcUKuWrb3SCiGjTLg5iWkL0
 eqoIIvMuUavmTNG6HMOwfUMNN963osLJPXjwMYkzWbbpusLqb38KvLCed3xGS8tb
 KwhFggAF9Qt++bjo9V8zk5FdAovW0n4m5Nu6WDFcMwPCxccy3xc=
 =+B76
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

 - Intel VT-d fixes:
     - Remove unused PASID_DISABLED
     - Fix RCU locking
     - Fix for the unmap_pages call-back

 - Rockchip RK3568 address mask fix

 - AMD IOMMUv2 log message clarification

* tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix unmap_pages support
  iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
  iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
  iommu/amd: Clarify AMD IOMMUv2 initialization messages
  iommu/vt-d: Remove unused PASID_DISABLED
2021-11-28 07:17:38 -08:00
Joerg Roedel 21e96a2035 iommu/vt-d: Remove unused PASID_DISABLED
The macro is unused after commit 00ecd54013 so it can be removed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd54013 ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
2021-11-26 22:54:20 +01:00
Linus Torvalds 6b54698aec xen: branch for v5.16-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYaD8mwAKCRCAXGG7T9hj
 vspAAPwLA5SUorji33PTetwmcpLcoRJ3Q4HAPz+bOPdm9iL/PgD/V8MtxFrFebBs
 AJoa+GmBarUNn7XCqKnCcA64iXhrpQw=
 =6GY3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - Kconfig fix to make it possible to control building of the privcmd
   driver

 - three fixes for issues identified by the kernel test robot

 - a five-patch series to simplify timeout handling for Xen PV driver
   initialization

 - two patches to fix error paths in xenstore/xenbus driver
   initialization

* tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: make HYPERVISOR_set_debugreg() always_inline
  xen: make HYPERVISOR_get_debugreg() always_inline
  xen: detect uninitialized xenbus in xenbus_init
  xen: flag xen_snd_front to be not essential for system boot
  xen: flag pvcalls-front to be not essential for system boot
  xen: flag hvc_xen to be not essential for system boot
  xen: flag xen_drm_front to be not essential for system boot
  xen: add "not_essential" flag to struct xenbus_driver
  xen/pvh: add missing prototype to header
  xen: don't continue xenstore initialization in case of errors
  xen/privcmd: make option visible in Kconfig
2021-11-26 09:54:13 -08:00
Lai Jiangshan 05b29633c7 KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
INVLPG operates on guest virtual address, which are represented by
vcpu->arch.walk_mmu.  In nested virtualization scenarios,
kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were
emulated by L0 (in practice, it hardly happen) when nested two-dimensional
paging is enabled, the call to ->tlb_flush_gva() would be skipped and
the hardware TLB entry would not be invalidated.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:21 -05:00
Lai Jiangshan 12ec33a705 KVM: X86: Fix when shadow_root_level=5 && guest root_level<4
If the is an L1 with nNPT in 32bit, the shadow walk starts with
pae_root.

Fixes: a717a780fc ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:20 -05:00
Vitaly Kuznetsov feb627e8d6 KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN
Commit 63f5a1909f ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after first successful KVM_RUN and promissed to make this sequence forbiden
in 5.16. It's time to fulfil the promise.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:20 -05:00
Sean Christopherson 712494de96 KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12
Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02.  From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12.  Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.

For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:11:29 -05:00
Sean Christopherson 40e5f90804 KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2.  E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.

Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:07:49 -05:00
Sean Christopherson 2b4a5a5d56 KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST
Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of
always flushing vpid01.  Any TLB flush that is triggered when L2 is
active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE
and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB
flush needs to be applied to L2's VPID.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:06:58 -05:00
Paolo Bonzini 30d7c5d60a KVM: SEV: expose KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability
The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:30 -05:00
Paolo Bonzini 78311a5140 KVM: x86: ignore APICv if LAPIC is not enabled
Synchronize the two calls to kvm_x86_sync_pir_to_irr.  The one
in the reenter-guest fast path invoked the callback unconditionally
even if LAPIC is present but disabled.  In this case, there are
no interrupts to deliver, and therefore posted interrupts can
be ignored.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:29 -05:00