Ensure that the EFI memory map resides in encrypted memory even after it
has been reallocated.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmGt9nUACgkQw08iOZLZ
jyT6ZgwAt1jkqt/ILev5MPE2rEQ8BrYhR7pMBcNW7su4BECwRfmQBGjYFeeXjlqp
7tK2L18DsHJxgt699gumDD7IyjbLNUx9zqsd7nJzlvnpwjGSiUhAGLPwoAeRSEgD
sxK/HIt6J5ZX/zMsEbtptP9M1eitbZQhtt+0aCLcJvt1NCZc/+FvUaEEPI+6BrMt
TMWdXd6MOZ8FV2iq+FUpTeG9SGzze6QaIiL7H5Z/otvXbBG1iWmlWR0bR17CXGUC
thIuqCjEKSM7P45kFF+byCq5ajo55ULy3ZrPVQUYLt/FdMdI+pLxK8TTg46xq8KS
mXqexjtCPLiJP728lWTETIPB7x9aQW/i6cwesh0O6cmhPznJybW/+uR9zBsyiSGs
8+9ua+JT9+B3bEn1rsbWYKEcy3G4KrxwtJ5aqRNVJM96pel/yxA4EYkmfyf4TlrS
pxbY2Agzlci18qCHb5lycc1yM6WHdirzY1l0m8uwqk7o8GHtRMMMU/7e/kvmtxUA
j9V3vqm6
=mDRi
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fix from Ard Biesheuvel:
"Ensure that the EFI memory map resides in encrypted memory even after
it has been reallocated"
* tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/sme: Explicitly map new EFI memmap table as encrypted
- Use the proper operand types in __{get,put}_user() to prevent
truncation in SEV-ES string io
- Make sure the kernel mappings are present in trampoline_pgd in order
to prevent any potential accesses to unmapped memory after switching to
it
- Fix a trivial list corruption in objtool's pv_ops validation
- Disable the clocksource watchdog for TSC on platforms which claim
that the TSC is constant, doesn't stop in sleep states, CPU has TSC
adjust and the number of sockets of the platform are max 2, to prevent
erroneous markings of the TSC as unstable.
- Make sure TSC adjust is always checked not only when going idle
- Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
the FPU code
- Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsnWcACgkQEsHwGGHe
VUoR+g/9FcOP0/XLH+LKHumYc9JHXsp5BvYGihyypMFgU0fXQBORtGqdls8jZtiJ
kEdbW6iL0MRlyN8aHJCqr7dJqs7KJlpWes6hky7BY+U+7uewtjL5y3eSyZnA34T3
M/Raecx27Hh0L0kHQlHXTUN73v1cgDvq3dCXWsP7Jqgjf5cEmCcV/tPEateqhq/f
8TkLVIm55rJlbJ0LBO/cT0V3Q8QH9JPKm7nviOZuKCh9gcttFEPaM9MkaJyKUhoy
O13jlenDoVkVWRXIQec1EZp2pTLxVAm/3Y0plge1yEVsejzh07gsQnMpoNeF+yFC
8mDgSv8ZAED/vbsnB+BcgoRVj6ajG0+ilpLzcfPwUquiqS9pZrBSTddlvYDPjRMC
MEXO548xiYgxmipu3r62H89nqmLEYQPk914rJu6bDnDeJ1gaabh8RXbNtQcRqqj3
RETgVOp78iWn+aT33RLLD1EyodZb2IkMy087a3+TZICIXG81aDj9VgHvrVRnWnfY
yKuldyrEKzi60yMQkV6h1oc8KSWQhspUSLtOVS9zrulCinYphFOfYFrzFmcKUWIq
GdVb9eaP2oNBGfPybXP+TBLGZ4Zv9iXZmaEUk7ZGCjgv3ZmGMWJ18Hs/ufs2cwWK
RNNUo3sz/y3OsreHowkWIk1eSxI16MabB7G/PDMnBSHlioVT390=
=d6nS
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix a couple of SWAPGS fencing issues in the x86 entry code
- Use the proper operand types in __{get,put}_user() to prevent
truncation in SEV-ES string io
- Make sure the kernel mappings are present in trampoline_pgd in order
to prevent any potential accesses to unmapped memory after switching
to it
- Fix a trivial list corruption in objtool's pv_ops validation
- Disable the clocksource watchdog for TSC on platforms which claim
that the TSC is constant, doesn't stop in sleep states, CPU has TSC
adjust and the number of sockets of the platform are max 2, to
prevent erroneous markings of the TSC as unstable.
- Make sure TSC adjust is always checked not only when going idle
- Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
the FPU code
- Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
* tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
x86/entry: Use the correct fence macro after swapgs in kernel CR3
x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
x86/64/mm: Map all kernel memory into trampoline_pgd
objtool: Fix pv_ops noinstr validation
x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
x86/tsc: Add a timer to make sure TSC_adjust is always checked
x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()
x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
* New SEV-ES protocol for communicating invalid VMGEXIT requests
* Ensure APICv is considered inactive if there is no APIC
* Fix reserved bits for AMD PerfEvtSeln register
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGscuAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOnzAgAg/0tifAOC8F31xu+gbbFa2hJsfH1
sr2QXqVI6XUWeAHw5GEEcSFHGcFc0BrmfPEL4T3tEhkTGcijL2uTK8dwgq3ue4yg
5LzaZzh03AWi4x84rV3XNVHHatzF69tgbUG49rSlK2T6BkGWzh4gI5LZV7XqNqLh
acW+92YcCGo/O9RAUYYakofX4bp0rsQaZornQiD/R5X6AlrtMUyhAYHH5Wnv69n+
MHf4K9MzrtixXbTvkOXflN5yz6TkIGvCpCK+gmppeuJ3JyZshjn+y95XcDDZtB6o
+4Ypap3SmIkjTg4VwS8lRwIFkkY31hxhW2ohRnorfP2Q5Ckcz2/kVcIEew==
=uJMy
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
- Static analysis fix
- New SEV-ES protocol for communicating invalid VMGEXIT requests
- Ensure APICv is considered inactive if there is no APIC
- Fix reserved bits for AMD PerfEvtSeln register
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
KVM: VMX: Set failure code in prepare_vmcs02()
KVM: ensure APICv is considered inactive if there is no APIC
KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
Reserving memory using efi_mem_reserve() calls into the x86
efi_arch_mem_reserve() function. This function will insert a new EFI
memory descriptor into the EFI memory map representing the area of
memory to be reserved and marking it as EFI runtime memory. As part
of adding this new entry, a new EFI memory map is allocated and mapped.
The mapping is where a problem can occur. This new memory map is mapped
using early_memremap() and generally mapped encrypted, unless the new
memory for the mapping happens to come from an area of memory that is
marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will
be mapped unencrypted. However, during replacement of the old memory map,
efi_mem_type() is disabled, so the new memory map will now be long-term
mapped encrypted (in efi.memmap), resulting in the map containing invalid
data and causing the kernel boot to crash.
Since it is known that the area will be mapped encrypted going forward,
explicitly map the new memory map as encrypted using early_memremap_prot().
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 8f716c9b5f ("x86/mm: Add support to access boot related data in the clear")
Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ardb: incorporate Kconfig fix by Arnd]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.
The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.
The new error codes are documented in the lasest update to the GHCB
specification.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
that KVM falls back to __vmalloc() if physically contiguous memory isn't
available. The buffer is purely a KVM software construct, i.e. there's
no need for it to be physically contiguous.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return appropriate error codes if setting up the GHCB scratch area for an
SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM
when allocating the kernel buffer could be confusing as userspace would
likely suspect a guest issue.
Fixes: 8f423a80d2 ("KVM: SVM: Support MMIO for an SEV-ES guest")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the
trampoline stack. But XEN pv doesn't use trampoline stack, so
PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack.
In that case, source and destination stacks are identical, which means
that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv
would cause %rsp to move up to the top of the kernel stack and leave the
IRET frame below %rsp.
This is dangerous as it can be corrupted if #NMI / #MC hit as either of
these events occurring in the middle of the stack pushing would clobber
data on the (original) stack.
And, with XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing
the IRET frame on to the original address is useless and error-prone
when there is any future attempt to modify the code.
[ bp: Massage commit message. ]
Fixes: 7f2590a110 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com
The commit
c758907004 ("x86/entry/64: Remove unneeded kernel CR3 switching")
removed a CR3 write in the faulting path of load_gs_index().
But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is
enabled, see spectre_v1_select_mitigation().
Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3
and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make
sure speculation is blocked.
[ bp: Massage commit message and comment. ]
Fixes: c758907004 ("x86/entry/64: Remove unneeded kernel CR3 switching")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211126101209.8613-3-jiangshanlai@gmail.com
Commit
18ec54fdd6 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")
added FENCE_SWAPGS_{KERNEL|USER}_ENTRY for conditional SWAPGS. In
paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both
branches. This is because the fence is required for both cases since the
CR3 write is conditional even when PTI is enabled.
But
96b2371413 ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
changed the order of SWAPGS and the CR3 write. And it missed the needed
FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case.
Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY
can cover both branches.
[ bp: Massage, fix typos, remove obsolete comment while at it. ]
Fixes: 96b2371413 ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211126101209.8613-2-jiangshanlai@gmail.com
Properly type the operands being passed to __put_user()/__get_user().
Otherwise, these routines truncate data for dependent instructions
(e.g., INSW) and only read/write one byte.
This has been tested by sending a string with REP OUTSW to a port and
then reading it back in with REP INSW on the same port.
Previous behavior was to only send and receive the first char of the
size. For example, word operations for "abcd" would only read/write
"ac". With change, the full string is now written and read back.
Fixes: f980f9c31a (x86/sev-es: Compile early handler code into kernel image)
Signed-off-by: Michael Sterritt <sterritt@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com
The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff
range of kernel memory (with 4-level paging). This range contains the
kernel's text+data+bss mappings and the module mapping space but not the
direct mapping and the vmalloc area.
This is enough to get the application processors out of real-mode, but
for code that switches back to real-mode the trampoline_pgd is missing
important parts of the address space. For example, consider this code
from arch/x86/kernel/reboot.c, function machine_real_restart() for a
64-bit kernel:
#ifdef CONFIG_X86_32
load_cr3(initial_page_table);
#else
write_cr3(real_mode_header->trampoline_pgd);
/* Exiting long mode will fail if CR4.PCIDE is set. */
if (boot_cpu_has(X86_FEATURE_PCID))
cr4_clear_bits(X86_CR4_PCIDE);
#endif
/* Jump to the identity-mapped low memory code */
#ifdef CONFIG_X86_32
asm volatile("jmpl *%0" : :
"rm" (real_mode_header->machine_real_restart_asm),
"a" (type));
#else
asm volatile("ljmpl *%0" : :
"m" (real_mode_header->machine_real_restart_asm),
"D" (type));
#endif
The code switches to the trampoline_pgd, which unmaps the direct mapping
and also the kernel stack. The call to cr4_clear_bits() will find no
stack and crash the machine. The real_mode_header pointer below points
into the direct mapping, and dereferencing it also causes a crash.
The reason this does not crash always is only that kernel mappings are
global and the CR3 switch does not flush those mappings. But if theses
mappings are not in the TLB already, the above code will crash before it
can jump to the real-mode stub.
Extend the trampoline_pgd to contain all kernel mappings to prevent
these crashes and to make code which runs on this page-table more
robust.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.
For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.
For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.
Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not. It leads to using an
uninitialized variable in the caller.
Fixes: 71f7347025 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed. This triggers a WARN:
WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC. This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we run the following perf command in an AMD Milan guest:
perf stat \
-e cpu/event=0x1d0/ \
-e cpu/event=0x1c7/ \
-e cpu/umask=0x1f,event=0x18e/ \
-e cpu/umask=0x7,event=0x18e/ \
-e cpu/umask=0x18,event=0x18e/ \
./workload
dmesg will report a #GP warning from an unchecked MSR access
error on MSR_F15H_PERF_CTLx.
This is because according to APM (Revision: 4.03) Figure 13-7,
the bits [35:32] of AMD PerfEvtSeln register is a part of the
event select encoding, which extends the EVENT_SELECT field
from 8 bits to 12 bits.
Opportunistically update pmu->reserved_bits for reserved bit 19.
Reported-by: Jim Mattson <jmattson@google.com>
Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211118130320.95997-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are cases that the TSC clocksource is wrongly judged as unstable by
the clocksource watchdog mechanism which tries to validate the TSC against
HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to
check the validity of a watchdog, Thomas Gleixner proposed [1]:
"I'm inclined to lift that requirement when the CPU has:
1) X86_FEATURE_CONSTANT_TSC
2) X86_FEATURE_NONSTOP_TSC
3) X86_FEATURE_NONSTOP_TSC_S3
4) X86_FEATURE_TSC_ADJUST
5) At max. 4 sockets
After two decades of horrors we're finally at a point where TSC seems
to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST
was really key as we can now detect even small modifications reliably
and the important point is that we can cure them as well (not pretty
but better than all other options)."
As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations
of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC
and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive
to use maximal 2 sockets.
The check is done inside tsc_init() before registering 'tsc-early' and
'tsc' clocksources, as there were cases that both of them had been
wrongly judged as unreliable.
For more background of tsc/watchdog, there is a good summary in [2]
[tglx} Update vs. jiffies:
On systems where the only remaining clocksource aside of TSC is jiffies
there is no way to make this work because that creates a circular
dependency. Jiffies accuracy depends on not missing a periodic timer
interrupt, which is not guaranteed. That could be detected by TSC, but as
TSC is not trusted this cannot be compensated. The consequence is a
circulus vitiosus which results in shutting down TSC and falling back to
the jiffies clocksource which is even more unreliable.
[1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/
[2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/
[ tglx: Refine comment and amend changelog ]
Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
The TSC_ADJUST register is checked every time a CPU enters idle state, but
Thomas Gleixner mentioned there is still a caveat that a system won't enter
idle [1], either because it's too busy or configured purposely to not enter
idle.
Setup a periodic timer (every 10 minutes) to make sure the check is
happening on a regular base.
[1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/
Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
Convention for all the other "lake" CPUs is all one word.
So s/RAPTOR_LAKE/RAPTORLAKE/
Fixes: fbdb5e8f29 ("x86/cpu: Add Raptor Lake to Intel family")
Reported-by: Rui Zhang <rui.zhang@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211119170832.1034220-1-tony.luck@intel.com
* Fix constant sign extension affecting TCR_EL2 and preventing
running on ARMv8.7 models due to spurious bits being set
* Fix use of helpers using PSTATE early on exit by always sampling
it as soon as the exit takes place
* Move pkvm's 32bit handling into a common helper
RISC-V:
* Fix incorrect KVM_MAX_VCPUS value
* Unmap stage2 mapping when deleting/moving a memslot
x86:
* Fix and downgrade BUG_ON due to uninitialized cache
* Many APICv and MOVE_ENC_CONTEXT_FROM fixes
* Correctly emulate TLB flushes around nested vmentry/vmexit
and when the nested hypervisor uses VPID
* Prevent modifications to CPUID after the VM has run
* Other smaller bugfixes
Generic:
* Memslot handling bugfixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGmHBEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkGgf/RBjt1d7H6Um7tD7oA5QiIHmNY4ko
K/90OAa8h62rilxpqxkRgLNmphBc5AzcbufVXN4J1hVhw2M+u1ouDxKeHS1GEZTA
/XdNb0dwK99TpOJkIcuV/NQVIZUxkM00VbIiCoLkX06VuIc1Gie1G4bqzLhWCP8Y
ts9l/pkfafvfEmjmcjVd7gkDOnEPbT+JPDJcuo/RA7C7Z2L4+8DsFeyfWGqBP647
J6omUUxD82QRm28OVOK4V7aNALWsAdlaqHrVFAPZywQl7QTWMO0UTcKTdCCB2B4Q
QnHejFV6pFh55q3/fhe7epy9e2Sw+NOsmWKTEGPbU5nn94R8lyW1GV4ZUQ==
=Nduu
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix constant sign extension affecting TCR_EL2 and preventing
running on ARMv8.7 models due to spurious bits being set
- Fix use of helpers using PSTATE early on exit by always sampling it
as soon as the exit takes place
- Move pkvm's 32bit handling into a common helper
RISC-V:
- Fix incorrect KVM_MAX_VCPUS value
- Unmap stage2 mapping when deleting/moving a memslot
x86:
- Fix and downgrade BUG_ON due to uninitialized cache
- Many APICv and MOVE_ENC_CONTEXT_FROM fixes
- Correctly emulate TLB flushes around nested vmentry/vmexit and when
the nested hypervisor uses VPID
- Prevent modifications to CPUID after the VM has run
- Other smaller bugfixes
Generic:
- Memslot handling bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
KVM: fix avic_set_running for preemptable kernels
KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
KVM: SEV: accept signals in sev_lock_two_vms
KVM: SEV: do not take kvm->lock when destroying
KVM: SEV: Prohibit migration of a VM that has mirrors
KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: initialize regions_list of a mirror VM
KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: do not use list_replace_init on an empty list
KVM: x86: Use a stable condition around all VT-d PI paths
KVM: x86: check PIR even for vCPUs with disabled APICv
KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
KVM: x86/mmu: Handle "default" period when selectively waking kthread
KVM: MMU: shadow nested paging does not have PKU
KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
...
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id(). If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id. Avoid this by blocking preemption over the
entire execution of avic_set_running().
Reported-by: Sean Christopherson <seanjc@google.com>
Fixes: 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is nothing to synchronize if APICv is disabled, since neither
other vCPUs nor assigned devices can set PIR.ON.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Generally, kvm->lock is not taken for a long time, but
sev_lock_two_vms is different: it takes vCPU locks
inside, so userspace can hold it back just by calling
a vCPU ioctl. Play it safe and use mutex_lock_killable.
Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Taking the lock is useless since there are no other references,
and there are already accesses (e.g. to sev->enc_context_owner)
that do not take it. So get rid of it.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:
1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)
The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:
1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M
So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.
Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that we have a facility to lock two VMs with deadlock
protection, use it for the creation of mirror VMs as well. One of
COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst)
would always fail, so the combination is nonsensical and it is okay to
return -EBUSY if it is attempted.
This sidesteps the question of what happens if a VM is
MOVE_ENC_CONTEXT_FROM'd at the same time as it is
COPY_ENC_CONTEXT_FROM'd: the locking prevents that from
happening.
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow intra-host migration of a mirror VM; the destination VM will be
a mirror of the same ASID as the source.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
but technically harmless because the region list was unused for a mirror
VM. However, it is untidy and it now causes a NULL pointer access when
attempting to move the encryption context of a mirror VM.
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context")
Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Encapsulate the handling of the migration_in_progress flag for both VMs in
two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter
if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit
later, and this change 1) keeps the cleanup chain of labels smaller 2)
makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic.
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
list_replace_init cannot be used if the source is an empty list,
because "new->next->prev = new" will overwrite "old->next":
new old
prev = new, next = new prev = old, next = old
new->next = old->next prev = new, next = old prev = old, next = old
new->next->prev = new prev = new, next = old prev = old, next = new
new->prev = old->prev prev = old, next = old prev = old, next = old
new->next->prev = new prev = old, next = old prev = new, next = new
The desired outcome instead would be to leave both old and new the same
as they were (two empty circular lists). Use list_cut_before, which
already has the necessary check and is documented to discard the
previous contents of the list that will hold the result.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.
However, these checks do not attempt to synchronize with changes to
the IRTE. In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.
To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it. In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.
Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.
This is not a problem for performance, or if anything it is an
improvement. First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed. Second, static_call_cond will
elide the function call if APICv is not present or disabled. Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If APICv is disabled for this vCPU, assigned devices may still attempt to
post interrupts. In that case, we need to cancel the vmentry and deliver
the interrupt with KVM_REQ_EVENT. Extend the existing code that handles
injection of L1 interrupts into L2 to cover this case as well.
vmx_hwapic_irr_update is only called when APICv is active so it would be
confusing to add a check for vcpu->arch.apicv_active in there. Instead,
just use vmx_set_rvi directly in vmx_sync_pir_to_irr.
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period. Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.
Fixes: 4dfe4f40d8 ("kvm: x86: mmu: Make NX huge page recovery period configurable")
Cc: stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding
incorrect interpretations of the nested hypervisor's page tables.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs. Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.
Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the yield-safe variant of the TDP MMU iterator when handling an
unmapping event from the MMU notifier, as most occurences of the event
allow yielding.
Fixes: e1eed5847b ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015008.3780032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjr4UTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXEXEAC79s58FkEV5LsKnmNe5pr/XG47Fgbz
MMTX+P4IUwUrvRHPzEKPdO4lR8ZSFefhSHcPA06cWYyNyeh/UHq/sB4JGysuH7bw
JAIJhJ3PsLv19TWvIKN3WHB7R3gwqKYzzzsKjKJHfepHv7kzidYz1fu380/bA2Zw
LDjlMaQFSIA4rk7sBB/FFC3wsCvmU69iwG82x+QXX3ZWk0eB8bmw4Vsz6tsjo1D2
EtEyxmjh2HjI+AbTORUsqucWvoKq3PvA9CX4SJdT/u+/ZX1zYR6iwsBkB+oiDNFx
mxN15QKUA0t5NmF3dtoftj0lLM7iT7rd9PAdyPbWDm1o02z7Q780tsv7S8A2njMp
ACwvCkupE9LSmEpFH9YyQeCBtyppXd0laN0mcL3h1MmSQbZs3KrtqyQDJ6McjPwL
TiDFa8PKCBejGms7P6gPBv0DCsNNBQ0alAioi+zVsn8WbnMKGBpb0nplAdn1Cxt0
GuJR63oEpX8Jtejb/018cEdDBiEmGh+7HXVXy4giAKnn92Pd7AhpWXwUYGuoD7j3
1nRI9TOEoWS9ZcK7goaVp9D42YCEPsn66RPGyg25EaTlWJ0g9tmpbXhGzFPYg8FS
tGIBw7CwlGMIKng+MWwuGu9hJYk0Tc1NcvrkxYTSVFNDYo4AQDpqwLlGuJsO893c
8tFApEuQ+w5zkg==
=QG5U
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build fix from Thomas Gleixner:
"A single fix for a missing __init annotation of prepare_command_line()"
* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Mark prepare_command_line() __init
The macro is unused after commit 00ecd54013 so it can be removed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd54013 ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYaD8mwAKCRCAXGG7T9hj
vspAAPwLA5SUorji33PTetwmcpLcoRJ3Q4HAPz+bOPdm9iL/PgD/V8MtxFrFebBs
AJoa+GmBarUNn7XCqKnCcA64iXhrpQw=
=6GY3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- Kconfig fix to make it possible to control building of the privcmd
driver
- three fixes for issues identified by the kernel test robot
- a five-patch series to simplify timeout handling for Xen PV driver
initialization
- two patches to fix error paths in xenstore/xenbus driver
initialization
* tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: make HYPERVISOR_set_debugreg() always_inline
xen: make HYPERVISOR_get_debugreg() always_inline
xen: detect uninitialized xenbus in xenbus_init
xen: flag xen_snd_front to be not essential for system boot
xen: flag pvcalls-front to be not essential for system boot
xen: flag hvc_xen to be not essential for system boot
xen: flag xen_drm_front to be not essential for system boot
xen: add "not_essential" flag to struct xenbus_driver
xen/pvh: add missing prototype to header
xen: don't continue xenstore initialization in case of errors
xen/privcmd: make option visible in Kconfig
INVLPG operates on guest virtual address, which are represented by
vcpu->arch.walk_mmu. In nested virtualization scenarios,
kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were
emulated by L0 (in practice, it hardly happen) when nested two-dimensional
paging is enabled, the call to ->tlb_flush_gva() would be skipped and
the hardware TLB entry would not be invalidated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the is an L1 with nNPT in 32bit, the shadow walk starts with
pae_root.
Fixes: a717a780fc ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 63f5a1909f ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after first successful KVM_RUN and promissed to make this sequence forbiden
in 5.16. It's time to fulfil the promise.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02. From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12. Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.
For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.
Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of
always flushing vpid01. Any TLB flush that is triggered when L2 is
active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE
and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB
flush needs to be applied to L2's VPID.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synchronize the two calls to kvm_x86_sync_pir_to_irr. The one
in the reenter-guest fast path invoked the callback unconditionally
even if LAPIC is present but disabled. In this case, there are
no interrupts to deliver, and therefore posted interrupts can
be ignored.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>