Devicetree source is a description of hardware and hardware has only one
block @20008000 which can be configured either as eMMC or SDHC. Having
two node for different modes is an obscure, unusual and confusing way to
configure it. Instead the board file is supposed to customize the block
to its needs, e.g. to SDHC mode.
This fixes dtbs_check warning:
arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dt.yaml: sdhc@20008000: $nodename:0: 'sdhc@20008000' does not match '^mmc(@.*)?$'
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
According to bindings, the compatible must include microchip,mpfs. This
fixes dtbs_check warning:
arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dt.yaml: /: compatible: ['microchip,mpfs-icicle-kit'] is too short
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
The DTSI file defines soc node and address/size cells, so there is no
point in duplicating it in DTS file.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
The Microchip PolarFire SoC FPGA DTSI uses Cadence SD/SDIO/eMMC Host
Controller without any additional vendor compatible:
arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dt.yaml: mmc@20008000: compatible:0: 'cdns,sd4hc' is not one of ['socionext,uniphier-sd4hc']
arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dt.yaml: mmc@20008000: compatible: ['cdns,sd4hc'] is too short
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFawaIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO7xwf9HF/zvySIRhwWsxI6RZKA5k+g0ebe
18XzPeBMPeW1C5YLukL67cvoxgW8oHc/ZMhdUKVSEkgMkhF7c6uNgtvodtv7ZFk8
/3wfBAERp37VKltliU2JvVs/8GXboK+uKnq0mIt52CO9+JVH/55URtz7Spxnl8Fc
Lx7ioeKoaPw2mnyuU6sSNwt7HuUtqn8meQ1EcYLCGND7Vyu/BevRGvqRGuXqipVb
pDlpz5Cu5T0ZCmBj/ko5PmPCU3COQw2jtiugUvKsAjPM92ViNK8B9O7cbrcEh8fi
XCOsV1KjzF0kJCvacZMQqkjijL3UriNBRW5JAvd4boV97PGcI9KHZdEFYw==
=65Mk
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmFb1ygTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYicKuD/9pkpLuebSwmgzDRQ4tcGKQUc+h1rCP
U++2qb74Zi7XX5JnXwVraAC9F0ps1hOCfQzgncKd6sT/VTrbKqmPl0QF0NMniCjN
xOk5X4AsZH/qGkgizG8o73Ok9+qclbpyTeYJ5iKBKwwhDOAsutGDsSgp5nrYJYgI
30xX62eH0XvfyN1Un/qSqE/3TErkwOAlIcFlAr1Nkel0jnOfMzgdUjQEW/h4hxci
8NWFLVdP1WVdlWSmbCclu1totU884VdPquE0tb6lh6dBQXsAB///AjNHZxsO7jZ0
UGBG9d4hY40zyvu9FB+kvYyafrZXyhCc+AmphtYBY/cd3TyBXEO98mYPQB/h5Tmh
wsi1bVIqSAYfw04jJvvtLQYWJybB1el9Mp2jeKg2tEp+j96vvax2FGuHV39lcWXj
b79M/gd7fAyY+oQhYoFDsLUoY/FYiECvpBQQrgB8AMl+JiNMOp0pWZIGf98paLh+
3CjX6WCjmxBNrNbM0bj+mmm3dj2Aq8OG98BVsolqqhUah55t7Wu6Hjax7wHgqORk
FfnuemXLKE4H2AZvMcjaAqBd7/Q1SYRfO+LzcfoE8Nhwek3RaHg3wz5n8c8xT8BW
9gK8yC/UrdVZNf41nBqKEjYKaholorBP8/qZoNZGxPP87dlLxkFup91D5MtEn+c+
FNSX5c6BY+8vbQ==
=ywXC
-----END PGP SIGNATURE-----
Merge tag 'for-riscv' of https://git.kernel.org/pub/scm/virt/kvm/kvm.git into for-next
H extension definitions, shared by the KVM and RISC-V trees.
* tag 'for-riscv' of ssh://gitolite.kernel.org/pub/scm/virt/kvm/kvm: (301 commits)
RISC-V: Add hypervisor extension related CSR defines
KVM: selftests: Ensure all migrations are performed when test is affined
KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
x86/kvmclock: Move this_cpu_pvti into kvmclock.h
KVM: s390: Function documentation fixes
selftests: KVM: Don't clobber XMM register when read
KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
selftests: KVM: Explicitly use movq to read xmm registers
selftests: KVM: Call ucall_init when setting up in rseq_test
KVM: Remove tlbs_dirty
KVM: X86: Synchronize the shadow pagetable before link it
KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
KVM: x86: nSVM: don't copy virt_ext from vmcb12
KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
KVM: x86: nSVM: restore int_vector in svm_clear_vintr
kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
...
Add the ability to do randconfig build targets for both
rv32 and rv64.
Based on a similar patch by Michael Ellerman for PowerPC.
Usage:
make ARCH=riscv rv32_randconfig
or
make ARCH=riscv rv64_randconfig
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Even if mmu doesn't support ASID, current code calculates @num_asids=1
which is misleading, so avoid setting any asid related variables in such
case.
Also while here, print the number of asid bits discovered even for the
disabled case.
Verified this on Hifive Unmatched.
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Set pm_power_off to NULL like on all other architectures, check if it
is set in machine_halt() and machine_power_off() and fallback to
default_power_off if no other power driver got registered.
This brings riscv architecture inline with all other architectures,
and allows to reuse exiting power drivers unmodified.
Kernels without legacy SBI v0.1 extensions (CONFIG_RISCV_SBI_V01 is
not set), do not set pm_power_off to sbi_shutdown(). There is no
support for SBI v0.3 system reset extension either. This prevents
using gpio_poweroff on SiFive HiFive Unmatched.
Tested on SiFive HiFive unmatched, with a dtb specifying gpio-poweroff
node and kernel complied without CONFIG_RISCV_SBI_V01.
BugLink: https://bugs.launchpad.net/bugs/1942806
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Tested-by: Ron Economos <w6rz@comcast.net>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Implement generic vdso time namespace support which also enables time
namespaces for riscv. This is quite similar to what arm64 does.
selftest/timens test result:
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
ok 3 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
ok 4 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Totals: pass:8 fail:0 xfail:0 xpass:0 skip:2 error:0
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This contains a VDSO cleanup, along with a handful of VDSO fixes. These
serve as a base for the new time namespace support, but are also on
fixes.
* palmer/riscv-vdso-cleanup:
riscv/vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
riscv/vdso: Move vdso data page up front
riscv/vdso: Refactor asm/vdso.h
riscv architectures relying on mmap_sem for write in their
arch_setup_additional_pages. If the waiting task gets killed by the oom
killer it would block oom_reaper from asynchronous address space reclaim
and reduce the chances of timely OOM resolving. Wait for the lock in
the killable mode and return with EINTR if the task got killed while
waiting.
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Fixes: 76d2a0493a ("RISC-V: Init and Halt Code")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
As commit 601255ae3c ("arm64: vdso: move data page before code pages"), the
same issue exists on riscv, testcase is shown below, make sure that vdso.so is
bigger than page size,
struct timespec tp;
clock_gettime(5, &tp);
printf("tv_sec: %ld, tv_nsec: %ld\n", tp.tv_sec, tp.tv_nsec);
without this patch, test result : tv_sec: 0, tv_nsec: 0
with this patch, test result : tv_sec: 1629271537, tv_nsec: 748000000
Move the vdso data page in front of the VDSO area to fix the issue.
Fixes: ad5d1122b8 ("riscv: use vDSO common flow to reduce the latency of the time-related functions")
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
The asm/vdso.h will be included in vdso.lds.S in the next patch, the
following cleanup is needed to avoid syntax error:
1.the declaration of sys_riscv_flush_icache() is moved into asm/syscall.h.
2.the definition of struct vdso_data is moved into kernel/vdso.c.
2.the definition of VDSO_SYMBOL is placed under "#ifndef __ASSEMBLY__".
Also remove the redundant linux/types.h include.
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Rework the CPU selection in the migration worker to ensure the specified
number of migrations are performed when the test iteslf is affined to a
subset of CPUs. The existing logic skips iterations if the target CPU is
not in the original set of possible CPUs, which causes the test to fail
if too many iterations are skipped.
==== Test Assertion Failure ====
rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2)
pid=10127 tid=10127 errno=4 - Interrupted system call
1 0x00000000004018e5: main at rseq_test.c:227
2 0x00007fcc8fc66bf6: ?? ??:0
3 0x0000000000401959: _start at ??:?
Only performed 4 KVM_RUNs, task stalled too much?
Calculate the min/max possible CPUs as a cheap "best effort" to avoid
high runtimes when the test is affined to a small percentage of CPUs.
Alternatively, a list or xarray of the possible CPUs could be used, but
even in a horrendously inefficient setup, such optimizations are not
needed because the runtime is completely dominated by the cost of
migrating the task, and the absolute runtime is well under a minute in
even truly absurd setups, e.g. running on a subset of vCPUs in a VM that
is heavily overcommited (16 vCPUs per pCPU).
Fixes: 61e52f1630 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210929234112.1862848-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check whether a CPUID entry's index is significant before checking for a
matching index to hack-a-fix an undefined behavior bug due to consuming
uninitialized data. RESET/INIT emulation uses kvm_cpuid() to retrieve
CPUID.0x1, which does _not_ have a significant index, and fails to
initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
ECX, a.k.a. index, input.
Practically speaking, it's _extremely_ unlikely any compiler will yield
code that causes problems, as the compiler would need to inline the
kvm_cpuid() call to detect the uninitialized data, and intentionally hose
the kernel, e.g. insert ud2, instead of simply ignoring the result of
the index comparison.
Although the sketchy "dummy" pattern was introduced in SVM by commit
66f7b72e11 ("KVM: x86: Make register state after reset conform to
specification"), it wasn't actually broken until commit 7ff6c03503
("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
order of operations such that "index" was checked before the significant
flag.
Avoid consuming uninitialized data by reverting to checking the flag
before the index purely so that the fix can be easily backported; the
offending RESET/INIT code has been refactored, moved, and consolidated
from vendor code to common x86 since the bug was introduced. A future
patch will directly address the bad RESET/INIT behavior.
The undefined behavior was detected by syzbot + KernelMemorySanitizer.
BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020
Local variable ----dummy@kvm_vcpu_reset created at:
kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 2a24be79b6 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
Fixes: 7ff6c03503 ("KVM: x86: Remove stateful CPUID handling")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements;
if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is
64 of higher, retrieving the struct pvclock_vcpu_time_info pointer with
"src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and
a wild pointer. Change it to "this_cpu_pvti()" which is guaranteed to
be valid.
Fixes: 95a3d4454b ("Switch kvmclock data to a PER_CPU variable")
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There're other modules might use hv_clock_per_cpu variable like ptp_kvm,
so move it into kvmclock.h and export the symbol to make it visiable to
other modules.
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The latest compile changes pointed us to a few instances where we use
the kernel documentation style but don't explain all variables or
don't adhere to it 100%.
It's easy to fix so let's do that.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
There is no need to clobber a register that is only being read from.
Oops. Drop the XMM register from the clobbers list.
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210927223621.50178-1-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry,
clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i].
Modifying guest_uret_msrs directly is completely broken as 'i' does not
point at the MSR_IA32_TSX_CTRL entry. In fact, it's guaranteed to be an
out-of-bounds accesses as is always set to kvm_nr_uret_msrs in a prior
loop. By sheer dumb luck, the fallout is limited to "only" failing to
preserve the host's TSX_CTRL_CPUID_CLEAR. The out-of-bounds access is
benign as it's guaranteed to clear a bit in a guest MSR value, which are
always zero at vCPU creation on both x86-64 and i386.
Cc: stable@vger.kernel.org
Fixes: 8ea8b8d6f8 ("KVM: VMX: Use common x86's uret MSR list as the one true list")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Add missing FORCE target when building the EL2 object
- Fix a PMU probe regression on some platforms
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmFNjR4PHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDImkQAL2Y+jxz3fd8oXHPFIOW1TkHTaafyk5DmqmP
WqwjQLj+WpI/q7bQawbhaoFSonFhkf2OOu4BvsrEdEzQIiONkbuIcdqMHjPKfeHK
x4dx/JG7K/W/b86WOh479L9M7VQYw0F53tgOwUxf9tHZgXl1gmlUy1K9qxqyLty6
1aYaoU8/3xuHI4RDEmQTfS/gj7X/1ng7OK3B3ZeUV1U4xOuBxcmS+leevilSE7c2
QC5mgU9Dsa3UkATdz+TZtmbx9BeTRhyNar0KnrwsaSkTti4miT8DoApoU9JdRVr3
VoSkgb2hdyT8J/jBNg+6Rs3uQr2Lp0scyXRDgZDdba7FPbUuDzCqKxfpMoXNU8Um
G5kOJGBg07cI4l8S/giIaqO6r8cZooBNuWKuJDKqED9ikbma4hG8kaLsHP4BtjNZ
INUVeL39je/5/gp6XaRPYBKqEajp5bRnxbPOzWKqqELV3s4ArtZPs8cVS6c8b0YL
9O7pv7EXueQHOgoF0Zh1H14wv4iBK5LKHGv3r/uks0ryBc/x93qeAVKcQAZ+l/s2
dPWfQzDHvwkQybkkYw4XlVE2kLTKxbcvolN+++TIfCzvXyt/pcL5MPlkZIvMfZhd
3YuN44NH61pnn5h0lHlWk9mNPnBoAswjw5qopa1kr5YVcYCcvJ0MdB/Wb6Xm473p
/DFZv20H
=klvK
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/arm64 fixes for 5.15, take #1
- Add missing FORCE target when building the EL2 object
- Fix a PMU probe regression on some platforms
Compiling the KVM selftests with clang emits the following warning:
>> include/x86_64/processor.h:297:25: error: variable 'xmm0' is uninitialized when used here [-Werror,-Wuninitialized]
>> return (unsigned long)xmm0;
where xmm0 is accessed via an uninitialized register variable.
Indeed, this is a misuse of register variables, which really should only
be used for specifying register constraints on variables passed to
inline assembly. Rather than attempting to read xmm registers via
register variables, just explicitly perform the movq from the desired
xmm register.
Fixes: 783e9e5126 ("kvm: selftests: add API testing infrastructure")
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210924005147.1122357-1-oupton@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While x86 does not require any additional setup to use the ucall
infrastructure, arm64 needs to set up the MMIO address used to signal a
ucall to userspace. rseq_test does not initialize the MMIO address,
resulting in the test spinning indefinitely.
Fix the issue by calling ucall_init() during setup.
Fixes: 61e52f1630 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210923220033.4172362-1-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no user of tlbs_dirty.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM. So the host must synchronze sp before
link it. Otherwise the guest might use a wrong mapping.
For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb. The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.
Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without flushing tlb remotely after kvm_sync_page(). If @gfn
was writable before and it's rmaps was deleted in kvm_sync_page(),
and if the tlb entry is still in a remote running VCPU, the @gfn
is not safely protected.
To fix the problem, kvm_sync_page() does the remote flush when
needed to avoid the problem.
Fixes: a4ee1ca4a3 ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These field correspond to features that we don't expose yet to L2
While currently there are no CVE worthy features in this field,
if AMD adds more features to this field, that could allow guest
escapes similar to CVE-2021-3653 and CVE-2021-3656.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-6-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
GP SVM errata workaround made the #GP handler always emulate
the SVM instructions.
However these instructions #GP in case the operand is not 4K aligned,
but the workaround code didn't check this and we ended up
emulating these instructions anyway.
This is only an emulation accuracy check bug as there is no harm for
KVM to read/write unaligned vmcb images.
Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Test that if:
* L1 disables virtual interrupt masking, and INTR intercept.
* L1 setups a virtual interrupt to be injected to L2 and enters L2 with
interrupts disabled, thus the virtual interrupt is pending.
* Now an external interrupt arrives in L1 and since
L1 doesn't intercept it, it should be delivered to L2 when
it enables interrupts.
to do this L0 (abuses) V_IRQ to setup an
interrupt window, and returns to L2.
* L2 enables interrupts.
This should trigger the interrupt window,
injection of the external interrupt and delivery
of the virtual interrupt that can now be done.
* Test that now L2 gets those interrupts.
This is the test that demonstrates the issue that was
fixed in the previous patch.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In svm_clear_vintr we try to restore the virtual interrupt
injection that might be pending, but we fail to restore
the interrupt vector.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a
consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.
We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.
Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L1 had invalid state on VM entry (can happen on SMM transactions
when we enter from real mode, straight to nested guest),
then after we load 'host' state from VMCS12, the state has to become
valid again, but since we load the segment registers with
__vmx_set_segment we weren't always updating emulation_required.
Update emulation_required explicitly at end of load_vmcs12_host_state.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is possible that when non root mode is entered via special entry
(!from_vmentry), that is from SMM or from loading the nested state,
the L2 state could be invalid in regard to non unrestricted guest mode,
but later it can become valid.
(for example when RSM emulation restores segment registers from SMRAM)
Thus delay the check to VM entry, where we will check this and fail.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since no actual VM entry happened, the VM exit information is stale.
To avoid this, synthesize an invalid VM guest state VM exit.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use return statements instead of nested if, and fix error
path to free all the maps that were allocated.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently the KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads PDPTRs,
and MSR bitmap, with former not really needed for SMM as SMM exit code
reloads them again from SMRAM'S CR3, and later happens to work
since MSR bitmap isn't modified while in SMM.
Still it is better to be consistient with VMX.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When exiting SMM, pdpts are loaded again from the guest memory.
This fixes a theoretical bug, when exit from SMM triggers entry to the
nested guest which re-uses some of the migration
code which uses this flag as a workaround for a legacy userspace.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when
enlightened VMCS is advertised. Debugging revealed there are two exposed
secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and
SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported,
as there are no corresponding fields in eVMCSv1 (see the comment above
EVMCS1_UNSUPPORTED_2NDEXEC definition).
Previously, commit 31de3d2500 ("x86/kvm/hyper-v: move VMX controls
sanitization out of nested_enable_evmcs()") introduced the required
filtering mechanism for VMX MSRs but for some reason put only known
to be problematic (and not full EVMCS1_UNSUPPORTED_* lists) controls
there.
Note, Windows Server 2022 seems to have gained some sanity check for VMX
MSRs: it doesn't even try to launch a guest when there's something it
doesn't like, nested_evmcs_check_controls() mechanism can't catch the
problem.
Let's be bold this time and instead of playing whack-a-mole just filter out
all unsupported controls from VMX MSRs.
Fixes: 31de3d2500 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210907163530.110066-1-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for a NULL cpumask_var_t when kicking multiple vCPUs via
cpumask_available(), which performs a !NULL check if and only if cpumasks
are configured to be allocated off-stack. This is a meaningless
optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
importantly helps document that the NULL check is necessary even though
all callers pass in a local variable.
No functional change intended.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simpy be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
Cc: Venkatesh Srinivas <venkateshs@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 97222cc831 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KASAN reports the following issue:
BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
Call Trace:
dump_stack+0xa5/0xe6
print_address_description.constprop.0+0x18/0x130
? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
__kasan_report.cold+0x7f/0x114
? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
kasan_report+0x38/0x50
kasan_check_range+0xf5/0x1d0
kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
? kvm_arch_exit+0x110/0x110 [kvm]
? sched_clock+0x5/0x10
ioapic_write_indirect+0x59f/0x9e0 [kvm]
? static_obj+0xc0/0xc0
? __lock_acquired+0x1d2/0x8c0
? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
The problem appears to be that 'vcpu_bitmap' is allocated as a single long
on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
the lower 16 bits of it with bitmap_zero() for no particular reason (my
guess would be that 'bitmap' and 'vcpu_bitmap' variables in
kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
16-bit long, the later should accommodate all possible vCPUs).
Fixes: 7ee30bc132 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Fixes: 9a2ae9f6b6 ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210827092516.1027264-7-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The calculation to get the per-slot dirty bitmap was incorrect leading
to a buffer overrun. Fix it by splitting out the dirty bitmap into a
separate bitmap per slot.
Fixes: 609e6202ea ("KVM: selftests: Support multiple slots in dirty_log_perf_test")
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20210917173657.44011-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All selftests that support the backing_src option were printing their
own description of the flag and then calling backing_src_help() to dump
the list of available backing sources. Consolidate the flag printing in
backing_src_help() to align indentation, reduce duplicated strings, and
improve consistency across tests.
Note: Passing "-s" to backing_src_help is unnecessary since every test
uses the same flag. However I decided to keep it for code readability
at the call sites.
While here this opportunistically fixes the incorrectly interleaved
printing -x help message and list of backing source types in
dirty_log_perf_test.
Fixes: 609e6202ea ("KVM: selftests: Support multiple slots in dirty_log_perf_test")
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210917173657.44011-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Every other KVM selftest uses -s for the backing_src, so switch
demand_paging_test to match.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210917173657.44011-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A mirrored SEV-ES VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
setup its vCPUs and have them measured, and their VMSAs encrypted. Without
this change, it is impossible to have mirror VMs as part of SEV-ES VMs.
Also allow the guest status check and debugging commands since they do
not change any guest state.
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Nathan Tempelman <natet@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-3-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For mirroring SEV-ES the mirror VM will need more then just the ASID.
The FD and the handle are required to all the mirror to call psp
commands. The mirror VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
setup its vCPUs' VMSAs for SEV-ES.
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Nathan Tempelman <natet@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-2-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nested bus lock VM exits are not supported yet. If L2 triggers bus lock
VM exit, it will be directed to L1 VMM, which would cause unexpected
behavior. Therefore, handle L2's bus lock VM exits in L0 directly.
Fixes: fe6b6bc802 ("KVM: VMX: Enable bus lock VM exit")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210914095041.29764-1-chenyi.qiang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>