This is mandated by the current tightening of EFI executables
requirements when used in a secure boot scenario. More specifically,
an EFI executable cannot have a single section with RWX permissions,
which conflicts with the in-place kernel decompression that is done
today. Instead, the things required by the booting kernel image are
done in the EFI stub now. Work by Ard Biesheuvel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsMGAACgkQEsHwGGHe
VUqqGg/+LJq7dapnyMiqafO9Xdpm+h7wTylChdfev7UB3wri9IJGXXFkwZZUNF8Z
p2c/ECk75CG+z7ZSv+WirZLWlmlBT1D51YjI1lHmg3Vmh3YXZWJHkwaG1DI4euL7
C4gvd/gmpXGMOxjZLV/D5d7wwoO2REyVkDeCy0lqHCisvd4GRXqUJHe6hJ3GxVv/
VqbT7ZHMP4BlbWK4/yTq9+wKSC9T0OfAOW2b+K/SAh19TpRqFBpkMOxaZx0nj2Eh
UxQUd2hntGtEGpPf71HPQMYCkWyvJD1LpTYfwld7tNvzXSt2xg2YZAf7WzJ0Od3b
kxg2mItHhVRCcwEc9OWlphvgAYCzYECzeMniesLFeJYLMvHNtORzxNPkDLrOYjT6
0K05DEWJlEqZhIWtaKJ/SSCj7JnfmgubfK3tG3cerCelQZIvUix6ysvDBwPnqdUr
3O1d7iMlM1UtqMo0GLeIp0q1aRKMsNfRtUvqjaiQ2NjSuyJqZl8t0jfKd6DHcnuy
WidBMhREULa5qYGE9dxosdTJK1wOhqomTmBSC8EeDUPevLV+DNQUn+oBOVhFvJVk
gpmsr2mi9+hwC9iNh1K4wslLl9Jc/BlIY0wwC0ysh6eTWQfrvpfXrtHRpmTvwcVW
g2lt+shYiPDa7q3bvqQL7iqFU6bkMGCs89EFGJEpNbBO076uNZY=
=2Oej
-----END PGP SIGNATURE-----
Merge tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Borislav Petkov:
"Avoid the baremetal decompressor code when booting on an EFI machine.
This is mandated by the current tightening of EFI executables
requirements when used in a secure boot scenario. More specifically,
an EFI executable cannot have a single section with RWX permissions,
which conflicts with the in-place kernel decompression that is done
today.
Instead, the things required by the booting kernel image are done in
the EFI stub now.
Work by Ard Biesheuvel"
* tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/efistub: Avoid legacy decompressor when doing EFI boot
x86/efistub: Perform SNP feature test while running in the firmware
efi/libstub: Add limit argument to efi_random_alloc()
x86/decompressor: Factor out kernel decompression and relocation
x86/decompressor: Move global symbol references to C code
decompress: Use 8 byte alignment
x86/efistub: Prefer EFI memory attributes protocol over DXE services
x86/efistub: Perform 4/5 level paging switch from the stub
x86/decompressor: Merge trampoline cleanup with switching code
x86/decompressor: Pass pgtable address to trampoline directly
x86/decompressor: Only call the trampoline when changing paging levels
x86/decompressor: Call trampoline directly from C code
x86/decompressor: Avoid the need for a stack in the 32-bit trampoline
x86/decompressor: Use standard calling convention for trampoline
x86/decompressor: Call trampoline as a normal function
x86/decompressor: Assign paging related global variables earlier
x86/decompressor: Store boot_params pointer in callee save register
x86/efistub: Clear BSS in EFI handover protocol entrypoint
x86/decompressor: Avoid magic offsets for EFI handover entrypoint
x86/efistub: Simplify and clean up handover entry code
...
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on and
off. That's sufficient for x86 which usually has at max two threads
except for the Xeon PHI platform which has four threads per core.
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime, but
that lacks enforcement and the usability is awkward.
This update expands the sysfs interface and the core infrastructure to
accept numerical values so PowerPC can build SMT runtime control for
partial SMT enablement on top.
The core support has also been provided to the PowerPC maintainers who
added the PowerPC related changes on top.
- Minor cleanups and documentation updates.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsj4wTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaszEADKMd/6m7/Bq7RU2OJ+IXw8yfMEF9nS
6HPrFu71a4cDufb/G8UckQOvkwdTFWD7bZ0snJe2sBDFTOtzK/inYkgPZTxlm7si
JcJmFnHKUM7OTwNZb7Tv1bd9Csz4JhggAYUw6P8CqsCmhQ+p6ECemx3bHDlYiywm
5eW2yzI9EM4dbsHPwUOvjI0WazGvAf0esSDAS8JTnhBXbd8FAckbMV+xuRPcCUK+
dBqbqr+3Nf4/wcXTro/gZIc7sEATAHH6m7zHlLVBSyVPnBxre8NLz6KciW4SezyJ
GWFnDV03mmG2KxQ2ugwI8n6M3zDJQtfEJFwW/x4t2M5RK+ka2a6G6GtCLHYOXLWR
akIuBXtTAC57BgpqzBihGej9eiC1BJ1QMa9ZK+6WDXSZtMTFOLlbwdY2/qyfxpfw
LfepWb+UMtFy5YyW84S1O5/AqpOtKD2kPTqfDjvDxWIAigispU+qwAKxcMzMjtwz
aAlf2Z/iX0R9DkRzGD2gaFG5AUsRich8RtVO7u+WDwYSsi8ywrvryiPlZrDDBkSQ
sRzdoHeXNGVY/FgkbZmEyBj4udrypymkR6ivqn6C2OrysgznSiv5NC983uS6TfJX
cVqdUv6CNYYNiNu0x0Qf0MluYT2s5c1Fa4bjCBJL+KwORwjM3+TCN9RA1KtFrW2T
G3Ta1KqI6wRonA==
=JQRJ
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
"Updates for the CPU hotplug core:
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on
and off. That's sufficient for x86 which usually has at max two
threads except for the Xeon PHI platform which has four threads per
core
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime,
but that lacks enforcement and the usability is awkward
This update expands the sysfs interface and the core infrastructure
to accept numerical values so PowerPC can build SMT runtime control
for partial SMT enablement on top
The core support has also been provided to the PowerPC maintainers
who added the PowerPC related changes on top
- Minor cleanups and documentation updates"
* tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: core-api/cpuhotplug: Fix state names
cpu/hotplug: Remove unused function declaration cpu_set_state_online()
cpu/SMT: Fix cpu_smt_possible() comment
cpu/SMT: Allow enabling partial SMT states via sysfs
cpu/SMT: Create topology_smt_thread_allowed()
cpu/SMT: Remove topology_smt_supported()
cpu/SMT: Store the current/max number of threads
cpu/SMT: Move smt/control simple exit cases earlier
cpu/SMT: Move SMT prototypes into cpu_smt.h
cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
This pull reqeust contains the following:
o Handle negative skews in "skew is too large" messages.
o Extend watchdog check exemption to 4-Socket platforms
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcF94THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jPeHEACXvYFMwno+1DPlt2PtFJglnkuYyOUQ
SwRRiOLWXdzsikzjWIvdrSOqslbdIjhZmSSyPDjyo4GPYFSSOyKNsqczwsX8R49u
O/2Yzkrx2OmIWcKiAkII8Iw4fFedoITtzC59wbZHoo/+upEpbZYP3u7AjJTird7y
du3WcWcGc1eEt5+7MNwbZfwpzo2t5Rb3Wqfgs6vnKTG7Abc/23uChsCBzPavX7X/
djNd1bA5YmEldKKxSoF5XSW/F1TWIA4fXMDkBwgRKHBx1Y7xU+nJMtam4ogAzN6a
4zgthgy5wQ7/VnTBv2rmQQb6ae3Blm09Yg1ac8zt95RLgmGkyX73lZGnRBTdYqyh
kb47Tfw3a+e+VPTD+W3rY1NOSOwbstdDHVckK+0bFvqNyXOoaEJL+EEOhm9rPXxv
le+T6Ct1VPAF9lHPUz7lVCVXN91vP4Gqrxjmeq5rqWNOvRW3jBcCLnEpFzWJtu50
JjQBi3HA0HW+Bxqov22W0llFAa0gVm8xyxXfNSSL7VoCinnS6/qyQvD1GoG0brk3
l5orOk38m2/acvTyvw2tnvAAuqmOr+oGlcQhJOOVl7jDz+sae6RMTwWCMaVxKUDo
YW7v5YFQePmzC9J4M5XFMdCCTb3cUCjVLMdsPgONM3kn9ALEDhhTBSw76N//bGLg
4/OEsT7Aq6LHkA==
=6/HI
-----END PGP SIGNATURE-----
Merge tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull clocksource watchdog updates from Paul McKenney:
- Handle negative skews in "skew is too large" messages
- Extend watchdog check exemption to 4-Socket platforms
* tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
x86/tsc: Extend watchdog check exemption to 4-Sockets platform
clocksource: Handle negative skews in "skew is too large" messages
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver).
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song).
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn).
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva).
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt).
- Convert nsproxy.count to refcount_t (Elena Reshetova).
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test.
- Fix build failure with gcc-plugins on GCC 14+.
- Fix selftests to respect SKIP for signal-delivery tests.
- Fix CFI warning for paravirt callback prototype.
- Clarify documentation for seq_show_option_n() usage.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs6ZAWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkpjD/9AeST5Imc2t0t71Qd+wPxW3jT3
kDZPlHH8wHmuxSpRscX82m21SozvEMvybo6Cp7FSH4qr863FnBWMlo8acr7rKxUf
0f7Y9qgY/hKADiVx5p0pbnCgcy+l4pwsxIqVCGuhjvNCbWHrdGqLM4UjIfaVz5Ws
+55a/C3S1KVwB1s1+6to43jtKqQAx6yrqYWOaT3wEfCzHC87f9PUHhIGnFQVwPGP
WpjQI/BQKpH7+MDCoJOPrZqXaE/4lWALxR6+5BBheGbvLoWifpJEYHX6bDUzkgBz
liQDkgr4eAw5EXSOS7mX3EApfeMKakznJt9Mcmn0h3pPRlM3ZSVD64Xrou2Brpje
exS2JRuh6HwIiXY9nTHc6YMGcAWG1syAR/hM2fQdujM0CWtBUk9+kkuYWsqF6nIK
3tOxYLB/Ph4p+tShd+v5R3mEmp/6snYRKJoUk+9Fk67i54NnK4huyxaCO4zui+ML
3vHuGp8KgFHUjJaYmYXHs3TRZnKSFUkPGc4MbpiGtmJ9zhfSwlhhF+yfBJCsvmTf
ZajA+sPupT4OjLxU6vUD/ZNkXAEjWzktyX2v9YBA7FHh7SqPtX9ARRIxh417AjEJ
tBPHhW/iRw9ftBIAKDmI7gPLynngd/zvjhvk6O5egHYjjgRM1/WAJZ4V26XR6+hf
TWfQb7VRzdZIqwOEUA==
=9ZWP
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As has become normal, changes are scattered around the tree (either
explicitly maintainer Acked or for trivial stuff that went ignored):
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver)
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song)
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn)
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva)
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt)
- Convert nsproxy.count to refcount_t (Elena Reshetova)
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test
- Fix build failure with gcc-plugins on GCC 14+
- Fix selftests to respect SKIP for signal-delivery tests
- Fix CFI warning for paravirt callback prototype
- Clarify documentation for seq_show_option_n() usage"
* tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
LoadPin: Annotate struct dm_verity_loadpin_trusted_root_digest with __counted_by
kallsyms: Change func signature for cleanup_symbol_name()
kallsyms: Fix kallsyms_selftest failure
nsproxy: Convert nsproxy.count to refcount_t
integrity: Annotate struct ima_rule_opt_list with __counted_by
lkdtm: Add FAM_BOUNDS test for __counted_by
Compiler Attributes: counted_by: Adjust name and identifier expansion
um: refactor deprecated strncpy to memcpy
um: vector: refactor deprecated strncpy
alpha: Replace one-element array with flexible-array member
hardening: Move BUG_ON_DATA_CORRUPTION to hardening options
list: Introduce CONFIG_LIST_HARDENED
list_debug: Introduce inline wrappers for debug checks
compiler_types: Introduce the Clang __preserve_most function attribute
gcc-plugins: Rename last_stmt() for GCC 14+
selftests/harness: Actually report SKIP for signal tests
x86/paravirt: Fix tlb_remove_table function callback prototype warning
EISA: Replace all non-returning strlcpy with strscpy
perf: Replace strlcpy with strscpy
um: Remove strlcpy declaration
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc
ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z
dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA=
=24Ha
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fchmodat2 system call from Christian Brauner:
"This adds the fchmodat2() system call. It is a revised version of the
fchmodat() system call, adding a missing flag argument. Support for
both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included.
Adding this system call revision has been a longstanding request but
so far has always fallen through the cracks. While the kernel
implementation of fchmodat() does not have a flag argument the libc
provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
(see [1] and [2]).
The workaround is brittle because it relies not just on O_PATH and
O_NOFOLLOW semantics and procfs magic links but also on our rather
inconsistent symlink semantics.
This gives userspace a proper fchmodat2() system call that libcs can
use to properly implement fchmodat(3) and allows them to get rid of
their hacks. In this case it will immediately benefit them as the
current workaround is already defunct because of aformentioned
inconsistencies.
In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
AT_EMPTY_PATH with fchmodat2(). This is already possible with
fchownat() so there's no reason to not also support it for
fchmodat2().
The implementation is simple and comes with selftests. Implementation
of the system call and wiring up the system call are done as separate
patches even though they could arguably be one patch. But in case
there are merge conflicts from other system call additions it can be
beneficial to have separate patches"
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]
* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: fchmodat2: remove duplicate unneeded defines
fchmodat2: add support for AT_EMPTY_PATH
selftests: Add fchmodat2 selftest
arch: Register fchmodat2, usually as syscall 452
fs: Add fchmodat2()
Non-functional cleanup of a "__user * filename"
We rely on 'uname -s' returning 'Linux' because there are os-Linux/
directories, but no other os-*/.
Supporting a non-Linux host is unlikely to happen.
Let's hard-code 'Linux'.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
When UML is compiled under 32-bit x86, it uses its own copy of
checksum_32.S, which is terribly out-of-date and doesn't support
checksumming unaligned data.
This causes the new "checksum" KUnit test to fail:
./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_64BIT=n --cross_compile i686-linux-gnu- checksum
KTAP version 1
# Subtest: checksum
1..3
# test_csum_fixed_random_inputs: ASSERTION FAILED at lib/checksum_kunit.c:243
Expected result == expec, but
result == 33316 (0x8224)
expec == 33488 (0x82d0)
not ok 1 test_csum_fixed_random_inputs
# test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
Expected result == expec, but
result == 65280 (0xff00)
expec == 0 (0x0)
not ok 2 test_csum_all_carry_inputs
# test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
Expected result == expec, but
result == 65531 (0xfffb)
expec == 0 (0x0)
not ok 3 test_csum_no_carry_inputs
Sharing the normal implementation in arch/x86/lib both fixes all of
these issues and means any further fixes only need to be done once.
x86_64 already seems to share the same implementation between UML and
"normal" x86.
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Commit e6bcfdd75d ("x86/microcode: Hide the config knob") removed the
MICROCODE_AMD config, but left some references in defconfigs and comments,
that have no effect on any kernel build around.
Clean up those remaining config references. No functional change.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230825141226.13566-1-lukas.bulwahn@gmail.com
Merge ACPI processor driver changes for 6.6-rc1:
- Support obtaining physical CPU ID from MADT on LoongArch (Bibo Mao).
- Convert ACPI CPU initialization to using _OSC instead of _PDC that
has been depreceted since 2018 and dropped from the specification in
ACPI 6.5 (Michal Wilczynski, Rafael Wysocki).
* acpi-processor:
ACPI: processor: LoongArch: Get physical ID from MADT
ACPI: processor: Refine messages in acpi_early_processor_control_setup()
ACPI: processor: Remove acpi_hwp_native_thermal_lvt_osc()
ACPI: processor: Use _OSC to convey OSPM processor support information
ACPI: processor: Introduce acpi_processor_osc()
ACPI: processor: Set CAP_SMP_T_SWCOORD in arch_acpi_set_proc_cap_bits()
ACPI: processor: Clear C_C2C3_FFH and C_C1_FFH in arch_acpi_set_proc_cap_bits()
ACPI: processor: Rename ACPI_PDC symbols
ACPI: processor: Refactor arch_acpi_set_pdc_bits()
ACPI: processor: Move processor_physically_present() to acpi_processor.c
ACPI: processor: Move MWAIT quirk out of acpi_processor.c
Delete KVM's printk about KVM_SET_TSS_ADDR not being called. When the
printk was added by commit 776e58ea3d ("KVM: unbreak userspace that does
not sets tss address"), KVM also stuffed a "hopefully safe" value, i.e.
the message wasn't purely informational. For reasons unknown, ostensibly
to try and help people running outdated qemu-kvm versions, the message got
left behind when KVM's stuffing was removed by commit 4918c6ca68
("KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU").
Today, the message is completely nonsensical, as it has been over a decade
since KVM supported userspace running a Real Mode guest, on a CPU without
unrestricted guest support, without doing KVM_SET_TSS_ADDR before KVM_RUN.
I.e. KVM's ABI has required KVM_SET_TSS_ADDR for 10+ years.
To make matters worse, the message is prone to false positives as it
triggers when simply *creating* a vCPU due to RESET putting vCPUs into
Real Mode, even when the user has no intention of ever *running* the vCPU
in a Real Mode. E.g. KVM selftests stuff 64-bit mode and never touch Real
Mode, but trigger the message even though they run just fine without
doing KVM_SET_TSS_ADDR. Creating "dummy" vCPUs, e.g. to probe features,
can also trigger the message. In both scenarios, the message confuses
users and falsely implies that they've done something wrong.
Reported-by: Thorsten Glaser <t.glaser@tarent.de>
Closes: https://lkml.kernel.org/r/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956%40tarent.de
Link: https://lore.kernel.org/r/20230815174215.433222-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disallow SEV (and beyond) if nrips is disabled via module param, as KVM
can't read guest memory to partially emulate and skip an instruction. All
CPUs that support SEV support NRIPS, i.e. this is purely stopping the user
from shooting themselves in the foot.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't inject a #UD if KVM attempts to "emulate" to skip an instruction
for an SEV guest, and instead resume the guest and hope that it can make
forward progress. When commit 04c40f344d ("KVM: SVM: Inject #UD on
attempted emulation for SEV guest w/o insn buffer") added the completely
arbitrary #UD behavior, there were no known scenarios where a well-behaved
guest would induce a VM-Exit that triggered emulation, i.e. it was thought
that injecting #UD would be helpful.
However, now that KVM (correctly) attempts to re-inject INT3/INTO, e.g. if
a #NPF is encountered when attempting to deliver the INT3/INTO, an SEV
guest can trigger emulation without a buffer, through no fault of its own.
Resuming the guest and retrying the INT3/INTO is architecturally wrong,
e.g. the vCPU will incorrectly re-hit code #DBs, but for SEV guests there
is literally no other option that has a chance of making forward progress.
Drop the #UD injection for all "skip" emulation, not just those related to
INT3/INTO, even though that means that the guest will likely end up in an
infinite loop instead of getting a #UD (the vCPU may also crash, e.g. if
KVM emulated everything about an instruction except for advancing RIP).
There's no evidence that suggests that an unexpected #UD is actually
better than hanging the vCPU, e.g. a soft-hung vCPU can still respond to
IRQs and NMIs to generate a backtrace.
Reported-by: Wu Zongyo <wuzongyo@mail.ustc.edu.cn>
Closes: https://lore.kernel.org/all/8eb933fd-2cf3-d7a9-32fe-2a1d82eac42a@mail.ustc.edu.cn
Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Skip initializing the VMSA physical address in the VMCB if the VMSA is
NULL, which occurs during intrahost migration as KVM initializes the VMCB
before copying over state from the source to the destination (including
the VMSA and its physical address).
In normal builds, __pa() is just math, so the bug isn't fatal, but with
CONFIG_DEBUG_VIRTUAL=y, the validity of the virtual address is verified
and passing in NULL will make the kernel unhappy.
Fixes: 6defa24d3b ("KVM: SEV: Init target VMCBs in sev_migrate_from")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230825022357.2852133-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
enc_dec_hypercall() accepted a page count instead of a size, which
forced its callers to round up. As a result, non-page aligned
vaddrs caused pages to be spuriously marked as decrypted via the
encryption status hypercall, which in turn caused consistent
corruption of pages during live migration. Live migration requires
accurate encryption status information to avoid migrating pages
from the wrong perspective.
Fixes: 064ce6c550 ("mm: x86: Invoke hypercall when page encryption status is changed")
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Tested-by: Ben Hillier <bhillier@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230824223731.2055016-1-srutherford@google.com
Group the code this way so that we can avoid too many ifdef's:
Data only used in an SNP VM with the paravisor;
Functions only used in an SNP VM with the paravisor;
Data only used in an SNP VM without the paravisor;
Functions only used in an SNP VM without the paravisor;
Functions only used in a TDX VM, with and without the paravisor;
Functions used in an SNP or TDX VM, when the paravisor is present;
Functions always used, even in a regular non-CoCo VM.
No functional change.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-11-decui@microsoft.com
In ms_hyperv_init_platform(), do not distinguish between a SNP VM with
the paravisor and a SNP VM without the paravisor.
Replace hv_isolation_type_en_snp() with
!ms_hyperv.paravisor_present && hv_isolation_type_snp().
The hv_isolation_type_en_snp() in drivers/hv/hv.c and
drivers/hv/hv_common.c can be changed to hv_isolation_type_snp() since
we know !ms_hyperv.paravisor_present is true there.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-10-decui@microsoft.com
When the paravisor is present, a SNP VM must use GHCB to access some
special MSRs, including HV_X64_MSR_GUEST_OS_ID and some SynIC MSRs.
Similarly, when the paravisor is present, a TDX VM must use TDX GHCI
to access the same MSRs.
Implement hv_tdx_msr_write() and hv_tdx_msr_read(), and use the helper
functions hv_ivm_msr_read() and hv_ivm_msr_write() to access the MSRs
in a unified way for SNP/TDX VMs with the paravisor.
Do not export hv_tdx_msr_write() and hv_tdx_msr_read(), because we never
really used hv_ghcb_msr_write() and hv_ghcb_msr_read() in any module.
Update arch/x86/include/asm/mshyperv.h so that the kernel can still build
if CONFIG_AMD_MEM_ENCRYPT or CONFIG_INTEL_TDX_GUEST is not set, or
neither is set.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-9-decui@microsoft.com
The post_msg_page was removed in
commit 9a6b1a170c ("Drivers: hv: vmbus: Remove the per-CPU post_msg_page")
However, it turns out that we need to bring it back, but only for a TDX VM
with the paravisor: in such a VM, the hyperv_pcpu_input_arg is not decrypted,
but the HVCALL_POST_MESSAGE in such a VM needs a decrypted page as the
hypercall input page: see the comments in hyperv_init() for a detailed
explanation.
Except for HVCALL_POST_MESSAGE and HVCALL_SIGNAL_EVENT, the other hypercalls
in a TDX VM with the paravisor still use hv_hypercall_pg and must use the
hyperv_pcpu_input_arg (which is encrypted in such a VM), when a hypercall
input page is used.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-8-decui@microsoft.com
The new variable hyperv_paravisor_present is set only when the VM
is a SNP/TDX VM with the paravisor running: see ms_hyperv_init_platform().
We introduce hyperv_paravisor_present because we can not use
ms_hyperv.paravisor_present in arch/x86/include/asm/mshyperv.h:
struct ms_hyperv_info is defined in include/asm-generic/mshyperv.h, which
is included at the end of arch/x86/include/asm/mshyperv.h, but at the
beginning of arch/x86/include/asm/mshyperv.h, we would already need to use
struct ms_hyperv_info in hv_do_hypercall().
We use hyperv_paravisor_present only in include/asm-generic/mshyperv.h,
and use ms_hyperv.paravisor_present elsewhere. In the future, we'll
introduce a hypercall function structure for different VM types, and
at boot time, the right function pointers would be written into the
structure so that runtime testing of TDX vs. SNP vs. normal will be
avoided and hyperv_paravisor_present will no longer be needed.
Call hv_vtom_init() when it's a VBS VM or when ms_hyperv.paravisor_present
is true, i.e. the VM is a SNP VM or TDX VM with the paravisor.
Enhance hv_vtom_init() for a TDX VM with the paravisor.
In hv_common_cpu_init(), don't decrypt the hyperv_pcpu_input_arg
for a TDX VM with the paravisor, just like we don't decrypt the page
for a SNP VM with the paravisor.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-7-decui@microsoft.com
When a fully enlightened TDX guest runs on Hyper-V, the UEFI firmware sets
the HW_REDUCED flag and consequently ttyS0 interrupts can't work. Fix the
issue by overriding x86_init.acpi.reduced_hw_early_init().
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-5-decui@microsoft.com
Add Hyper-V specific code so that a fully enlightened TDX guest (i.e.
without the paravisor) can run on Hyper-V:
Don't use hv_vp_assist_page. Use GHCI instead.
Don't try to use the unsupported HV_REGISTER_CRASH_CTL.
Don't trust (use) Hyper-V's TLB-flushing hypercalls.
Don't use lazy EOI.
Share the SynIC Event/Message pages with the hypervisor.
Don't use the Hyper-V TSC page for now, because non-trivial work is
required to share the page with the hypervisor.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-4-decui@microsoft.com
A fully enlightened TDX guest on Hyper-V (i.e. without the paravisor) only
uses the GHCI call rather than hv_hypercall_pg. Do not initialize
hypercall_pg for such a guest.
In hv_common_cpu_init(), the hyperv_pcpu_input_arg page needs to be
decrypted in such a guest.
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-3-decui@microsoft.com
No logic change to SNP/VBS guests.
hv_isolation_type_tdx() will be used to instruct a TDX guest on Hyper-V to
do some TDX-specific operations, e.g. for a fully enlightened TDX guest
(i.e. without the paravisor), hv_do_hypercall() should use
__tdx_hypercall() and such a guest on Hyper-V should handle the Hyper-V
Event/Message/Monitor pages specially.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-2-decui@microsoft.com
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE
for all possible CPUs. As such, subsequent changes to CPUs (ie. hot
un/plug, online/offline) do not need to rewrite the elfcorehdr.
The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote the
initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes.
The kimage->elfcorehdr_updated term covers kdump images loaded via the
kexec_load() syscall. At least one memory or CPU change must occur to
cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.
This code is intentionally *NOT* hoisted into crash_handle_hotplug_event()
as it would prevent the arch-specific handler from running for CPU
changes. This would break PPC, for example, which needs to update other
information besides the elfcorehdr, on CPU changes.
Link: https://lkml.kernel.org/r/20230814214446.6659-9-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hotplug support for kexec_load() requires changes to the userspace
kexec-tools and a little extra help from the kernel.
Given a kdump capture kernel loaded via kexec_load(), and a subsequent
hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites
it to reflect the hotplug change. That is the desired outcome, however,
at kernel panic time, the purgatory integrity check fails (because the
elfcorehdr changed), and the capture kernel does not boot and no vmcore is
generated.
Therefore, the userspace kexec-tools/kexec must indicate to the kernel
that the elfcorehdr can be modified (because the kexec excluded the
elfcorehdr from the digest, and sized the elfcorehdr memory buffer
appropriately).
To facilitate hotplug support with kexec_load():
- a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is
safe for the kernel to modify the kexec_load()'d elfcorehdr
- the /sys/kernel/crash_elfcorehdr_size node communicates the
preferred size of the elfcorehdr memory buffer
- The sysfs crash_hotplug nodes (ie.
/sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
take into account kexec_file_load() vs kexec_load() and
KEXEC_UPDATE_ELFCOREHDR.
This is critical so that the udev rule processing of crash_hotplug
is all that is needed to determine if the userspace unload-then-load
of the kdump image is to be skipped, or not. The proposed udev
rule change looks like:
# The kernel updates the crash elfcorehdr for CPU and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):
Kernel |Kexec
-------+-----+----
Old |Old |New
| a | a
-------+-----+----
New | a | b
-------+-----+----
where kexec 'old' and 'new' delineate kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and 'new'
delineate the kernel supports this crash hotplug feature.
Behavior 'a' indicates the unload-then-reload of the entire kdump image.
For the kexec 'old' column, the unload-then-reload occurs due to the
missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec)
does not present the crash_hotplug sysfs node, which leads to the
unload-then-reload of the kdump image.
Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload of
the kdump image.
If the udev rule is not updated with crash_hotplug node check, then no
matter any combination of kernel or kexec is new or old, the kdump image
continues to be unload-then-reload on hotplug changes.
To fully support crash hotplug feature, there needs to be a rollout of
kernel, kexec-tools and udev rule changes. However, the order of the
rollout of these pieces does not matter; kexec_load()'d kdump images still
function for hotplug as-is.
Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system, must also
be updated.
A new elfcorehdr is generated from the available CPUs and memory and
replaces the existing elfcorehdr. The segment containing the elfcorehdr
is identified at run-time in crash_core:crash_handle_hotplug_event().
No modifications to purgatory (see 'kexec: exclude elfcorehdr from the
segment digest') or boot_params (as the elfcorehdr= capture kernel command
line parameter pointer remains unchanged and correct) are needed, just
elfcorehdr.
For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and
CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU
and memory resources.
For kexec_load(), the userspace kexec utility needs to size the elfcorehdr
segment in the same/similar manner.
To accommodate kexec_load() syscall in the absence of kexec_file_load()
syscall support, prepare_elf_headers() and dependents are moved outside of
CONFIG_KEXEC_FILE.
[eric.devolder@oracle.com: correct unused function build error]
Link: https://lkml.kernel.org/r/20230821182644.2143-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-6-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tell the page table check how many PTEs & PFNs we want it to check.
Link: https://lkml.kernel.org/r/20230802151406.3735276-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released. However with per-VMA locks behavior is
different and the caller should still release it. To make the rules
consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
VM_FAULT_COMPLETED for now.
[willy@infradead.org: fix riscv]
Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy`!
In this case, it means we can drop the `...-1` from:
| strncpy(to, from, len-1);
as well as remove the comment mentioning NUL-termination as `strscpy`
implicitly grants us this behavior.
There should be no functional change as I don't believe the padding from
`strncpy` is needed here. If it turns out that the padding is necessary
we should use `strscpy_pad` as a direct replacement.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20230822-strncpy-arch-x86-kernel-apic-x2apic_uv_x-v1-1-91d681d0b3f3@google.com
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy`!
In this case, it is a simple swap from `strncpy` to `strscpy`. There is
one slight difference, though. If NUL-padding is a functional
requirement here we should opt for `strscpy_pad`. It seems like this
shouldn't be needed as I see no obvious signs of any padding being
required.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20230822-strncpy-arch-x86-kernel-hpet-v1-1-2c7d3be86f4a@google.com
Both `strncpy` and `strcpy` are deprecated for use on NUL-terminated
destination strings [1].
A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy` or `strcpy`!
In this case, we can drop both the forced NUL-termination and the `... -1` from:
| strncpy(arg, val, ACTION_LEN - 1);
as `strscpy` implicitly has this behavior.
Also include slight refactor to code removing possible new-line chars as
per Yang Yang's work at [3]. This reduces code size and complexity by
using more robust and better understood interfaces.
Co-developed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://lore.kernel.org/all/202212091545310085328@zte.com.cn/ [3]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20230824-strncpy-arch-x86-platform-uv-uv_nmi-v2-1-e16d9a3ec570@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
0-Day found a 34.6% regression in stress-ng's 'af-alg' test case, and
bisected it to commit b81fac906a ("x86/fpu: Move FPU initialization into
arch_cpu_finalize_init()"), which optimizes the FPU init order, and moves
the CR4_OSXSAVE enabling into a later place:
arch_cpu_finalize_init
identify_boot_cpu
identify_cpu
generic_identify
get_cpu_cap --> setup cpu capability
...
fpu__init_cpu
fpu__init_cpu_xstate
cr4_set_bits(X86_CR4_OSXSAVE);
As the FPU is not yet initialized the CPU capability setup fails to set
X86_FEATURE_OSXSAVE. Many security module like 'camellia_aesni_avx_x86_64'
depend on this feature and therefore fail to load, causing the regression.
Cure this by setting X86_FEATURE_OSXSAVE feature right after OSXSAVE
enabling.
[ tglx: Moved it into the actual BSP FPU initialization code and added a comment ]
Fixes: b81fac906a ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/202307192135.203ac24e-oliver.sang@intel.com
Link: https://lore.kernel.org/lkml/20230823065747.92257-1-feng.tang@intel.com
The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
valid and should be reloaded when returning to userspace. However, the
kernel will skip doing this if the FPU registers are already valid as
determined by fpregs_state_valid(). The logic embedded there considers
the state valid if two cases are both true:
1: fpu_fpregs_owner_ctx points to the current tasks FPU state
2: the last CPU the registers were live in was the current CPU.
This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
the current FPU during the fpregs_restore_userregs() operation, so it
indicates that the registers have been restored on this CPU. But this
alone doesn’t preclude that the task hasn’t been rescheduled to a
different CPU, where the registers were modified, and then back to the
current CPU. To verify that this was not the case the logic relies on the
second condition. So the assumption is that if the registers have been
restored, AND they haven’t had the chance to be modified (by being
loaded on another CPU), then they MUST be valid on the current CPU.
Besides the lazy FPU optimizations, the other cases where the FPU
registers might not be valid are when the kernel modifies the FPU register
state or the FPU saved buffer. In this case the operation modifying the
FPU state needs to let the kernel know the correspondence has been
broken. The comment in “arch/x86/kernel/fpu/context.h” has:
/*
...
* If the FPU register state is valid, the kernel can skip restoring the
* FPU state from memory.
*
* Any code that clobbers the FPU registers or updates the in-memory
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
* Either one of these invalidation functions is enough. Invalidate
* a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
However, this is not completely true. When the kernel modifies the
registers or saved FPU state, it can only rely on
__fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
tracking. The exec path instead relies on fpregs_deactivate(), which sets
the CPU’s FPU context to NULL. This was observed to fail to restore the
reset FPU state to the registers when returning to userspace in the
following scenario:
1. A task is executing in userspace on CPU0
- CPU0’s FPU context points to tasks
- fpu->last_cpu=CPU0
2. The task exec()’s
3. While in the kernel the task is preempted
- CPU0 gets a thread executing in the kernel (such that no other
FPU context is activated)
- Scheduler sets task’s fpu->last_cpu=CPU0 when scheduling out
4. Task is migrated to CPU1
5. Continuing the exec(), the task gets to
fpu_flush_thread()->fpu_reset_fpregs()
- Sets CPU1’s fpu context to NULL
- Copies the init state to the task’s FPU buffer
- Sets TIF_NEED_FPU_LOAD on the task
6. The task reschedules back to CPU0 before completing the exec() and
returning to userspace
- During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
- Skips saving the registers and updating task’s fpu→last_cpu,
because TIF_NEED_FPU_LOAD is the canonical source.
7. Now CPU0’s FPU context is still pointing to the task’s, and
fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true even
though the reset FPU state has not been restored.
So the root cause is that exec() is doing the wrong kind of invalidate. It
should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
going away, they are just getting reset as part of an exec. So switch to
__fpu_invalidate_fpregs_state().
Also, delete the misleading comment that says that either kind of
invalidate will be enough, because it’s not always the case.
Fixes: 33344368cb ("x86/fpu: Clean up the fpu__clear() variants")
Reported-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lijun Pan <lijun.pan@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Lijun Pan <lijun.pan@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230818170305.502891-1-rick.p.edgecombe@intel.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmTiDvweHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG7doH/2Poj73npPKPVfT3
RF8AsgvAj2pLby67rdvwTdFX+exS63SsuwtGGRfAHfGabiwmNN+oT2dLb0aY15bp
nskHnFpcqQ/pfZ2i2rUCenzBX8S9QULvPidLKRaf1FcSdOzqd97Bw5oDPMtzqy/R
Pm+Dzs//7fYvtm69nt6hKW4d6wXxNcg7Fk/QgoJ5Ax9vGvDuZmWXH0ZgBf/5kH04
TTPQNtVX57lf+FHugkhFEn4JbYXvN168b+LuX2PHwOeG/8AIS69Hc0vgvhHNAycT
mmpUI1gWA2jfrJ2RCyyezF/6wy9Ocsp+CbPjfwjuRUxOk0XIm1+cp9Mlz/cRbMsZ
f0tOTpk=
=xrJp
-----END PGP SIGNATURE-----
BackMerge tag 'v6.5-rc7' into drm-next
Linux 6.5-rc7
This is needed for the CI stuff and the msm pull has fixes in it.
Signed-off-by: Dave Airlie <airlied@redhat.com>
When CONFIG_HYPERV is not set, arch/x86/hyperv/ivm.c is not built (see
arch/x86/Kbuild), so 'isolation_type_en_snp' in the ivm.c is not defined,
and this failure happens:
ld: arch/x86/kernel/cpu/mshyperv.o: in function `ms_hyperv_init_platform':
arch/x86/kernel/cpu/mshyperv.c:417: undefined reference to `isolation_type_en_snp'
Fix the failure by testing hv_get_isolation_type() and
ms_hyperv.paravisor_present for a fully enlightened SNP VM: when
CONFIG_HYPERV is not set, hv_get_isolation_type() is defined as a
static inline function that always returns HV_ISOLATION_TYPE_NONE
(see include/asm-generic/mshyperv.h), so the compiler won't generate any
code for the ms_hyperv.paravisor_present and static_branch_enable().
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Closes: https://lore.kernel.org/lkml/b4979997-23b9-0c43-574e-e4a3506500ff@amd.com/
Fixes: d6e2d65244 ("x86/hyperv: Add sev-snp enlightened guest static key")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230823032008.18186-1-decui@microsoft.com
When building without CONFIG_AMD_MEM_ENCRYPT, there are several
repeated instances of -Wunused-function due to missing 'inline' on
the stub of hy_snp_boot_ap():
In file included from drivers/hv/hv_common.c:29:
./arch/x86/include/asm/mshyperv.h:272:12: error: 'hv_snp_boot_ap' defined but not used [-Werror=unused-function]
272 | static int hv_snp_boot_ap(int cpu, unsigned long start_ip) { return 0; }
| ^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Add 'inline' to fix the warnings.
Fixes: 44676bb9d5 ("x86/hyperv: Add smp support for SEV-SNP guest")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230822-hv_snp_boot_ap-missing-inline-v1-1-e712dcb2da0f@kernel.org
__efi_call_virt() exists as an alternative for efi_call_virt() for the
sole reason that ResetSystem() returns void, and so we cannot use a call
to it in the RHS of an assignment.
Given that there is only a single user, let's drop the macro, and expand
it into the caller. That way, the remaining macro can be tightened
somewhat in terms of type safety too.
Note that the use of typeof() on the runtime service invocation does not
result in an actual call being made, but it does require a few pointer
types to be fixed up and converted into the proper function pointer
prototypes.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Add Hyperv-specific handling for faults caused by VMMCALL
instructions.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-9-ltykernel@gmail.com
In the AMD SEV-SNP guest, AP needs to be started up via sev es
save area and Hyper-V requires to call HVCALL_START_VP hypercall
to pass the gpa of sev es save area with AP's vp index and VTL(Virtual
trust level) parameters. Override wakeup_secondary_cpu_64 callback
with hv_snp_boot_ap.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-8-ltykernel@gmail.com
In sev-snp enlightened guest, Hyper-V hypercall needs
to use vmmcall to trigger vmexit and notify hypervisor
to handle hypercall request.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-6-ltykernel@gmail.com
hv vp assist page needs to be shared between SEV-SNP guest and Hyper-V.
So mark the page unencrypted in the SEV-SNP guest.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-4-ltykernel@gmail.com
SEV-SNP guests on Hyper-V can run at multiple Virtual Trust
Levels (VTL). During boot, get the VTL at which we're running
using the GET_VP_REGISTERs hypercall, and save the value
for future use. Then during VMBus initialization, set the VTL
with the saved value as required in the VMBus init message.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-3-ltykernel@gmail.com
On the parisc architecture, lockdep reports for all static objects which
are in the __initdata section (e.g. "setup_done" in devtmpfs,
"kthreadd_done" in init/main.c) this warning:
INFO: trying to register non-static key.
The warning itself is wrong, because those objects are in the __initdata
section, but the section itself is on parisc outside of range from
_stext to _end, which is why the static_obj() functions returns a wrong
answer.
While fixing this issue, I noticed that the whole existing check can
be simplified a lot.
Instead of checking against the _stext and _end symbols (which include
code areas too) just check for the .data and .bss segments (since we check a
data object). This can be done with the existing is_kernel_core_data()
macro.
In addition objects in the __initdata section can be checked with
init_section_contains(), and is_kernel_rodata() allows keys to be in the
_ro_after_init section.
This partly reverts and simplifies commit bac59d18c7 ("x86/setup: Fix static
memory detection").
Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100
Fixes: bac59d18c7 ("x86/setup: Fix static memory detection")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.
Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.
Link: https://lkml.kernel.org/r/20230807230513.102486-14-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Add support for memmap on memory feature on ppc64", v8.
This patch series update memmap on memory feature to fall back to
memmap allocation outside the memory block if the alignment rules are
not met. This makes the feature more useful on architectures like
ppc64 where alignment rules are different with 64K page size.
This patch (of 6):
Instead of adding menu entry with all supported architectures, add
mm/Kconfig variable and select the same from supported architectures.
No functional change in this patch.
Link: https://lkml.kernel.org/r/20230808091501.287660-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20230808091501.287660-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Making virt_to_pfn() a static inline taking a strongly typed
(const void *) makes the contract of a passing a pointer of that
type to the function explicit and exposes any misuse of the
macro virt_to_pfn() acting polymorphic and accepting many types
such as (void *), (unitptr_t) or (unsigned long) as arguments
without warnings.
Also fix all offending call sites to pass a (void *) rather
than an unsigned long. Since virt_to_mfn() is wrapping
virt_to_pfn() this function has become polymorphic as well
so the usage need to be fixed up.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230810-virt-to-phys-x86-xen-v1-1-9e966d333e7a@linaro.org
Signed-off-by: Juergen Gross <jgross@suse.com>
After removing the conditional return from xen_create_contiguous_region(),
the accompanying comment was left in place, but it now precedes an
unrelated conditional and confuses readers.
Fixes: 989513a735 ("xen: cleanup pvh leftovers from pv-only sources")
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230802163151.1486-1-petrtesarik@huaweicloud.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently kcov instrument is disabled for object files under
arch/x86/kernel folder.
For object files under arch/x86/kernel, actually just disabling the kcov
instrument of files:"head32.o or head64.o and sev.o" could achieve
successful booting and provide kcov coverage for object files that do not
disable kcov instrument. The additional kcov coverage collected from
arch/x86/kernel folder helps kernel fuzzing efforts to find bugs.
Link to related improvement discussion is below:
https://groups.google.com/g/syzkaller/c/Dsl-RYGCqs8/m/x-tfpTyFBAAJ Related
ticket is as follow: https://bugzilla.kernel.org/show_bug.cgi?id=198443
Link: https://lkml.kernel.org/r/06c0bb7b5f61e5884bf31180e8c122648c752010.1690771380.git.pengfei.xu@intel.com
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: <heng.su@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kees Cook <keescook@google.com>,
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU. This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.
Let's extend the API to take a CPU ID to exclude instead of just a
boolean. This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.
Arguably, this new API also encourages safer behavior. Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.
[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
range.h works with struct range data type. The resource_size_t
is an alien here.
(1) Move cap_resource() implementation into its only user, and
(2) rename and move RESOURCE_SIZE_MAX to limits.h.
Link: https://lkml.kernel.org/r/20230804064636.15368-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Kconfig refactor to consolidate KEXEC and CRASH options utilized
option names of the form ARCH_SUPPORTS_<option>. Thus rename the
ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow
the same.
Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.
Link: https://lkml.kernel.org/r/20230712161545.87870-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to
read-only) and the output address (pfn) of the vmemmap ptes. That is not
supported without unmapping of pte(marking it invalid) by some
architectures.
With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb
vmemmap optimization disabled. Hence split DAX optimization support into
a different config.
s390, loongarch and riscv don't have devdax support. So the DAX config is
not enabled for them. With this change, arm64 should be able to select
DAX optimization
[1] commit 060a2c92d1 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
Link: https://lkml.kernel.org/r/20230724190759.483013-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Handle most file-backed faults under the VMA lock", v3.
This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.
This patch (of 10):
Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to
eliminate ifdefs in the users.
Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are two main use cases for mmu notifiers. One is by KVM which uses
mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time. Hence using start/end() can lead to memory corruption as these
callbacks happen too soon/late during page unmap.
mmu notifier users should therefore either use the start()/end() callbacks
or the invalidate_range() callbacks. To make this usage clearer rename
the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
update documention.
Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync
with the CPU page tables. Currently it is called from separate code paths
to the main CPU TLB invalidations. This can lead to a secondary TLB not
getting invalidated when required and makes it hard to reason about when
exactly the secondary TLB is invalidated.
To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs requiring
explicit software invalidations.
This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require
a TLB invalidation. This invalidation is done by the architecture
specific ptep_set_access_flags() which calls flush_tlb_page() if required.
However this doesn't call the notifier resulting in infinite faults being
generated by devices using the SMMU if it has previously cached a
read-only PTE in it's TLB.
Moving the invalidations into the TLB invalidation functions ensures all
invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the notifier as
none of the IOMMUs require this.
Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently we'll flush the mm in flush_tlb_batched_pending() to avoid race
between reclaim unmaps pages by batched TLB flush and mprotect/munmap/etc.
Other architectures like arm64 may only need a synchronization
barrier(dsb) here rather than a full mm flush. So add
arch_flush_tlb_batched_pending() to allow an arch-specific implementation
here. This intends no functional changes on x86 since still a full mm
flush for x86.
Link: https://lkml.kernel.org/r/20230717131004.12662-4-yangyicong@huawei.com
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch does some preparation works to extend batched TLB flush to
arm64. Including:
- Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
to accept an additional argument for address, architectures
like arm64 may need this for tlbi.
- Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
to match its current function since we don't need to handle
mm on architectures like arm64 and add_mm is not proper,
add_pending will make sense to both as on x86 we're pending the
TLB flush operations while on arm64 we're pending the synchronize
operations.
This intends no functional changes on x86.
Link: https://lkml.kernel.org/r/20230717131004.12662-3-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "arm64: support batched/deferred tlb shootdown during page
reclamation/migration", v11.
Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simplest micro benchmark shows even on
snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is
huge even for paging out one page mapped by only one process: 5.36% a.out
[kernel.kallsyms] [k] ptep_clear_flush
While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown.
The same benchmark can result in 16.99% CPU consumption on ARM64 server
with around 100 cores according to the test on patch 4/4.
This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
sync in arch_tlbbatch_flush()
Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset. The micro benchmark becomes 5% faster even for one page
mapped by single process on snapdragon 888.
Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
does some renaming/extension for the current implementation first (Patch
1-3), then add the support on arm64 (Patch 4).
This patch (of 4):
The entire scheme of deferred TLB flush in reclaim path rests on the fact
that the cost to refill TLB entries is less than flushing out individual
entries by sending IPI to remote CPUs. But architecture can have
different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.
[yangyicong@hisilicon.com: rebase and fix incorrect return value type]
Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: ioremap: Convert architectures to take GENERIC_IOREMAP
way", v8.
Motivation and implementation:
==============================
Currently, many architecutres have't taken the standard GENERIC_IOREMAP
way to implement ioremap_prot(), iounmap(), and ioremap_xx(), but make
these functions specifically under each arch's folder. Those cause many
duplicated code of ioremap() and iounmap().
In this patchset, firstly introduce generic_ioremap_prot() and
generic_iounmap() to extract the generic code for GENERIC_IOREMAP. By
taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch. Arch needs to provide
wrapper functions to override the generic version if there's arch specific
handling in its corresponding ioremap_prot(), ioremap() or iounmap().
With these changes, duplicated ioremap/iounmap() code uder ARCH-es are
removed, and the equivalent functioality is kept as before.
Background info:
================
1) The converting more architectures to take GENERIC_IOREMAP way is
suggested by Christoph in below discussion:
https://lore.kernel.org/all/Yp7h0Jv6vpgt6xdZ@infradead.org/T/#u
2) In the previous v1 to v3, it's basically further action after arm64
has converted to GENERIC_IOREMAP way in below patchset. It's done by
adding hook ioremap_allowed() and iounmap_allowed() in ARCH to add ARCH
specific handling the middle of ioremap_prot() and iounmap().
[PATCH v5 0/6] arm64: Cleanup ioremap() and support ioremap_prot()
https://lore.kernel.org/all/20220607125027.44946-1-wangkefeng.wang@huawei.com/T/#u
Later, during v3 reviewing, Christophe Leroy suggested to introduce
generic_ioremap_prot() and generic_iounmap() to generic codes, and ARCH
can provide wrapper function ioremap_prot(), ioremap() or iounmap() if
needed. Christophe made a RFC patchset as below to specially demonstrate
his idea. This is what v4 and now v5 is doing.
[RFC PATCH 0/8] mm: ioremap: Convert architectures to take GENERIC_IOREMAP way
https://lore.kernel.org/all/cover.1665568707.git.christophe.leroy@csgroup.eu/T/#u
Testing:
========
In v8, I only applied this patchset onto the latest linus's tree to build
and run on arm64 and s390.
This patch (of 19):
Let's use '#define ioremap_xx' and "#ifdef ioremap_xx" instead.
To remove defined ARCH_HAS_IOREMAP_xx macros in <asm/io.h> of each ARCH,
the ARCH's own ioremap_wc|wt|np definition need be above "#include
<asm-generic/iomap.h>. Otherwise the redefinition error would be seen
during compiling. So the relevant adjustments are made to avoid compiling
error:
loongarch:
- doesn't include <asm-generic/iomap.h>, defining ARCH_HAS_IOREMAP_WC
is redundant, so simply remove it.
m68k:
- selected GENERIC_IOMAP, <asm-generic/iomap.h> has been added in
<asm-generic/io.h>, and <asm/kmap.h> is included above
<asm-generic/iomap.h>, so simply remove ARCH_HAS_IOREMAP_WT defining.
mips:
- move "#include <asm-generic/iomap.h>" below ioremap_wc definition
in <asm/io.h>
powerpc:
- remove "#include <asm-generic/iomap.h>" in <asm/io.h> because it's
duplicated with the one in <asm-generic/io.h>, let's rely on the
latter.
x86:
- selected GENERIC_IOMAP, remove #include <asm-generic/iomap.h> in
the middle of <asm/io.h>. Let's rely on <asm-generic/io.h>.
Link: https://lkml.kernel.org/r/20230706154520.11257-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Helge Deller <deller@gmx.de>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in __page_table_check_pud_set and
page_table_check_pud_set.
Link: https://lkml.kernel.org/r/20230713172636.1705415-9-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in __page_table_check_pmd_set and
page_table_check_pmd_set.
Link: https://lkml.kernel.org/r/20230713172636.1705415-8-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in __page_table_check_pte_set and
page_table_check_pte_set.
Link: https://lkml.kernel.org/r/20230713172636.1705415-7-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in __page_table_check_pud_clear and
page_table_check_pud_clear.
Link: https://lkml.kernel.org/r/20230713172636.1705415-6-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in page_table_check_pmd_clear and
__page_table_check_pmd_clear.
Link: https://lkml.kernel.org/r/20230713172636.1705415-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove unused addr in page_table_check_pte_clear and
__page_table_check_pte_clear.
Link: https://lkml.kernel.org/r/20230713172636.1705415-4-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Instruction Fetch (IF) units on current AMD Zen-based systems do not
guarantee a synchronous #MC is delivered for poison consumption errors.
Therefore, MCG_STATUS[EIPV|RIPV] will not be set. However, the
microarchitecture does guarantee that the exception is delivered within
the same context. In other words, the exact rIP is not known, but the
context is known to not have changed.
There is no architecturally-defined method to determine this behavior.
The Code Segment (CS) register is always valid on such IF unit poison
errors regardless of the value of MCG_STATUS[EIPV|RIPV].
Add a quirk to save the CS register for poison consumption from the IF
unit banks.
This is needed to properly determine the context of the error.
Otherwise, the severity grading function will assume the context is
IN_KERNEL due to the m->cs value being 0 (the initialized value). This
leads to unnecessary kernel panics on data poison errors due to the
kernel believing the poison consumption occurred in kernel context.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230814200853.29258-1-yazen.ghannam@amd.com
Specify how is SRSO mitigated when SMT is disabled. Also, correct the
SMT check for that.
Fixes: e9fbc47b81 ("x86/srso: Disable the mitigation on unaffected configurations")
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20230814200813.p5czl47zssuej7nv@treble
Now that KVM has a framework for caching guest CPUID feature flags, add
a "rule" that IRQs must be enabled when doing guest CPUID lookups, and
enforce the rule via a lockdep assertion. CPUID lookups are slow, and
within KVM, IRQs are only ever disabled in hot paths, e.g. the core run
loop, fast page fault handling, etc. I.e. querying guest CPUID with IRQs
disabled, especially in the run loop, should be avoided.
Link: https://lore.kernel.org/r/20230815203653.519297-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "virtual NMI exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vnmi" param means that
the code isn't strictly equivalent, as vnmi_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/20230815203653.519297-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "virtual GIF exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vgif" param means that
the code isn't strictly equivalent, as vgif_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
Link: https://lore.kernel.org/r/20230815203653.519297-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "Pause Filtering is exposed to L1" via governed feature flags
instead of using dedicated bits/flags in vcpu_svm.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "LBR virtualization exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "lbrv" param means that
the code isn't strictly equivalent, as lbrv_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/20230815203653.519297-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Opportunistically add a comment explaining why KVM disallows virtual
VMLOAD/VMSAVE when the vCPU model is Intel.
No functional change intended.
Link: https://lore.kernel.org/r/20230815203653.519297-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "TSC scaling exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, this fixes a benign bug where KVM would mark TSC scaling as exposed
to L1 even if overall nested SVM supported is disabled, i.e. KVM would let
L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR
support to userspace.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "NRIPS exposed to L1" via a governed feature flag instead of using
a dedicated bit/flag in vcpu_svm.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track "VMX exposed to L1" via a governed feature flag instead of using a
dedicated helper to provide the same functionality. The main goal is to
drive convergence between VMX and SVM with respect to querying features
that are controllable via module param (SVM likes to cache nested
features), avoiding the guest CPUID lookups at runtime is just a bonus
and unlikely to provide any meaningful performance benefits.
Note, X86_FEATURE_VMX is set in kvm_cpu_caps if and only if "nested" is
true, and the CPU obviously supports VMX if KVM+VMX is running. I.e. the
check on "nested" is now implicitly down by the kvm_cpu_cap_has() check
in kvm_governed_feature_check_and_set().
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviwed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the governed feature framework to track if XSAVES is "enabled", i.e.
if XSAVES can be used by the guest. Add a comment in the SVM code to
explain the very unintuitive logic of deliberately NOT checking if XSAVES
is enumerated in the guest CPUID model.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename the XSAVES secondary execution control to follow KVM's preferred
style so that XSAVES related logic can use common macros that depend on
KVM's preferred style.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230815203653.519297-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Check KVM CPU capabilities instead of raw VMX support for XSAVES when
determining whether or not XSAVER can/should be exposed to the guest.
Practically speaking, it's nonsensical/impossible for a CPU to support
"enable XSAVES" without XSAVES being supported natively. The real
motivation for checking kvm_cpu_cap_has() is to allow using the governed
feature's standard check-and-set logic.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Recompute whether or not XSAVES is enabled for the guest only if the
guest's CPUID model changes instead of redoing the computation every time
KVM generates vmcs01's secondary execution controls. The boot_cpu_has()
and cpu_has_vmx_xsaves() checks should never change after KVM is loaded,
and if they do the kernel/KVM is hosed.
Opportunistically add a comment explaining _why_ XSAVES is effectively
exposed to the guest if and only if XSAVE is also exposed to the guest.
Practically speaking, no functional change intended (KVM will do fewer
computations, but should still see the same xsaves_enabled value whenever
KVM looks at it).
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the governed feature framework to track whether or not the guest can
use 1GiB pages, and drop the one-off helper that wraps the surprisingly
non-trivial logic surrounding 1GiB page usage in the guest.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce yet another X86_FEATURE flag framework to manage and cache KVM
governed features (for lack of a better name). "Governed" in this case
means that KVM has some level of involvement and/or vested interest in
whether or not an X86_FEATURE can be used by the guest. The intent of the
framework is twofold: to simplify caching of guest CPUID flags that KVM
needs to frequently query, and to add clarity to such caching, e.g. it
isn't immediately obvious that SVM's bundle of flags for "optional nested
SVM features" track whether or not a flag is exposed to L1.
Begrudgingly define KVM_MAX_NR_GOVERNED_FEATURES for the size of the
bitmap to avoid exposing governed_features.h in arch/x86/include/asm/, but
add a FIXME to call out that it can and should be cleaned up once
"struct kvm_vcpu_arch" is no longer expose to the kernel at large.
Cc: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Correct the spec_ctrl field in the VMCB save area based on the AMD
Programmer's manual.
Originally, the spec_ctrl was listed as u32 with 4 bytes of reserved
area. The AMD Programmer's Manual now lists the spec_ctrl as 8 bytes
in VMCB save area.
The Public Processor Programming reference for Genoa, shows SPEC_CTRL
as 64b register, but the AMD Programmer's Manual lists SPEC_CTRL as
32b register. This discrepancy will be cleaned up in next revision of
the AMD Programmer's Manual.
Since remaining bits above bit 7 are reserved bits in SPEC_CTRL MSR
and thus, not being used, the spec_ctrl added as u32 in the VMCB save
area is currently not an issue.
Fixes: 3dd2775b74 ("KVM: SVM: Create a separate mapping for the SEV-ES save area")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20230717041903.85480-1-manali.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In VMX, ept_level looks better than tdp_level and is consistent with
SVM's get_npt_level().
Signed-off-by: Shiyuan Gao <gaoshiyuan@baidu.com>
Link: https://lore.kernel.org/r/20230810113853.98114-1-gaoshiyuan@baidu.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Update the target pCPU for IOMMU doorbells when updating IRTE routing if
KVM is actively running the associated vCPU. KVM currently only updates
the pCPU when loading the vCPU (via avic_vcpu_load()), and so doorbell
events will be delayed until the vCPU goes through a put+load cycle (which
might very well "never" happen for the lifetime of the VM).
To avoid inserting a stale pCPU, e.g. due to racing between updating IRTE
routing and vCPU load/put, get the pCPU information from the vCPU's
Physical APIC ID table entry (a.k.a. avic_physical_id_cache in KVM) and
update the IRTE while holding ir_list_lock. Add comments with --verbose
enabled to explain exactly what is and isn't protected by ir_list_lock.
Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Reported-by: dengqiao.joey <dengqiao.joey@bytedance.com>
Cc: stable@vger.kernel.org
Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20230808233132.2499764-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hoist the acquisition of ir_list_lock from avic_update_iommu_vcpu_affinity()
to its two callers, avic_vcpu_load() and avic_vcpu_put(), specifically to
encapsulate the write to the vCPU's entry in the AVIC Physical ID table.
This will allow a future fix to pull information from the Physical ID entry
when updating the IRTE, without potentially consuming stale information,
i.e. without racing with the vCPU being (un)loaded.
Add a comment to call out that ir_list_lock does NOT protect against
multiple writers, specifically that reading the Physical ID entry in
avic_vcpu_put() outside of the lock is safe.
To preserve some semblance of independence from ir_list_lock, keep the
READ_ONCE() in avic_vcpu_load() even though acuiring the spinlock
effectively ensures the load(s) will be generated after acquiring the
lock.
Cc: stable@vger.kernel.org
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20230808233132.2499764-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the WARN in KVM_RUN that asserts that KVM isn't using the hypervisor
timer, a.k.a. the VMX preemption timer, for a vCPU that is in the
UNINITIALIZIED activity state. The intent of the WARN is to sanity check
that KVM won't drop a timer interrupt due to an unexpected transition to
UNINITIALIZED, but unfortunately userspace can use various ioctl()s to
force the unexpected state.
Drop the sanity check instead of switching from the hypervisor timer to a
software based timer, as the only reason to switch to a software timer
when a vCPU is blocking is to ensure the timer interrupt wakes the vCPU,
but said interrupt isn't a valid wake event for vCPUs in UNINITIALIZED
state *and* the interrupt will be dropped in the end.
Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
Closes: https://lore.kernel.org/all/CALcu4rbFrU4go8sBHk3FreP+qjgtZCGcYNpSiEXOLm==qFv7iQ@mail.gmail.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230808232057.2498287-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can
pass event specific information up and down the stack without needing to
constantly expand and churn the APIs. Lockless aging of SPTEs will pass
around a bitmap, and support for memory attributes will pass around the
new attributes for the range.
Add a "KVM_NO_ARG" placeholder to simplify handling events without an
argument (creating a dummy union variable is midly annoying).
Opportunstically drop explicit zero-initialization of the "pte" field, as
omitting the field (now a union) has the same effect.
Cc: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.com
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The following warning is reported when frame pointers and kernel IBT are
enabled:
vmlinux.o: warning: objtool: ibt_selftest+0x11: sibling call from callable instruction with modified stack frame
The problem is that objtool interprets the indirect branch in
ibt_selftest() as a sibling call, and GCC inserts a (partial) frame
pointer prologue before it:
0000 000000000003f550 <ibt_selftest>:
0000 3f550: f3 0f 1e fa endbr64
0004 3f554: e8 00 00 00 00 call 3f559 <ibt_selftest+0x9> 3f555: R_X86_64_PLT32 __fentry__-0x4
0009 3f559: 55 push %rbp
000a 3f55a: 48 8d 05 02 00 00 00 lea 0x2(%rip),%rax # 3f563 <ibt_selftest_ip>
0011 3f561: ff e0 jmp *%rax
Note the inline asm is missing ASM_CALL_CONSTRAINT, so the 'push %rbp'
happens before the indirect branch and the 'mov %rsp, %rbp' happens
afterwards.
Simplify the generated code and make it easier to understand for both
tools and humans by moving the selftest to proper asm.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/99a7e16b97bda97bf0a04aa141d6241cd8a839a2.1680912949.git.jpoimboe@kernel.org
Christian reported spurious module load crashes after some of Song's
module memory layout patches.
Turns out that if the very last instruction on the very last page of the
module is a 'JMP __x86_return_thunk' then __static_call_fixup() will
trip a fault and die.
And while the module rework made this slightly more likely to happen,
it's always been possible.
Fixes: ee88d363d1 ("x86,static_call: Use alternative RET encoding")
Reported-by: Christian Bricart <christian@bricart.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/20230816104419.GA982867@hirez.programming.kicks-ass.net
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop
"arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a
range-based TLB invalidation where the range is defined by the memslot.
Now that kvm_flush_remote_tlbs_range() can be called from common code we
can just use that and drop a bunch of duplicate code from the arch
directories.
Note this adds a lockdep assertion for slots_lock being held when
calling kvm_flush_remote_tlbs_memslot(), which was previously only
asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(),
but they all hold the slots_lock, so the lockdep assertion continues to
hold true.
Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating
kvm_flush_remote_tlbs_memslot(), since it is no longer necessary.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com
Make kvm_flush_remote_tlbs_range() visible in common code and create a
default implementation that just invalidates the whole TLB.
This paves the way for several future features/cleanups:
- Introduction of range-based TLBI on ARM.
- Eliminating kvm_arch_flush_remote_tlbs_memslot()
- Moving the KVM/x86 TDP MMU to common code.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
Rename kvm_arch_flush_remote_tlb() and the associated macro
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively.
Making the name plural matches kvm_flush_remote_tlbs() and makes it more
clear that this function can affect more than one remote TLB.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
Similar to how it doesn't make sense to have UNTRAIN_RET have two
untrain calls, it also doesn't make sense for VMEXIT to have an extra
IBPB call.
This cures VMEXIT doing potentially unret+IBPB or double IBPB.
Also, the (SEV) VMEXIT case seems to have been overlooked.
Redefine the meaning of the synthetic IBPB flags to:
- ENTRY_IBPB -- issue IBPB on entry (was: entry + VMEXIT)
- IBPB_ON_VMEXIT -- issue IBPB on VMEXIT
And have 'retbleed=ibpb' set *BOTH* feature flags to ensure it retains
the previous behaviour and issues IBPB on entry+VMEXIT.
The new 'srso=ibpb_vmexit' option only sets IBPB_ON_VMEXIT.
Create UNTRAIN_RET_VM specifically for the VMEXIT case, and have that
check IBPB_ON_VMEXIT.
All this avoids having the VMEXIT case having to check both ENTRY_IBPB
and IBPB_ON_VMEXIT and simplifies the alternatives.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.109557833@infradead.org
Since there can only be one active return_thunk, there only needs be
one (matching) untrain_ret. It fundamentally doesn't make sense to
allow multiple untrain_ret at the same time.
Fold all the 3 different untrain methods into a single (temporary)
helper stub.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.042774962@infradead.org
Rename the original retbleed return thunk and untrain_ret to
retbleed_return_thunk() and retbleed_untrain_ret().
No functional changes.
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.909378169@infradead.org
Use the existing configurable return thunk. There is absolute no
justification for having created this __x86_return_thunk alternative.
To clarify, the whole thing looks like:
Zen3/4 does:
srso_alias_untrain_ret:
nop2
lfence
jmp srso_alias_return_thunk
int3
srso_alias_safe_ret: // aliasses srso_alias_untrain_ret just so
add $8, %rsp
ret
int3
srso_alias_return_thunk:
call srso_alias_safe_ret
ud2
While Zen1/2 does:
srso_untrain_ret:
movabs $foo, %rax
lfence
call srso_safe_ret (jmp srso_return_thunk ?)
int3
srso_safe_ret: // embedded in movabs instruction
add $8,%rsp
ret
int3
srso_return_thunk:
call srso_safe_ret
ud2
While retbleed does:
zen_untrain_ret:
test $0xcc, %bl
lfence
jmp zen_return_thunk
int3
zen_return_thunk: // embedded in the test instruction
ret
int3
Where Zen1/2 flush the BTB entry using the instruction decoder trick
(test,movabs) Zen3/4 use BTB aliasing. SRSO adds a return sequence
(srso_safe_ret()) which forces the function return instruction to
speculate into a trap (UD2). This RET will then mispredict and
execution will continue at the return site read from the top of the
stack.
Pick one of three options at boot (evey function can only ever return
once).
[ bp: Fixup commit message uarch details and add them in a comment in
the code too. Add a comment about the srso_select_mitigation()
dependency on retbleed_select_mitigation(). Add moar ifdeffery for
32-bit builds. Add a dummy srso_untrain_ret_alias() definition for
32-bit alternatives needing the symbol. ]
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.842775684@infradead.org
With MSR_AMD64_SEV_DEBUG_SWAP enabled, the guest is not expected to
receive a #VC for reads or writes of DR7.
Update the SNP_FEATURES_PRESENT mask with MSR_AMD64_SNP_DEBUG_SWAP so
an SNP guest doesn't gracefully terminate during SNP feature negotiation
if MSR_AMD64_SEV_DEBUG_SWAP is enabled.
Since a guest is not expected to receive a #VC on DR7 accesses when
MSR_AMD64_SEV_DEBUG_SWAP is enabled, return an error from the #VC
handler in this situation.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230816022122.981998-1-aik@amd.com
There is infrastructure to rewrite return thunks to point to any
random thunk one desires, unwrap that from CALL_THUNKS, which up to
now was the sole user of that.
[ bp: Make the thunks visible on 32-bit and add ifdeffery for the
32-bit builds. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.775293785@infradead.org
vmlinux.o: warning: objtool: srso_untrain_ret() falls through to next function __x86_return_skl()
vmlinux.o: warning: objtool: __x86_return_thunk() falls through to next function __x86_return_skl()
This is because these functions (can) end with CALL, which objtool
does not consider a terminating instruction. Therefore, replace the
INT3 instruction (which is a non-fatal trap) with UD2 (which is a
fatal-trap).
This indicates execution will not continue past this point.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.637802730@infradead.org
Commit
fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
reimplemented __x86_return_thunk with a mix of SYM_FUNC_START and
SYM_CODE_END, this is not a sane combination.
Since nothing should ever actually 'CALL' this, make it consistently
CODE.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.571027074@infradead.org
The kprobes optimization check can_optimize() calls
insn_is_indirect_jump() to detect indirect jump instructions in
a target function. If any is found, creating an optprobe is disallowed
in the function because the jump could be from a jump table and could
potentially land in the middle of the target optprobe.
With retpolines, insn_is_indirect_jump() additionally looks for calls to
indirect thunks which the compiler potentially used to replace original
jumps. This extra check is however unnecessary because jump tables are
disabled when the kernel is built with retpolines. The same is currently
the case with IBT.
Based on this observation, remove the logic to look for calls to
indirect thunks and skip the check for indirect jumps altogether if the
kernel is built with retpolines or IBT. Remove subsequently the symbols
__indirect_thunk_start and __indirect_thunk_end which are no longer
needed.
Dropping this logic indirectly fixes a problem where the range
[__indirect_thunk_start, __indirect_thunk_end] wrongly included also the
return thunk. It caused that machines which used the return thunk as
a mitigation and didn't have it patched by any alternative ended up not
being able to use optprobes in any regular function.
Fixes: 0b53c374b9 ("x86/retpoline: Use -mfunction-return")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230711091952.27944-3-petr.pavlu@suse.com
The linker script arch/x86/kernel/vmlinux.lds.S matches the thunk
sections ".text.__x86.*" from arch/x86/lib/retpoline.S as follows:
.text {
[...]
TEXT_TEXT
[...]
__indirect_thunk_start = .;
*(.text.__x86.*)
__indirect_thunk_end = .;
[...]
}
Macro TEXT_TEXT references TEXT_MAIN which normally expands to only
".text". However, with CONFIG_LTO_CLANG, TEXT_MAIN becomes
".text .text.[0-9a-zA-Z_]*" which wrongly matches also the thunk
sections. The output layout is then different than expected. For
instance, the currently defined range [__indirect_thunk_start,
__indirect_thunk_end] becomes empty.
Prevent the problem by using ".." as the first separator, for example,
".text..__x86.indirect_thunk". This pattern is utilized by other
explicit section names which start with one of the standard prefixes,
such as ".text" or ".data", and that need to be individually selected in
the linker script.
[ nathan: Fix conflicts with SRSO and fold in fix issue brought up by
Andrew Cooper in post-review:
https://lore.kernel.org/20230803230323.1478869-1-andrew.cooper3@citrix.com ]
Fixes: dc5723b02e ("kbuild: add support for Clang LTO")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230711091952.27944-2-petr.pavlu@suse.com
Skip the srso cmd line parsing which is not needed on Zen1/2 with SMT
disabled and with the proper microcode applied (latter should be the
case anyway) as those are not affected.
Fixes: 5a15d83488 ("x86/srso: Tie SBPB bit setting to microcode patch detection")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230813104517.3346-1-bp@alien8.de
Initially, it was thought that doing an innocuous division in the #DE
handler would take care to prevent any leaking of old data from the
divider but by the time the fault is raised, the speculation has already
advanced too far and such data could already have been used by younger
operations.
Therefore, do the innocuous division on every exit to userspace so that
userspace doesn't see any potentially old data from integer divisions in
kernel space.
Do the same before VMRUN too, to protect host data from leaking into the
guest too.
Fixes: 77245f1c3c ("x86/CPU/AMD: Do not leak quotient data after a division by 0")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230811213824.10025-1-bp@alien8.de
Use LEA instead of ADD when adjusting %rsp in srso_safe_ret{,_alias}()
so as to avoid clobbering flags. Drop one of the INT3 instructions to
account for the LEA consuming one more byte than the ADD.
KVM's emulator makes indirect calls into a jump table of sorts, where
the destination of each call is a small blob of code that performs fast
emulation by executing the target instruction with fixed operands.
E.g. to emulate ADC, fastop() invokes adcb_al_dl():
adcb_al_dl:
<+0>: adc %dl,%al
<+2>: jmp <__x86_return_thunk>
A major motivation for doing fast emulation is to leverage the CPU to
handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
both an input and output to the target of the call. fastop() collects
the RFLAGS result by pushing RFLAGS onto the stack and popping them back
into a variable (held in %rdi in this case):
asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n"
<+71>: mov 0xc0(%r8),%rdx
<+78>: mov 0x100(%r8),%rcx
<+85>: push %rdi
<+86>: popf
<+87>: call *%rsi
<+89>: nop
<+90>: nop
<+91>: nop
<+92>: pushf
<+93>: pop %rdi
and then propagating the arithmetic flags into the vCPU's emulator state:
ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
<+64>: and $0xfffffffffffff72a,%r9
<+94>: and $0x8d5,%edi
<+109>: or %rdi,%r9
<+122>: mov %r9,0x10(%r8)
The failures can be most easily reproduced by running the "emulator"
test in KVM-Unit-Tests.
If you're feeling a bit of deja vu, see commit b63f20a778
("x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386").
In addition, this breaks booting of clang-compiled guest on
a gcc-compiled host where the host contains the %rsp-modifying SRSO
mitigations.
[ bp: Massage commit message, extend, remove addresses. ]
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Closes: https://lore.kernel.org/all/de474347-122d-54cd-eabf-9dcc95ab9eae@amd.com
Reported-by: Srikanth Aithal <sraithal@amd.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20230810013334.GA5354@dev-arch.thelio-3990X/
Link: https://lore.kernel.org/r/20230811155255.250835-1-seanjc@google.com
There is no reason to expose all of this globally. Move everything which is
not required outside of the microcode specific code to local header files
and into the respective source files.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.952876381@linutronix.de
Rename get_datasize() to intel_microcode_get_datasize() and make it an inline.
[ tglx: Make the argument typed and fix up the IFS code ]
Suggested-by: Boris Petkov <bp@alien8.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.894165745@linutronix.de
fe055896c0 ("x86/microcode: Merge the early microcode loader") left this
needlessly public. Git archaeology provided by Borislav.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.834943153@linutronix.de
Currently vendor specific headers are included explicitly when used in
common code. Instead, include the vendor specific headers in
microcode.h, and include that in all usages.
No functional change.
Suggested-by: Boris Petkov <bp@alien8.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.776541545@linutronix.de
There is really no point to have that in the CPUID evaluation code. Move it
into the Intel-specific microcode handling along with the data
structures, defines and helpers required by it. The exports need to stay
for IFS.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.719202319@linutronix.de
In reality CONFIG_MICROCODE is enabled in any reasonable configuration when
Intel or AMD support is enabled. Accommodate to reality.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.660453052@linutronix.de
The only remaining consumer is new_inode, where it showed up in 2001 as
commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
with a changelog which does not mention it.
Since then the line got only touched up to keep compiling.
While it may have been of benefit back in the day, it is guaranteed to
at best not get in the way in the multicore setting -- as the code
performs *a lot* of work between the prefetch and actual lock acquire,
any contention means the cacheline is already invalid by the time the
routine calls spin_lock(). It adds spurious traffic, for short.
On top of it prefetch is notoriously tricky to use for single-threaded
purposes, making it questionable from the get go.
As such, remove it.
I admit upfront I did not see value in benchmarking this change, but I
can do it if that is deemed appropriate.
Removal from new_inode and of the entire thing are in the same patch as
requested by Linus, so whatever weird looks can be directed at that guy.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
leads to an EFI config table ending up unmapped
- Use the correct segment selector in the 32-bit version of getcpu() in
the vDSO
- Make sure vDSO and VVAR regions are placed in the 47-bit VA range even
on 5-level paging systems
- Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTXVWMACgkQEsHwGGHe
VUrrpg//f0BFZV4dbANuYAX47A/hhrm8n5KwV9Mjq8JDeG/TumUN/vAK9+n8outO
pLpPOoXRSCSPybXBSmI1ryVwLaDq6BJy0fYyq4kHOvBFXYKodfTNdO8Oec+le3V/
DBGnY6TQ5x1PgWuAUE9WoeGQuTN1d3Dxm0V/pG2LU/qW3mr+GlBXsSKUVjwp9/OW
JPNw1XQDFVuxT+heRLxQONPkdTxkwwKxZBDRwnSSj7chbJ/jSbnX9a5xinwBvMZY
Q6nelt/AMwAgfO2oz38y1tnR0bfd8fM08SUUgoajWWghKemZK5uNAgZZJd4tPNq6
lBonNc6jF9UGohzgQbNOAjfmDomtyxe4JYGl2SnyWcfwFzGANpcSKSbM9H91LMaI
Sh7hKykZQNGmDctbLsxvlPFgIoxWFLK1DfNeM2yQRzlayqDF8CRlDIcWXHEtYOXf
AOZFMJtOJBZjLbSeWIBW278atNG3ONWEb75kqiKRYxsX9QzvwAZYCYZH+NGe6sLO
kkCm7g3NcAItf1qrPNGZd/k9fA/+3RXEWNdjYsmegMvU8vQBPY0w4NVwGtU9LCkq
jspQxnNlVy1ayqr/TQXRhzn5+d7CQ1PLNwVsGh4S+diCEFu2aEdOhrBhS1uaujLv
5iLpmyyh0yO9aHebK/u4cciAvwVB7WuzQqWYIGzdsolrc+lbY54=
=zPFr
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Do not parse the confidential computing blob on non-AMD hardware as
it leads to an EFI config table ending up unmapped
- Use the correct segment selector in the 32-bit version of getcpu() in
the vDSO
- Make sure vDSO and VVAR regions are placed in the 47-bit VA range
even on 5-level paging systems
- Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs
* tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h
x86/vdso: Choose the right GDT_ENTRY_CPUNODE for 32-bit getcpu() on 64-bit kernel
x86/sev: Do not try to parse for the CC blob on non-AMD hardware
to the SRSO and GDS mitigations code which got postponed to after the
embargo date
- Fix the SRSO aliasing addresses assertion so that the LLVM linker can
parse it too
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTXULMACgkQEsHwGGHe
VUqAKxAAlGm4YodsbEX+SQTVDLsECBw9lpJ3wU8lD0fBmIAmGiWFy9jqn4FYyAi8
YlKNI8Ru/mME5cM5BBV5ZZ5PNIZPD8OmqbbaHkSjcjLJfdNg03D/RDUPELSyY7T5
HviLaknJFjn6HwvLkSLdBIsAAkVqF5lYitP8x5OgBp6Lc9PO7/xWjSZoVhOUkrFe
Pjc8sT0DgtC2PWkxbB66/uxhdUnFqpioWL+06akeSWuHweIQDQ+P7sxfCB8NZO0u
YV4hmd/JfjoVc0DtvS3HOm14Ruhmru/oiKg/XcJO7uGPBKxuVK8xsHqeUyGMdTeS
+sNXA0XjbvaUV9IihuvVHrX8nMirkW7u0NWMNlJCO9QF5eJPfc0I07VLpKJGEsph
wKSNCN7F64GfjkRGl4jPo26tX+fXGMm32+gGgpqsCYnTBu+nrqprXck4DJQZBNl4
6Le7sfUky2PSllbFh5MnKaylfeWKcqlOzfko7tjWtFm7raOHEGy31m92igKms0hM
IlCyEe6mJUcMJ60QzYwaB9FJ+50jIZXeckRnud/mExgaAGQqe7RcVbwurQCCDtYq
vd4sb9TV9vU07Uqz1NBxmzl6GbYM1ORV9hnlpj/eDnh/ArBzj44UwiGB1bVQ31Iy
OMBJZ+RQtspa12xq7Zu++mjc+9XTeX9JK81PYg6UU+5ogQapdx4=
=P0vQ
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mitigation fixes from Borislav Petkov:
"The first set of fallout fixes after the embargo madness. There will
be another set next week too.
- A first series of cleanups/unifications and documentation
improvements to the SRSO and GDS mitigations code which got
postponed to after the embargo date
- Fix the SRSO aliasing addresses assertion so that the LLVM linker
can parse it too"
* tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: cpu: Fix the fallback cpu_show_gds() name
x86: Move gds_ucode_mitigated() declaration to header
x86/speculation: Add cpu_show_gds() prototype
driver core: cpu: Make cpu_show_not_affected() static
x86/srso: Fix build breakage with the LLVM linker
Documentation/srso: Document IBPB aspect and fix formatting
driver core: cpu: Unify redundant silly stubs
Documentation/hw-vuln: Unify filename specification in index
Commit
522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
provided a fix for the Zen2 VZEROUPPER data corruption bug affecting
a range of CPU models, but the AMD Custom APU 0405 found on SteamDeck
was not listed, although it is clearly affected by the vulnerability.
Add this CPU variant to the Zenbleed erratum list, in order to
unconditionally enable the fallback fix until a proper microcode update
is available.
Fixes: 522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230811203705.1699914-1-cristian.ciocaltea@collabora.com
Rework the handling of interrupt overrides on AMD Zen-based machines to
avoid recently introduced regressions (Hans de Goede).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTWgQMSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxaggP/iYft7/0saIW/gQEWUngTdwwYpDdL5c6
cnym9fO7dNvZ7svGgv1oQj3Nge79hWWWKTOVB5F3ZKPEmYHlhAekSYb+SI+/YUcp
fpIyVKO/y8nO+Cnw5t9EoTteJ65WJaadBNTuRUw28j4ETs/4AewWFFXSogzO2IIU
x4UoWPHEXhhjYTnYL+ivRAKJG5iTIbfQS16muO7p2z630FneUYGDvUxx0XuIvbf0
dXVjKi/rxgfozNyL8HmyeDE3LSjG1b7h7h6MS/ZrdM2LoE4V2pp/Z7qtxs53sIeG
Y+dsBZEp8G/YReuMPush31Sgepf6l4OQpjKkq54vRMD6su36IMojMYXayLWHmWK7
RP7n896takZ80tC7GH/pDJUUCoiyvE08/RUxnwg8QPCj5PCMmp2w0/DhhYi1FeFD
4QxYx7UBYGGFy94Bx97LvBVZXM484F8CJnQreXm7csniy0BME5MeV6S6YNjXsXJH
M+O0uwdlk24g4SH7E21jUQl6Ch+YtbcH0LRtivGnC6YkcItWkcNRccph3H27wNHH
jMR43tJsWp6yjXeyWw+vrML9VZ0aK3yyUKPAqcXhC4aIp8DSgVaVuLdRwXHmBqK5
Phx5+7zuxJ4UpOAQb1RtFOCgRAtfNN57QaXJ0hPuOuzopVOWPHjmcuw4fUdW5YCG
esjxuO9VAyQZ
=4kD6
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Rework the handling of interrupt overrides on AMD Zen-based machines
to avoid recently introduced regressions (Hans de Goede).
Note that this is intended as a short-term mitigation for 6.5 and the
long-term approach will be to attempt to use the configuration left by
the BIOS, but it requires more investigation"
* tag 'acpi-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: Add IRQ override quirk for PCSpecialist Elimina Pro 16 M
ACPI: resource: Honor MADT INT_SRC_OVR settings for IRQ1 on AMD Zen
ACPI: resource: Always use MADT override IRQ settings for all legacy non i8042 IRQs
ACPI: resource: revert "Remove "Zen" specific match and quirks"
The declaration got placed in the .c file of the caller, but that
causes a warning for the definition:
arch/x86/kernel/cpu/bugs.c:682:6: error: no previous prototype for 'gds_ucode_mitigated' [-Werror=missing-prototypes]
Move it to a header where both sides can observe it instead.
Fixes: 81ac7e5d74 ("KVM: Add GDS_NO support to KVM")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/all/20230809130530.1913368-2-arnd%40kernel.org
Add new PCI Device IDs required to support AMD's new Family 1Ah-based
models 00h-1Fh, 20h and 40h-4Fh.
[ bp: Zap a useless sentence. ]
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230809035244.2722455-2-avadhut.naik@amd.com
The assertion added to verify the difference in bits set of the
addresses of srso_untrain_ret_alias() and srso_safe_ret_alias() would fail
to link in LLVM's ld.lld linker with the following error:
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:210: at least one side of
the expression must be absolute
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:211: at least one side of
the expression must be absolute
Use ABSOLUTE to evaluate the expression referring to at least one of the
symbols so that LLD can evaluate the linker script.
Also, add linker version info to the comment about XOR being unsupported
in either ld.bfd or ld.lld until somewhat recently.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Closes: https://lore.kernel.org/llvm/CA+G9fYsdUeNu-gwbs0+T6XHi4hYYk=Y9725-wFhZ7gJMspLDRA@mail.gmail.com/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Daniel Kolesa <daniel@octaforge.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: Sven Volkinsfeld <thyrc@gmx.net>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://github.com/ClangBuiltLinux/linux/issues/1907
Link: https://lore.kernel.org/r/20230809-gds-v1-1-eaac90b0cbcc@google.com
Yingcong has noticed that on the 5-level paging machine, VDSO and VVAR
VMAs are placed above the 47-bit border:
8000001a9000-8000001ad000 r--p 00000000 00:00 0 [vvar]
8000001ad000-8000001af000 r-xp 00000000 00:00 0 [vdso]
This might confuse users who are not aware of 5-level paging and expect
all userspace addresses to be under the 47-bit border.
So far problem has only been triggered with ASLR disabled, although it
may also occur with ASLR enabled if the layout is randomized in a just
right way.
The problem happens due to custom placement for the VMAs in the VDSO
code: vdso_addr() tries to place them above the stack and checks the
result against TASK_SIZE_MAX, which is wrong. TASK_SIZE_MAX is set to
the 56-bit border on 5-level paging machines. Use DEFAULT_MAP_WINDOW
instead.
Fixes: b569bab78d ("x86/mm: Prepare to expose larger address space to userspace")
Reported-by: Yingcong Wu <yingcong.wu@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230803151609.22141-1-kirill.shutemov%40linux.intel.com
The Grand Ridge and Sierra Forest are successors to Snow Ridge. They
both have Crestmont core. From the core PMU's perspective, they are
similar to the e-core of MTL. The only difference is the LBR event
logging feature, which will be implemented in the following patches.
Create a non-hybrid PMU setup for Grand Ridge and Sierra Forest.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/r/20230522113040.2329924-1-kan.liang@linux.intel.com
Give the hybrid thingies their own section, appropriately between Core
and Atom.
Add the Raptor Lake uarch names.
Put Lunar Lake after Arrow Lake per interweb guidance.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20230807150405.828551866@infradead.org
Sierra Forest and Grand Ridge are both E-core only using Crestmont
micro-architecture, They fit the pre-existing naming scheme prefectly
fine, adhere to it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20230807150405.757666627@infradead.org
Alderlake N is an E-core only product using Gracemont
micro-architecture. It fits the pre-existing naming scheme perfectly
fine, adhere to it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org
On AMD Zen acpi_dev_irq_override() by default prefers the DSDT IRQ 1
settings over the MADT settings.
This causes the keyboard to malfunction on some laptop models
(see Links), all models from the Links have an INT_SRC_OVR MADT entry
for IRQ 1.
Fixes: a9c4a912b7 ("ACPI: resource: Remove "Zen" specific match and quirks")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217336
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217394
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217406
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Convert all the APIC callback inline wrappers from apic->foo() to
static_call(apic_call_foo)(), except for the safe_wait_icr_idle() one which
is only used during SMP bringup when sending INIT/SIPI. That really can do
the conditional callback. The regular wait_icr_idle() matters as it is used
in irq_work_raise(), so X2APIC machines spare the conditional.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Declare and define the static calls for the hotpath APIC callbacks. Note
this deliberately uses STATIC_CALL_NULL() because otherwise it would be
required to have the definitions in the 32bit and the 64bit default APIC
implementations and it's hard to keep the calls in sync. The other option
would be to have stub functions for each callback type. Not pretty either
So the NULL capable calls are used and filled in during early boot after
the static key infrastructure has been initialized. The calls will be
static_call() except for the wait_irc_idle() callback which is valid to be
NULL for X2APIC systems.
Update the calls when a new APIC driver is installed and when a callback
override is invoked.
Export the trampolines for the two calls which are used in KVM and MCE
error inject modules.
Test the setup and let the next step convert the inline wrappers to make it
effective.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Move them to one place so the static call conversion gets simpler.
No functional change.
[ dhansen: merge against recent x86/apic changes ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
There is no value for instrumentation to look at those wrappers and with the
upcoming conversion to static calls even less so.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Nothing can change it post init.
Remove the 32bit callbacks and comments as XENPV is strictly 64bit.
While at it mop up the whitespace damage which causes eyebleed due to an
editor which is highlighting it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Cc: Juergen Gross <jgross@suse.com>
Convert all places which just assign a new function directly to the apic
callback to use apic_update_callback() which prepares for using static
calls.
Mark snp_set_wakeup_secondary_cpu() and kvm_setup_pv_ipi() __init, as they
are only invoked from init code and otherwise trigger a section mismatch as
they are now invoking a __init function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Switch them over to apic_update_callback() and remove the code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
There are already two variants of update mechanism for particular callbacks
and virtualization just writes into the data structure.
Provide an interface and use a shadow data structure to preserve callbacks
so they can be reapplied when the APIC driver is replaced.
The extra data structure is intentional as any new callback needs to be
also updated in the core code. This also prepares for static calls.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Instead of setting the Xen PV apic driver very early during boot, just use
the standard apic driver probing by setting an appropriate
x86_init.irqs.intr_mode_init callback.
At the same time eliminate xen_apic_check() which has never been used.
The #ifdef CONFIG_X86_LOCAL_APIC around the call of xen_init_apic()
can be removed, too, as CONFIG_XEN depends on CONFIG_X86_LOCAL_APIC.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Link: https://lore.kernel.org/lkml/aa086365-fd02-210f-67c6-5c9175c0dfee@suse.com
In preparation for converting the hotpath APIC callbacks to static keys,
provide common initialization infrastructure.
Lift apic_install_drivers() from probe_64.c and convert all places which
switch the apic instance by storing the pointer to use apic_install_driver()
as a first step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Yet another wrapper of a wrapper gone along with the outdated comment
that this compiles to a single instruction.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Every callsite hands in the same constants which is a pointless exercise
and cannot be optimized by the compiler due to the indirect calls.
Use the constants in the eoi() callbacks and remove the arguments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
First of all apic_noop can't be probed because it's not registered. So
there is no point for implementing a probe callback. The machine is
rightfully to die when that is invoked.
Remove the gunk and tidy up the other space consuming dummy callbacks.
This gunk should simply die. Nothing should ever invoke APIC callbacks once
this is installed, But that's a differrent story for another round of
cleanups. The comment on top of this file which was intentionally left in
place tells exactly why this is needed: voodoo programming.
In fact the kernel of today should just outright refuse to boot on a system
with no (functional) local APIC. That would spare tons of #ifdeffery and
other nonsense.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Wasted space for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Now that everything has apic::max_apic_id set and the eventual update for
the x2APIC case is in place, switch the apic_id_valid() helper to use
apic::max_apic_id and remove the apic::apic_id_valid() callback.
[ dhansen: Fix subject typo ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
In order to remove the apic::apic_id_valid() callback and switch to
checking apic::max_apic_id, it is required to update apic::max_apic_id when
the APIC initialization code overrides it via x2apic_set_max_apicid().
Make the existing booleans a bitfield and add a flag which lets the update
function and the core code which switches the driver detect whether the
apic instance wants to have that update or not and apply it if required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Currently, x2apic_max_apicid==0 means that there is no max APIC id limit.
But, this means that 0 needs to be special-cased.
Designate UINT_MAX to mean unlimited so that a plain old less than or equal
compare works and there is no special-casing. Replace the 0 initialization
with UINT_MAX.
[ dhansen: muck with changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
There is really no point to have a callback which compares numbers.
Add a field which allows each APIC to store the maximum APIC ID supported
and fill it in for all APIC incarnations.
The next step will remove the callback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Prepare for removing the callback and making this as simple comparison to
an upper limit, which is the obvious solution to do for limit checks...
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
No need for the same thing twice. Also prepares for simplifying the APIC ID
validation checks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Yet more copy and pasta gone.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Why on earth makes a wrapper around some common function sense? Just to be
able to slap some vendor name on it...
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Move it next to apic_mem_wait_icr_idle(), rename it so that it's clear what
it does and rewrite it in readable form.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Remove tons of NOOP callbacks by making the invocation of
safe_wait_icr_idle() conditional in the inline wrapper.
Will be replaced by a static_call_cond() later.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Nuke more NOOP callbacks and make the invocation conditional. Will be
replaced with a static call later.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Two copies and also needlessly public. Move it into ipi.c so it can be
inlined. Rename it to apic_mem_wait_icr_idle().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Replace the undecodable comment on top of the function, replace the space
consuming zero content comments with useful ones and tidy up the
implementation to prevent further eye bleed.
Make __default_send_IPI_shortcut() static as it has no other users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Really not a hotpath and again no reason for having a gazillion of empty
callbacks returning 1. Make it return bool and provide one shared
implementation for the remaining users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
default_setup_apic_routing() is a complete misnomer. On 64bit it does the
actual APIC probing and on 32bit it is used to force select the bigsmp APIC
and to emit a redundant message in the apic::setup_apic_routing() callback.
Rename the 64bit and 32bit function so they reflect what they are doing and
remove the useless APIC callback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
The operation to set the IOAPIC ID in phys_id_present_map is as convoluted
as it can be.
1) Allocate a bitmap of 32byte size on the stack
2) Zero the bitmap and set the IOAPIC ID bit
3) Or the temporary bitmap over phys_id_present_map
The same functionality can be achieved by setting the IOAPIC ID bit
directly in the phys_id_present_map.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
This is only used on 32bit and is a wrapper around
physid_set_mask_of_physid() in all 32bit APIC drivers.
Remove the callback and use physid_set_mask_of_physid() in the code
directly,
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
apic::init_apic_ldr() is only invoked when the APIC is initialized. So
there is really no point in having:
- Default empty callbacks all over the place
- Two implementations of the actual LDR init function where one is
just unreadable gunk but does exactly the same as the other.
Make the apic::init_apic_ldr() invocation conditional, remove the empty
callbacks and consolidate the two implementation into one.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
It's a copy of default_cpu_present_to_apicid() with the omission of the
actual check whether the CPU is present.
This APIC callback should die completely, but the XEN APIC implementation
does something different which needs to be addressed first.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
If the system has more than 8 CPUs then XAPIC and the bigsmp APIC driver is
required. This is ensured via:
1) Enumerating all possible CPUs up to NR_CPUS
2) Checking at boot CPU APIC setup time whether the system has more than
8 CPUs and has an XAPIC.
If that's the case then it's attempted to install the bigsmp APIC
driver and a magic variable 'def_to_bigsmp' is set to one.
3) If that magic variable is set and CONFIG_X86_BIGSMP=n and the system
has more than 8 CPUs smp_sanity_check() removes all CPUs >= #8 from
the present and possible mask in the most convoluted way.
This logic is completely broken for the case where the bigsmp driver is
enabled, but not selected due to a command line option specifying the
default APIC. In that case the system boots with default APIC in logical
destination mode and fails to reduce the number of CPUs.
That aside the above which is sprinkled over 3 different places is yet
another piece of art.
It would have been too obvious to check the requirements upfront and limit
nr_cpu_ids _before_ enumerating tons of CPUs and then removing them again.
Implement exactly this. Check the bigsmp requirement when the boot APIC is
registered which happens _before_ ACPI/MPTABLE parsing and limit the number
of CPUs to 8 if it can't be used. Switch it over when the boot CPU apic is
set up if necessary.
[ dhansen: fix nr_cpu_ids off-by-one in default_setup_apic_routing() ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
On 32bit there is no APIC implementing the acpi_madt_oem_check() except XEN
PV, but that does not matter at all.
generic_apic_probe() runs before ACPI tables are parsed. This selects the
XEN APIC if there is no command line override because the XEN APIC driver
is the first to be probed.
If there is a command line override then the XEN PV driver won't be
selected in the MADT OEM check either.
As there is no other MADT check implemented for 32bit APICs, this whole
excercise is a NOOP and can be removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
UV X2APIC uses the per CPU variable from:
native_smp_prepare_cpus()
uv_system_init()
uv_system_init_hub()
which is long after the per CPU areas have been set up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Remove completely useless and mindlessly copied comments and tidy up the
code which causes eye bleed when looking at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
This per CPU variable is just yet another form of voodoo programming. The
boot ordering is:
per_cpu(x86_cpu_to_logical_apicid, cpu) = 1U << cpu;
.....
setup_apic()
apic->init_apic_ldr()
default_init_apic_ldr()
apic_write(SET_APIC_LOGICAL_ID(1UL << smp_processor_id(), APIC_LDR);
id = GET_APIC_LOGICAL_ID(apic_read(APIC_LDR);
WARN_ON(id != per_cpu(x86_cpu_to_logical_apicid, cpu));
per_cpu(x86_cpu_to_logical_apicid, cpu) = id;
So first write the default into LDR and then validate it against the same default
which was set up during early boot APIC enumeration.
Brilliant, isn't it?
The comment above the per CPU variable declaration describes it well:
'Let's keep it ugly for now.'
Remove the useless gunk and use '1U << cpu' consistently all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
apic::x86_32_early_logical_apicid() is yet another historical joke.
It is used to preset the x86_cpu_to_logical_apicid per CPU variable during
APIC enumeration with:
- 1 shifted left by the CPU number
- the physical APIC ID in case of bigsmp
The latter is hillarious because bigsmp uses physical destination mode
which never can use the logical APIC ID.
It gets even worse. As bigsmp can be enforced late in the boot process the
probe function overwrites the per CPU variable which is never used for this
APIC type once again.
Remove that gunk and store 1 << cpunr unconditionally if and only if the
CPU number is less than 8, because the default logical destination mode
only allows up to 8 CPUs.
This is just an intermediate step before removing the per CPU insanity
completely. Stay tuned.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
No need for an extra variable to find out whether the APIC has been mapped
or is accessible (X2APIC mode).
Provide an inline for this and check apic_mmio_base which is only set when
the local APIC has been mapped.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
The only silly usage site is gone. Remove the gunk which was even outright
wrong in the bigsmp_32 case which returned true unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
The boot CPUs local APIC is now always registered, so there is no point to
have another unreadable validatation for it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
num_processors is 0 by default and only gets incremented when local APICs
are registered.
Make init_apic_mappings(), which tries to enable the local APIC in the case
that no SMP configuration was found set num_processors to 1.
This allows to remove yet another check for the local APIC and yet another
place which registers the boot CPUs local APIC ID.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Unlike all other SMP configuration "parsers" XEN/PV does not set
smp_found_config which is inconsistent and prevents doing proper decision
logic based on this flag.
Make XEN/PV pretend that it found SMP configuration.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Convert places which just write mp_lapic_addr and let them register the
local APIC address directly instead of relying on magic other code to do
so.
Add a WARN_ON() into register_lapic_address() which is raised when
register_lapic_address() is invoked more than once during boot.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Split the fixmap setup out of register_lapic_address() and reuse it when
the X2APIC is disabled during setup.
This avoids registering the APIC ID (setting 'mp_lapic_addr') twice.
[ dhansen: changelog wording tweak ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Quite some APIC init functions are pure boolean, but use the success = 0,
fail < 0 model. That's confusing as hell when reading through the code.
Convert them to boolean.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
The device tree APIC parser tries to force-enable the local APIC when it is
not set in CPUID. apic_force_enable() registers the boot CPU apic on
success.
If that succeeds then dtb_lapic_setup() registers the local APIC again
eventually with a different address.
Rewrite the code so that it only registers it once.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
From: Dave Hansen <dave.hansen@linux.intel.com>
Some truly ancient code had different ways of calculating the 'apicid'
but it is long gone. Zap the unnecssary local variablee
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
This historical leftover is really uninteresting today. Whatever MPTABLE or
MADT delivers we only trust the hardware anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Register the boot CPU APIC right when the boot CPUs APIC is read from the
hardware. No point is doing this on random places and having wild
heuristics to save the boot CPU APIC ID slot and CPU number 0 reserved.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
boot_cpu_physical_apicid is written in random places and in the last
consequence filled with the APIC ID read from the local APIC. That causes
it to have inconsistent state when the MPTABLE is broken. As a consequence
tons of moronic checks are sprinkled all over the place.
Consolidate the code and read it exactly once when either X2APIC mode is
detected early or when the APIC mapping is established.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Put it to the other historical leftovers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
max_physical_apicid is assigned but never read.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
No point in having a wrapper around read_apic_id().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Another variable name which is confusing at best. Convert to bool.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
It reflects a state and not a command. Make it bool while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
It's not longer used outside the source file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
Under certain circumstances, an integer division by 0 which faults, can
leave stale quotient data from a previous division operation on Zen1
microarchitectures.
Do a dummy division 0/1 before returning from the #DE exception handler
in order to avoid any leaks of potentially sensitive data.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The BUILD_VDSO macro was incorrectly spelled as BULID_VDSO in
asm/linkage.h. This causes the !defined(BULID_VDSO) directive to always
evaluate to true.
Correct the spelling to BUILD_VDSO.
Fixes: bea75b3389 ("x86/Kconfig: Introduce function padding")
Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230808182353.76218-1-jinghao@linux.ibm.com
microcode_mutex is only used by reload_store(). It has a comment saying
"to synchronize with each other". Other user of this mutex have been
removed in the commits
181b6f40e9 ("x86/microcode: Rip out the OLD_INTERFACE").
b6f86689d5 ("x86/microcode: Rip out the subsys interface gunk")
The sysfs interface does not need additional synchronisation vs itself
because it is provided as kernfs_ops::mutex which is acquired in
kernfs_fop_write_iter().
Remove the superfluous microcode_mutex.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230804075853.JF_n6GXC@linutronix.de
The vDSO getcpu() reads CPU ID from the GDT_ENTRY_CPUNODE entry when the RDPID
instruction is not available. And GDT_ENTRY_CPUNODE is defined as 28 on 32-bit
Linux kernel and 15 on 64-bit. But the 32-bit getcpu() on 64-bit Linux kernel
is compiled with 32-bit Linux kernel GDT_ENTRY_CPUNODE, i.e., 28, beyond the
64-bit Linux kernel GDT limit. Thus, it just fails _silently_.
When BUILD_VDSO32_64 is defined, choose the 64-bit Linux kernel GDT definitions
to compile the 32-bit getcpu().
Fixes: 877cff5296 ("x86/vdso: Fake 32bit VDSO build on 64bit compile for vgetcpu")
Reported-by: kernel test robot <yujie.liu@intel.com>
Reported-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230322061758.10639-1-xin3.li@intel.com
Link: https://lore.kernel.org/oe-lkp/202303020903.b01fd1de-yujie.liu@intel.com
* Add Base GDS mitigation
* Support GDS_NO under KVM
* Fix a documentation typo
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTJh5YACgkQaDWVMHDJ
krAzAw/8DzjhAYEa7a1AodCBMNg8uNOPnLNoRPPNhaN5Iw6W3zXYDBDKT9PyjAIx
RoIM0aHx/oY9nCpK441o25oCWAAyzk6E5/+q9hMa7B4aHUGKqiDUC6L9dC8UiiSN
yvoBv4g7F81QnmyazwYI64S6vnbr4Cqe7K/mvVqQ/vbJiugD25zY8mflRV9YAuMk
Oe7Ff/mCA+I/kqyKhJE3cf3qNhZ61FsFI886fOSvIE7g4THKqo5eGPpIQxR4mXiU
Ri2JWffTaeHr2m0sAfFeLH4VTZxfAgBkNQUEWeG6f2kDGTEKibXFRsU4+zxjn3gl
xug+9jfnKN1ceKyNlVeJJZKAfr2TiyUtrlSE5d+subIRKKBaAGgnCQDasaFAluzd
aZkOYz30PCebhN+KTrR84FySHCaxnev04jqdtVGAQEDbTvyNagFUdZFGhWijJShV
l2l4A0gFSYJmPfPVuuAwOJnnZtA1sRH9oz/Sny3+z9BKloZh+Nc/+Cu9zC8SLjaU
BF3Qv2gU9HKTJ+MSy2JrGS52cONfpO5ngFHoOMilZ1KBHrfSb1eiy32PDT+vK60Y
PFEmI8SWl7bmrO1snVUCfGaHBsHJSu5KMqwBGmM4xSRzJpyvRe493xC7+nFvqNLY
vFOFc4jGeusOXgiLPpfGduppkTGcM7sy75UMLwTSLcQbDK99mus=
=ZAPY
-----END PGP SIGNATURE-----
Merge tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/gds fixes from Dave Hansen:
"Mitigate Gather Data Sampling issue:
- Add Base GDS mitigation
- Support GDS_NO under KVM
- Fix a documentation typo"
* tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Fix backwards on/off logic about YMM support
KVM: Add GDS_NO support to KVM
x86/speculation: Add Kconfig option for GDS
x86/speculation: Add force option to GDS mitigation
x86/speculation: Add Gather Data Sampling mitigation
vulnerability on AMD processors. In short, this is yet another issue
where userspace poisons a microarchitectural structure which can then be
used to leak privileged information through a side channel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTQs1gACgkQEsHwGGHe
VUo1UA/8C34PwJveZDcerdkaxSF+WKx7AjOI/L2ws1qn9YVFA3ItFMgVuFTrlY6c
1eYKYB3FS9fVN3KzGOXGyhho6seHqfY0+8cyYupR+PVLn9rSy7GqHaIMr37FdQ2z
yb9xu26v+gsvuPEApazS6MxijYS98u71rHhmg97qsHCnUiMJ01+TaGucntukNJv8
FfwjZJvgeUiBPQ/6IeA/O0413tPPJ9weawPyW+sV1w7NlXjaUVkNXwiq/Xxbt9uI
sWwMBjFHpSnhBRaDK8W5Blee/ZfsS6qhJ4jyEKUlGtsElMnZLPHbnrbpxxqA9gyE
K+3ZhoHf/W1hhvcZcALNoUHLx0CvVekn0o41urAhPfUutLIiwLQWVbApmuW80fgC
DhPedEFu7Wp6Okj5+Bqi/XOsOOWN2WRDSzdAq10o1C+e+fzmkr6y4E6gskfz1zXU
ssD9S4+uAJ5bccS5lck4zLffsaA03nAYTlvl1KRP4pOz5G9ln6eyO20ar1WwfGAV
o5ZsTJVGQMyVA49QFkksj+kOI3chkmDswPYyGn2y8OfqYXU4Ip4eN+VkjorIAo10
zIec3Z0bCGZ9UUMylUmdtH3KAm8q0wVNoFrUkMEmO8j6nn7ew2BhwLMn4uu+nOnw
lX2AG6PNhRLVDVaNgDsWMwejaDsitQPoWRuCIAZ0kQhbeYuwfpM=
=73JY
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/srso fixes from Borislav Petkov:
"Add a mitigation for the speculative RAS (Return Address Stack)
overflow vulnerability on AMD processors.
In short, this is yet another issue where userspace poisons a
microarchitectural structure which can then be used to leak privileged
information through a side channel"
* tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Tie SBPB bit setting to microcode patch detection
x86/srso: Add a forgotten NOENDBR annotation
x86/srso: Fix return thunks in generated code
x86/srso: Add IBPB on VMEXIT
x86/srso: Add IBPB
x86/srso: Add SRSO_NO support
x86/srso: Add IBPB_BRTYPE support
x86/srso: Add a Speculative RAS Overflow mitigation
x86/bugs: Increase the x86 bugs vector size to two u32s
The bare metal decompressor code was never really intended to run in a
hosted environment such as the EFI boot services, and does a few things
that are becoming problematic in the context of EFI boot now that the
logo requirements are getting tighter: EFI executables will no longer be
allowed to consist of a single executable section that is mapped with
read, write and execute permissions if they are intended for use in a
context where Secure Boot is enabled (and where Microsoft's set of
certificates is used, i.e., every x86 PC built to run Windows).
To avoid stepping on reserved memory before having inspected the E820
tables, and to ensure the correct placement when running a kernel build
that is non-relocatable, the bare metal decompressor moves its own
executable image to the end of the allocation that was reserved for it,
in order to perform the decompression in place. This means the region in
question requires both write and execute permissions, which either need
to be given upfront (which EFI will no longer permit), or need to be
applied on demand using the existing page fault handling framework.
However, the physical placement of the kernel is usually randomized
anyway, and even if it isn't, a dedicated decompression output buffer
can be allocated anywhere in memory using EFI APIs when still running in
the boot services, given that EFI support already implies a relocatable
kernel. This means that decompression in place is never necessary, nor
is moving the compressed image from one end to the other.
Since EFI already maps all of memory 1:1, it is also unnecessary to
create new page tables or handle page faults when decompressing the
kernel. That means there is also no need to replace the special
exception handlers for SEV. Generally, there is little need to do
any of the things that the decompressor does beyond
- initialize SEV encryption, if needed,
- perform the 4/5 level paging switch, if needed,
- decompress the kernel
- relocate the kernel
So do all of this from the EFI stub code, and avoid the bare metal
decompressor altogether.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-24-ardb@kernel.org
Before refactoring the EFI stub boot flow to avoid the legacy bare metal
decompressor, duplicate the SNP feature check in the EFI stub before
handing over to the kernel proper.
The SNP feature check can be performed while running under the EFI boot
services, which means it can force the boot to fail gracefully and
return an error to the bootloader if the loaded kernel does not
implement support for all the features that the hypervisor enabled.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-23-ardb@kernel.org
Factor out the decompressor sequence that invokes the decompressor,
parses the ELF and applies the relocations so that it can be called
directly from the EFI stub.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-21-ardb@kernel.org
It is no longer necessary to be cautious when referring to global
variables in the position independent decompressor code, now that it is
built using PIE codegen and makes an assertion in the linker script that
no GOT entries exist (which would require adjustment for the actual
runtime load address of the decompressor binary).
This means global variables can be referenced directly from C code,
instead of having to pass their runtime addresses into C routines from
asm code, which needs to happen at each call site. Do so for the code
that will be called directly from the EFI stub after a subsequent patch,
and avoid the need to duplicate this logic a third time.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-20-ardb@kernel.org
Now that the trampoline setup code and the actual invocation of it are
all done from the C routine, the trampoline cleanup can be merged into
it as well, instead of returning to asm just to call another C function.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-16-ardb@kernel.org
The only remaining use of the trampoline address by the trampoline
itself is deriving the page table address from it, and this involves
adding an offset of 0x0. So simplify this, and pass the new CR3 value
directly.
This makes the fact that the page table happens to be at the start of
the trampoline allocation an implementation detail of the caller.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-15-ardb@kernel.org
Since the current and desired number of paging levels are known when the
trampoline is being prepared, avoid calling the trampoline at all if it
is clear that calling it is not going to result in a change to the
number of paging levels.
Given that the CPU is already running in long mode, the PAE and LA57
settings are necessarily consistent with the currently active page
tables, and other fields in CR4 will be initialized by the startup code
in the kernel proper. So limit the manipulation of CR4 to toggling the
LA57 bit, which is the only thing that really needs doing at this point
in the boot. This also means that there is no need to pass the value of
l5_required to toggle_la57(), as it will not be called unless CR4.LA57
needs to toggle.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-14-ardb@kernel.org
Instead of returning to the asm calling code to invoke the trampoline,
call it straight from the C code that sets it up. That way, the struct
return type is no longer needed for returning two values, and the call
can be made conditional more cleanly in a subsequent patch.
This means that all callee save 64-bit registers need to be preserved
and restored, as their contents may not survive the legacy mode switch.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-13-ardb@kernel.org
The 32-bit trampoline no longer uses the stack for anything except
performing a far return back to long mode, and preserving the caller's
stack pointer value. Currently, the trampoline stack is placed in the
same page that carries the trampoline code, which means this page must
be mapped writable and executable, and the stack is therefore executable
as well.
Replace the far return with a far jump, so that the return address can
be pre-calculated and patched into the code before it is called. This
removes the need for a 32-bit addressable stack entirely, and in a later
patch, this will be taken advantage of by removing writable permissions
from (and adding executable permissions to) the trampoline code page
when booting via the EFI stub.
Note that the value of RSP still needs to be preserved explicitly across
the switch into 32-bit mode, as the register may get truncated to 32
bits.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-12-ardb@kernel.org
Update the trampoline code so its arguments are passed via RDI and RSI,
which matches the ordinary SysV calling convention for x86_64. This will
allow this code to be called directly from C.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-11-ardb@kernel.org
Move the long return to switch to 32-bit mode into the trampoline code
so it can be called as an ordinary function. This will allow it to be
called directly from C code in a subsequent patch.
While at it, reorganize the code somewhat to keep the prologue and
epilogue of the function together, making the code a bit easier to
follow. Also, given that the trampoline is now entered in 64-bit mode, a
simple RIP-relative reference can be used to take the address of the
exit point.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230807162720.545787-10-ardb@kernel.org
There is no need to defer the assignment of the paging related global
variables 'pgdir_shift' and 'ptrs_per_p4d' until after the trampoline is
cleaned up, so assign them as soon as it is clear that 5-level paging
will be enabled.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-9-ardb@kernel.org
Instead of pushing and popping %RSI several times to preserve the struct
boot_params pointer across the execution of the startup code, move it
into a callee save register before the first call into C, and copy it
back when needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-8-ardb@kernel.org
The so-called EFI handover protocol is value-add from the distros that
permits a loader to simply copy a PE kernel image into memory and call
an alternative entrypoint that is described by an embedded boot_params
structure.
Most implementations of this protocol do not bother to check the PE
header for minimum alignment, section placement, etc, and therefore also
don't clear the image's BSS, or even allocate enough memory for it.
Allocating more memory on the fly is rather difficult, but at least
clear the BSS region explicitly when entering in this manner, so that
the EFI stub code does not get confused by global variables that were
not zero-initialized correctly.
When booting in mixed mode, this BSS clearing must occur before any
global state is created, so clear it in the 32-bit asm entry point.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-7-ardb@kernel.org
The native 32-bit or 64-bit EFI handover protocol entrypoint offset
relative to the respective startup_32/64 address is described in
boot_params as handover_offset, so that the special Linux/x86 aware EFI
loader can find it there.
When mixed mode is enabled, this single field has to describe this
offset for both the 32-bit and 64-bit entrypoints, so their respective
relative offsets have to be identical. Given that startup_32 and
startup_64 are 0x200 bytes apart, and the EFI handover entrypoint
resides at a fixed offset, the 32-bit and 64-bit versions of those
entrypoints must be exactly 0x200 bytes apart as well.
Currently, hard-coded fixed offsets are used to ensure this, but it is
sufficient to emit the 64-bit entrypoint 0x200 bytes after the 32-bit
one, wherever it happens to reside. This allows this code (which is now
EFI mixed mode specific) to be moved into efi_mixed.S and out of the
startup code in head_64.S.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-6-ardb@kernel.org
Now that the EFI entry code in assembler is only used by the optional
and deprecated EFI handover protocol, and given that the EFI stub C code
no longer returns to it, most of it can simply be dropped.
While at it, clarify the symbol naming, by merging efi_main() and
efi_stub_entry(), making the latter the shared entry point for all
different boot modes that enter via the EFI stub.
The efi32_stub_entry() and efi64_stub_entry() names are referenced
explicitly by the tooling that populates the setup header, so these must
be retained, but can be emitted as aliases of efi_stub_entry() where
appropriate.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-5-ardb@kernel.org
Instead of pushing/popping %RSI to/from the stack every time a function
is called from startup_64(), store it in a callee preserved register
and grab it from there when its value is actually needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-3-ardb@kernel.org
* Fix SEV race condition
ARM:
* Fixes for the configuration of SVE/SME traps when hVHE mode is in use
* Allow use of pKVM on systems with FF-A implementations that are v1.0
compatible
* Request/release percpu IRQs (arch timer, vGIC maintenance) correctly
when pKVM is in use
* Fix function prototype after __kvm_host_psci_cpu_entry() rename
* Skip to the next instruction when emulating writes to TCR_EL1 on
AmpereOne systems
Selftests:
* Fix missing include
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmTQ7zsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMaZwf+LCD+U/Z5W9o9BLfn0gq/mLS0EPJe
+aa+AQvh1q0rQVFY8cgglGbpF3L1KGRWTEPNX2izJVOAmOzVwVjxlXj47fMhcwao
RzFFQ8GIjZGjP+lJ4zTtUzlDSNNDQqeG+Ji2GoWvSZYE6HDmSPv6CYOsUkmp3T6V
nEST2lCHY+lVEp62Y3YS+QcVEj6qsXDF21W4OxEPM9OWATj34IQTYmhCbbqzalgD
7D08nIdUtzk3JyiiG52XKACfSpWJMg3W78Kt6noX6be89SAvr2cw14X0sqZP6lID
akN6rByBZrSBaaj9TJQiEXSK5Ff/TphdxbDG4uDfOf8nzy2+QrKOXJ1Q7w==
=zBPg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86:
- Fix SEV race condition
ARM:
- Fixes for the configuration of SVE/SME traps when hVHE mode is in
use
- Allow use of pKVM on systems with FF-A implementations that are
v1.0 compatible
- Request/release percpu IRQs (arch timer, vGIC maintenance)
correctly when pKVM is in use
- Fix function prototype after __kvm_host_psci_cpu_entry() rename
- Skip to the next instruction when emulating writes to TCR_EL1 on
AmpereOne systems
Selftests:
- Fix missing include"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
selftests/rseq: Fix build with undefined __weak
KVM: SEV: remove ghcb variable declarations
KVM: SEV: only access GHCB fields once
KVM: SEV: snapshot the GHCB before accessing it
KVM: arm64: Skip instruction after emulating write to TCR_EL1
KVM: arm64: fix __kvm_host_psci_cpu_entry() prototype
KVM: arm64: Fix resetting SME trap values on reset for (h)VHE
KVM: arm64: Fix resetting SVE trap values on reset for hVHE
KVM: arm64: Use the appropriate feature trap register when activating traps
KVM: arm64: Helper to write to appropriate feature trap register based on mode
KVM: arm64: Disable SME traps for (h)VHE at setup
KVM: arm64: Use the appropriate feature trap register for SVE at EL2 setup
KVM: arm64: Factor out code for checking (h)VHE mode into a macro
KVM: arm64: Rephrase percpu enable/disable tracking in terms of hyp
KVM: arm64: Fix hardware enable/disable flows for pKVM
KVM: arm64: Allow pKVM on v1.0 compatible FF-A implementations
The 4-to-5 level mode switch trampoline disables long mode and paging in
order to be able to flick the LA57 bit. According to section 3.4.1.1 of
the x86 architecture manual [0], 64-bit GPRs might not retain the upper
32 bits of their contents across such a mode switch.
Given that RBP, RBX and RSI are live at this point, preserve them on the
stack, along with the return address that might be above 4G as well.
[0] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture
"Because the upper 32 bits of 64-bit general-purpose registers are
undefined in 32-bit modes, the upper 32 bits of any general-purpose
register are not preserved when switching from 64-bit mode to a 32-bit
mode (to protected mode or compatibility mode). Software must not
depend on these bits to maintain a value after a 64-bit to 32-bit
mode switch."
Fixes: 194a9749c7 ("x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230807162720.545787-2-ardb@kernel.org
Tao Liu reported a boot hang on an Intel Atom machine due to an unmapped
EFI config table. The reason being that the CC blob which contains the
CPUID page for AMD SNP guests is parsed for before even checking
whether the machine runs on AMD hardware.
Usually that's not a problem on !AMD hw - it simply won't find the CC
blob's GUID and return. However, if any parts of the config table
pointers array is not mapped, the kernel will #PF very early in the
decompressor stage without any opportunity to recover.
Therefore, do a superficial CPUID check before poking for the CC blob.
This will fix the current issue on real hardware. It would also work as
a guest on a non-lying hypervisor.
For the lying hypervisor, the check is done again, *after* parsing the
CC blob as the real CPUID page will be present then.
Clear the #VC handler in case SEV-{ES,SNP} hasn't been detected, as
a precaution.
Fixes: c01fce9cef ("x86/compressed: Add SEV-SNP feature detection/setup")
Reported-by: Tao Liu <ltao@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tao Liu <ltao@redhat.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230601072043.24439-1-ltao@redhat.com
The SBPB bit in MSR_IA32_PRED_CMD is supported only after a microcode
patch has been applied so set X86_FEATURE_SBPB only then. Otherwise,
guests would attempt to set that bit and #GP on the MSR write.
While at it, make SMT detection more robust as some guests - depending
on how and what CPUID leafs their report - lead to cpu_smt_control
getting set to CPU_SMT_NOT_SUPPORTED but SRSO_NO should be set for any
guest incarnation where one simply cannot do SMT, for whatever reason.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The left overs of a moved interrupt are cleaned up once the interrupt is
raised on the new target CPU. Keeping the vector valid on the original
target CPU guarantees that there can't be an interrupt lost if the affinity
change races with an concurrent interrupt from the device.
This cleanup utilizes the lowest priority interrupt vector for this
cleanup, which makes sure that in the unlikely case when the to be cleaned
up interrupt is pending in the local APICs IRR the cleanup vector does not
live lock.
But there is no real reason to use an interrupt vector for cleaning up the
leftovers of a moved interrupt. It's not a high performance operation. The
only requirement is that it happens on the original target CPU.
Convert it to use a timer instead and adjust the code accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230621171248.6805-3-xin3.li@intel.com
Rename send_cleanup_vector() to vector_schedule_cleanup() to prepare for
replacing the vector cleanup IPI with a timer callback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20230621171248.6805-2-xin3.li@intel.com
Now that the driver core allows for struct class to be in read-only
memory, move the pseudo_lock_class structure to be declared at build
time placing it into read-only memory, instead of having to be
dynamically allocated at boot time.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230620144431.583290-6-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that the driver core allows for struct class to be in read-only
memory, move the msr_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Link: https://lore.kernel.org/r/20230620144431.583290-5-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that the driver core allows for struct class to be in read-only
memory, move the cpuid_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Link: https://lore.kernel.org/r/20230620144431.583290-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmTNh9UTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXkhFCACvhcb/scp31lcBjaGb8AWogejPBFfW
sb38M9X0E50omXoWahYXCMe2Dot7sgxEPuCr7PgwMBGj7Mbdy+kJg9OgdQc7mvks
XVENJmfe5H4pyNw69eVjGz/ceNP+GzpTEO0ut+suCxa8+JiBDNnTfLd1W/UYNCJ8
JOGauWnvkzA5B6/lgAB6JSldDTijKQr1NQGYGZOO6LMxVSUYb/9jFRceyEaGRL/S
HQdQU3FVcBYK2R2bqp4KCmQWFF352szmYuKMioMhpXQZTo868/llLPVarIwL/fUA
2u7NLLmza0N+DR3esZnKmMASa8wuNJgxV1Ros3nRHBa50xsTwfbvICjW
=MaTo
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20230804' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Fix a bug in a python script for Hyper-V (Ani Sinha)
- Workaround a bug in Hyper-V when IBT is enabled (Michael Kelley)
- Fix an issue parsing MP table when Linux runs in VTL2 (Saurabh
Sengar)
- Several cleanup patches (Nischala Yelchuri, Kameron Carr, YueHaibing,
ZhiHu)
* tag 'hyperv-fixes-signed-20230804' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Remove unused extern declaration vmbus_ontimer()
x86/hyperv: add noop functions to x86_init mpparse functions
vmbus_testing: fix wrong python syntax for integer value comparison
x86/hyperv: fix a warning in mshyperv.h
x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
x86/hyperv: Improve code for referencing hyperv_pcpu_input_arg
Drivers: hv: Change hv_free_hyperv_page() to take void * argument
To avoid possible time-of-check/time-of-use issues, the GHCB should
almost never be accessed outside dump_ghcb, sev_es_sync_to_ghcb
and sev_es_sync_from_ghcb. The only legitimate uses are to set the
exitinfo fields and to find the address of the scratch area embedded
in the ghcb. Accessing ghcb_usage also goes through svm->sev_es.ghcb
in sev_es_validate_vmgexit(), but that is because anyway the value is
not used.
Removing a shortcut variable that contains the value of svm->sev_es.ghcb
makes these cases a bit more verbose, but it limits the chance of someone
reading the ghcb by mistake.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger
a double fetch race condition vulnerability and invoke the VMGEXIT
handler recursively.
sev_handle_vmgexit() maps the GHCB page using kvm_vcpu_map() and then
fetches the exit code using ghcb_get_sw_exit_code(). Soon after,
sev_es_validate_vmgexit() fetches the exit code again. Since the GHCB
page is shared with the guest, the guest is able to quickly swap the
values with another vCPU and hence bypass the validation. One vmexit code
that can be rejected by sev_es_validate_vmgexit() is SVM_EXIT_VMGEXIT;
if sev_handle_vmgexit() observes it in the second fetch, the call
to svm_invoke_exit_handler() will invoke sev_handle_vmgexit() again
recursively.
To avoid the race, always fetch the GHCB data from the places where
sev_es_sync_from_ghcb stores it.
Exploiting recursions on linux kernel has been proven feasible
in the past, but the impact is mitigated by stack guard pages
(CONFIG_VMAP_STACK). Still, if an attacker manages to call the handler
multiple times, they can theoretically trigger a stack overflow and
cause a denial-of-service, or potentially guest-to-host escape in kernel
configurations without stack guard pages.
Note that winning the race reliably in every iteration is very tricky
due to the very tight window of the fetches; depending on the compiler
settings, they are often consecutive because of optimization and inlining.
Tested by booting an SEV-ES RHEL9 guest.
Fixes: CVE-2023-4155
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Reported-by: Andy Nguyen <theflow@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Validation of the GHCB is susceptible to time-of-check/time-of-use vulnerabilities.
To avoid them, we would like to always snapshot the fields that are read in
sev_es_validate_vmgexit(), and not use the GHCB anymore after it returns.
This means:
- invoking sev_es_sync_from_ghcb() before any GHCB access, including before
sev_es_validate_vmgexit()
- snapshotting all fields including the valid bitmap and the sw_scratch field,
which are currently not caching anywhere.
The valid bitmap is the first thing to be copied out of the GHCB; then,
further accesses will use the copy in svm->sev_es.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip writes to MSR_AMD64_TSC_RATIO that are done in the context of a vCPU
if guest state isn't loaded, i.e. if KVM will update MSR_AMD64_TSC_RATIO
during svm_prepare_switch_to_guest() before entering the guest. Checking
guest_state_loaded may or may not be a net positive for performance as
the current_tsc_ratio cache will optimize away duplicate WRMSRs in the
vast majority of scenarios. However, the cost of the check is negligible,
and the real motivation is to document that KVM needs to load the vCPU's
value only when running the vCPU.
Link: https://lore.kernel.org/r/20230729011608.1065019-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure. The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.
Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies. Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.
This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.
Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly disable preemption when writing MSR_AMD64_TSC_RATIO only in the
"outer" helper, as all direct callers of the "inner" helper now run with
preemption already disabled. And that isn't a coincidence, as the outer
helper requires a vCPU and is intended to be used when modifying guest
state and/or emulating guest instructions, which are typically done with
preemption enabled.
Direct use of the inner helper should be extremely limited, as the only
time KVM should modify MSR_AMD64_TSC_RATIO without a vCPU is when
sanitizing the MSR for a specific pCPU (currently done when {en,dis}abling
disabling SVM). The other direct caller is svm_prepare_switch_to_guest(),
which does have a vCPU, but is a one-off special case: KVM is about to
enter the guest on a specific pCPU and thus must have preemption disabled.
Link: https://lore.kernel.org/r/20230729011608.1065019-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When emulating nested SVM transitions, use the outer helper for writing
the TSC multiplier for L2. Using the inner helper only for one-off cases,
i.e. for paths where KVM is NOT emulating or modifying vCPU state, will
allow for multiple cleanups:
- Explicitly disabling preemption only in the outer helper
- Getting the multiplier from the vCPU field in the outer helper
- Skipping the WRMSR in the outer helper if guest state isn't loaded
Opportunistically delete an extra newline.
No functional change intended.
Link: https://lore.kernel.org/r/20230729011608.1065019-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired
ratio doesn't match the current ratio, not if the ratio L1 is using for
L2 diverges from the default. Functionally, the end result is the same
as KVM will run L2 with L1's multiplier if L2's multiplier is the default,
i.e. checking that L1's multiplier is loaded is equivalent to checking if
L2 has a non-default multiplier.
However, the assertion that TSC scaling is exposed to L1 is flawed, as
userspace can trigger the WARN at will by writing the MSR and then
updating guest CPUID to hide the feature (modifying guest CPUID is
allowed anytime before KVM_RUN). E.g. hacking KVM's state_test
selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105
nested_svm_vmexit+0x6af/0x720 [kvm_amd]
Call Trace:
nested_svm_exit_handled+0x102/0x1f0 [kvm_amd]
svm_handle_exit+0xb9/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
? trace_hardirqs_off+0x4d/0xa0
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check
into the if-statement is wrong as KVM needs to ensure L1's multiplier is
loaded in the above scenario. Alternatively, the WARN_ON() could simply
be deleted, but that would make KVM's behavior even more subtle, e.g. it's
not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when
checking only tsc_ratio_msr.
Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Check for nested TSC scaling support on nested SVM VMRUN instead of
asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO
has diverged from KVM's default. Userspace can trigger the WARN at will
by writing the MSR and then updating guest CPUID to hide the feature
(modifying guest CPUID is allowed anytime before KVM_RUN). E.g. hacking
KVM's state_test selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699
nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd]
Call Trace:
<TASK>
enter_svm_guest_mode+0x114/0x560 [kvm_amd]
nested_svm_vmrun+0x260/0x330 [kvm_amd]
vmrun_interception+0x29/0x30 [kvm_amd]
svm_invoke_exit_handler+0x35/0x100 [kvm_amd]
svm_handle_exit+0xe7/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x45ca1b
Note, the nested #VMEXIT path has the same flaw, but needs a different
fix and will be handled separately.
Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>