- Shut down the per CPU thermal throttling poll work properly when a CPU
goes offline. The missing shutdown caused the poll work to be migrated
to a unbound worker which triggered warnings about the usage of
smp_processor_id() in preemptible context
- Fix the PPIN feature initialization which missed to enable the
functionality when PPIN_CTL was enabled but the MSR locked against
updates.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uRHUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVVGD/0WEjZoB8yhwez6u0YNFhUkjfP8JFC1
mGdWMoevyH3Tb+DQNX3cW95t2O7IxP0N6OUNnYYQ9Tlqwt6r0ptJpNnXO7CV2+Jh
5lxpw/Uv2kQv69BNDK9qPDhiIBPzZQCg/utDTVdIyG0y+XU0q/IZqXh+XedAJsVr
P3U7KC//NwTYnlpPWjDsG26GHSguV4kj+Lwi88nfh1DJ7eawb8AF4k965pLmOoF9
g13EFxv2FW1/uq+QJq5ophQIH/pPI/T67rhIyLWxFsCByBzVKjm4BBgXH4gb+QIn
OofVQcaWCpZCOq2ZTNfHWdPvJK2ziig9w+twbArb7Cb9aOgp3Oe1zbp2VD4nKu4+
0G5E2Vdv6qRrEIk5LUTqlyOIogd5xPSufaCGF/HC/qXqBxqwWD0tUvjtYyRwwy+Y
u90bo90zlMjUoDirgtZrjYe0bXuy3xJ+FxZ5OxovGRxLn4qqBqEJGrXYvB0LIlpd
3x+YeHB4T2pwC6Ya5Odi6RKhwMKpro24dDMJ9jIR1u/NwIgJ2elSO9bsw6SZ823e
/Mwns7CC/7xtjOCJXPlyj4Uw0TzwTbp1W9Kb0OqJo6q+ntvxbAhoMf32FDxg0OKC
h4trc3FZt+e2a0l8R8e3nNeAvnS0fM1P4vtg18EcX8SqlSoALJkS3XUO4WeCFLBh
F9jOt/LSf+4okQ==
=9W7B
-----END PGP SIGNATURE-----
Merge tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two RAS related fixes:
- Shut down the per CPU thermal throttling poll work properly when a
CPU goes offline.
The missing shutdown caused the poll work to be migrated to a
unbound worker which triggered warnings about the usage of
smp_processor_id() in preemptible context
- Fix the PPIN feature initialization which missed to enable the
functionality when PPIN_CTL was enabled but the MSR locked against
updates"
* tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix logic and comments around MSR_PPIN_CTL
x86/mce/therm_throt: Undo thermal polling properly on CPU offline
Family 19h CPUs are Zen-based and still share most architectural
features with Family 17h CPUs, and therefore still need to call
init_amd_zn() e.g., to set the RECLAIM_DISTANCE override.
init_amd_zn() also sets X86_FEATURE_ZEN, which today is only used
in amd_set_core_ssb_state(), which isn't called on some late
model Family 17h CPUs, nor on any Family 19h CPUs:
X86_FEATURE_AMD_SSBD replaces X86_FEATURE_LS_CFG_SSBD on those
later model CPUs, where the SSBD mitigation is done via the
SPEC_CTRL MSR instead of the LS_CFG MSR.
Family 19h CPUs also don't have the erratum where the CPB feature
bit isn't set, but that code can stay unchanged and run safely
on Family 19h.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200311191451.13221-1-kim.phillips@amd.com
We have had a hard coded limit of 32 machine check records since the
dawn of time. But as numbers of cores increase, it is possible for
more than 32 errors to be reported before a user process reads from
/dev/mcelog. In this case the additional errors are lost.
Keep 32 as the minimum. But tune the maximum value up based on the
number of processors.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200218184408.GA23048@agluck-desk2.amr.corp.intel.com
There are two implemented bits in the PPIN_CTL MSR:
Bit 0: LockOut (R/WO)
Set 1 to prevent further writes to MSR_PPIN_CTL.
Bit 1: Enable_PPIN (R/W)
If 1, enables MSR_PPIN to be accessible using RDMSR.
If 0, an attempt to read MSR_PPIN will cause #GP.
So there are four defined values:
0: PPIN is disabled, PPIN_CTL may be updated
1: PPIN is disabled. PPIN_CTL is locked against updates
2: PPIN is enabled. PPIN_CTL may be updated
3: PPIN is enabled. PPIN_CTL is locked against updates
Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2".
When it should have done so for both case "2" and case "3".
Fix the final test to just check for the enable bit. Also fix some of
the other comments in this function.
Fixes: 3f5a7896a5 ("x86/mce: Include the PPIN in MCE records when available")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling
get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE.
Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap()
changes that were made between this_cpu->c_init() and setup_pku(), as
all non-synthetic feature words are reinitialized from the CPU's CPUID
values.
Blasting away capability updates manifests most visibility when running
on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that
VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using
clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report
which CPU is misconfigured (KVM needs to probe every CPU anyways).
Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled,
ultimately leading to an unexpected #GP when KVM attempts to do VMXON.
Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let
KVM figure out a different way to report the misconfigured CPU, but VMX
is not the only feature bit that is affected, i.e. there is precedent
that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init()
is expected to work. Most notably, x86_init_rdrand()'s clearing of
X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten.
Fixes: 0697694564 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU")
Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com
Remove the pointless difference between 32 and 64 bit to make further
unifications simpler.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.428188397@linutronix.de
do_machine_check() can be raised in almost any context including the most
fragile ones. Prevent kprobes and tracing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.315548935@linutronix.de
Chris Wilson reported splats from running the thermal throttling
workqueue callback on offlined CPUs. The problem is that that callback
should not even run on offlined CPUs but it happens nevertheless because
the offlining callback thermal_throttle_offline() does not symmetrically
undo the setup work done in its onlining counterpart. IOW,
1. The thermal interrupt vector should be masked out before ...
2. ... cancelling any pending work synchronously so that no new work is
enqueued anymore.
Do those things and fix the issue properly.
[ bp: Write commit message. ]
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/158120068234.18291.7938335950259651295@skylake-alporthouse-com
- Populate the per CPU MCA bank descriptor pointer only after it has been
completely set up to prevent a use-after-free in case that one of the
subsequent initialization step fails
- Implement a proper release function for the sysfs entries of MCA
threshold controls instead of freeing the memory right in the CPU
teardown code, which leads to another use-after-free when the
associated sysfs file is opened and accessed.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5RkwATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoY6ED/4oafN9DmeY18oUv1QpoMQQMa6iduz6
udemhpjqVXO7R1Ste4ccM/fIQ8sMjf58isuxUClcDwrX3fxv+lusK2iVPEw99Vpo
w1xSjdXNW0KSSiQko9oMHVu+xcXIt8vxpL4YyjEuR81rcoecFaq2c2KhLMAW4o0p
3mEv7/QYPfpKc4ydcbcHo2JF6U1sfUpsWpoe/SxpXRpxeoy64baCWZGbcsUXqjB6
3MRxxy+ypKKKPPUM1py4D/ViDXwkhhP+gMD4ljWXCprpul/KuXAMEgvW39MtVsBJ
uMF3PMXqjKx+WY492tpxtdZjWej+X13ID/cTc2w1EBHz30Qxmc6RieTKi6FzsJYB
PKsTWdGarzORioaBg51Riq27C3+fjHbe6WqkhIQzmenSIwiV1o6o4IyuOs5sdlxX
rjIk/ssNeAxRpCy308i6Vaq98PBZqAY1/iUZN50vAzldH3bwKxobowjn+AYStA0c
9BF5zw7/3oXB4WaByuBwJ3DzWjqiXM4EUPu7LYF9DVSvj+A2xOmhwN+uz3SK6hBk
vkxiFE50Lo2qoDaATJozY8+nxgUKRNiDdz+udhVsoQxNKWUMxirsH18TFu8yBl2r
HGKsfCBY4CnV64WRy5IKQsqt3EhAgAUUoD0jSy7P3xf4HwSKAn/9OZ1cWQAo1wzQ
xnXUtRDFc7ScHg==
=2f84
-----END PGP SIGNATURE-----
Merge tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two fixes for the AMD MCE driver:
- Populate the per CPU MCA bank descriptor pointer only after it has
been completely set up to prevent a use-after-free in case that one
of the subsequent initialization step fails
- Implement a proper release function for the sysfs entries of MCA
threshold controls instead of freeing the memory right in the CPU
teardown code, which leads to another use-after-free when the
associated sysfs file is opened and accessed"
* tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/amd: Fix kobject lifetime
x86/mce/amd: Publish the bank pointer only after setup has succeeded
A split-lock occurs when an atomic instruction operates on data that spans
two cache lines. In order to maintain atomicity the core takes a global bus
lock.
This is typically >1000 cycles slower than an atomic operation within a
cache line. It also disrupts performance on other cores (which must wait
for the bus lock to be released before their memory operations can
complete). For real-time systems this may mean missing deadlines. For other
systems it may just be very annoying.
Some CPUs have the capability to raise an #AC trap when a split lock is
attempted.
Provide a command line option to give the user choices on how to handle
this:
split_lock_detect=
off - not enabled (no traps for split locks)
warn - warn once when an application does a
split lock, but allow it to continue
running.
fatal - Send SIGBUS to applications that cause split lock
On systems that support split lock detection the default is "warn". Note
that if the kernel hits a split lock in any mode other than "off" it will
OOPs.
One implementation wrinkle is that the MSR to control the split lock
detection is per-core, not per thread. This might result in some short
lived races on HT systems in "warn" mode if Linux tries to enable on one
thread while disabling on the other. Race analysis by Sean Christopherson:
- Toggling of split-lock is only done in "warn" mode. Worst case
scenario of a race is that a misbehaving task will generate multiple
#AC exceptions on the same instruction. And this race will only occur
if both siblings are running tasks that generate split-lock #ACs, e.g.
a race where sibling threads are writing different values will only
occur if CPUx is disabling split-lock after an #AC and CPUy is
re-enabling split-lock after *its* previous task generated an #AC.
- Transitioning between off/warn/fatal modes at runtime isn't supported
and disabling is tracked per task, so hardware will always reach a steady
state that matches the configured mode. I.e. split-lock is guaranteed to
be enabled in hardware once all _TIF_SLD threads have been scheduled out.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
Commit
aaf248848d ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
performance counter")
added support for access to the free-running counter via 'perf -e
msr/irperf/', but when exercised, it always returns a 0 count:
BEFORE:
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
624,833 instructions
0 msr/irperf/
Simply set its enable bit - HWCR bit 30 - to make it start counting.
Enablement is restricted to all machines advertising IRPERF capability,
except those susceptible to an erratum that makes the IRPERF return
bad values.
That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
models 20h and above [2].
AFTER (on a family 17h model 31h machine):
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
621,690 instructions
622,490 msr/irperf/
[1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
[2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors
The revision guides are available from the bugzilla Link below.
[ bp: Massage commit message. ]
Fixes: aaf248848d ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
A user has reported that they are seeing spurious corrected errors on
their hardware.
Intel Errata HSD131, HSM142, HSW131, and BDM48 report that "spurious
corrected errors may be logged in the IA32_MC0_STATUS register with
the valid field (bit 63) set, the uncorrected error field (bit 61) not
set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA
Error Code (bits [15:0]) of 0x0005." The Errata PDFs are linked in the
bugzilla below.
Block these spurious errors from the console and logs.
[ bp: Move the intel_filter_mce() header declarations into the already
existing CONFIG_X86_MCE_INTEL ifdeffery. ]
Co-developed-by: Alexander Krupp <centos@akr.yagii.de>
Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206587
Link: https://lkml.kernel.org/r/20200219131611.36816-1-prarit@redhat.com
Accessing the MCA thresholding controls in sysfs concurrently with CPU
hotplug can lead to a couple of KASAN-reported issues:
BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
Read of size 8 at addr ffff888367578940 by task grep/4019
and
BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
Read of size 2 at addr ffff888368a05514 by task grep/4454
for example. Both result from the fact that the threshold block
creation/teardown code frees the descriptor memory itself instead of
defining proper ->release function and leaving it to the driver core to
take care of that, after all sysfs accesses have completed.
Do that and get rid of the custom freeing code, fixing the above UAFs in
the process.
[ bp: write commit message. ]
Fixes: 9526866439 ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
threshold_create_bank() creates a bank descriptor per MCA error
thresholding counter which can be controlled over sysfs. It publishes
the pointer to that bank in a per-CPU variable and then goes on to
create additional thresholding blocks if the bank has such.
However, that creation of additional blocks in
allocate_threshold_blocks() can fail, leading to a use-after-free
through the per-CPU pointer.
Therefore, publish that pointer only after all blocks have been setup
successfully.
Fixes: 019f34fccf ("x86, MCE, AMD: Move shared bank to node descriptor")
Reported-by: Saar Amar <Saar.Amar@microsoft.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- three fixes and a cleanup for the resctrl code
- a HyperV fix
- a fix to /proc/kcore contents in live debugging sessions
- a fix for the x86 decoder opcode map"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/decoder: Add TEST opcode to Group3-2
x86/resctrl: Clean up unused function parameter in mkdir path
x86/resctrl: Fix a deadlock due to inaccurate reference
x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
x86/resctrl: Fix use-after-free when deleting resource groups
x86/hyper-v: Add "polling" bit to hv_synic_sint
x86/crash: Define arch_crash_save_vmcoreinfo() if CONFIG_CRASH_CORE=y
Unfortunately, GCC 9.1 is expected to be be released without support for
MPX. This means that there was only a relatively small window where
folks could have ever used MPX. It failed to gain wide adoption in the
industry, and Linux was the only mainstream OS to ever support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge window.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJeK1/pAAoJEGg1lTBwyZKwgC8QAIiVn1d7A9Uj/WpnpgfCChCZ
9XiV6Ak999qD9fbAcrgNfPjieaD4mtokocSRVJuRgJu5iLnIJCINlozLPe4yVl7P
7zebnxkLq0CIA8d56bEUoFlC0J+oWYlDVQePZzNQsSk5KHVGXVLpF6U4vDVzZeQy
cprgvdeY+ehB7G6IIo0MWTg5ylKYAsOAyVvK8NIGpKY2k6/YqCnsptnsVE7bvlHy
TrEOiUWLv+hh0bMkZdP1PwKQKEuMO/IZly0HtviFbMN7T4TB1spfg7ELoBucEq3T
s4EVbYRe+nIE4tuEAveaX3CgxJek8cY5MlticskdaKSEACBwabdOF55qsZy0u+WA
PYC4iUIXfbOH8OgieKWtGX4IuSkRYdQ2nP4BOpe4ZX4+zvU7zOCIyVSKRrwkX8cc
ADtWI5FAtB36KCgUuWnHGHNZpOxPTbTLBuBataFY4Q2uBNJEBJpscZ5H9ObtyGFU
ZjlzqFnM0nFNDKEI1EEtv9jLzgZTU1RQ46s7EFeSeEQ2/s9wJ3+s5sBlVbljsmus
o658bLOEaRWC/aF15dgmEXW9GAO6uifNdmbzGnRn7oEMYyFQPTWbZvi1zGz58QaG
Y6WTtigVtsSrHS4wpYd+p+n1W06VnB6J3BpBM4G1VQv1Vm0dNd1tUOfkqOzPjg7c
33Itmsz2LaW1mb67GlgZ
=g4cC
-----END PGP SIGNATURE-----
Merge tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx
Pull x86 MPX removal from Dave Hansen:
"MPX requires recompiling applications, which requires compiler
support. Unfortunately, GCC 9.1 is expected to be be released without
support for MPX. This means that there was only a relatively small
window where folks could have ever used MPX. It failed to gain wide
adoption in the industry, and Linux was the only mainstream OS to ever
support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge
window when the MPX prctl()s were removed. XSAVE support is left in
place, which allows MPX-using KVM guests to continue to function"
* tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx:
x86/mpx: remove MPX from arch/x86
mm: remove arch_bprm_mm_init() hook
x86/mpx: remove bounds exception code
x86/mpx: remove build infrastructure
x86/alternatives: add missing insn.h include
Pull x86 mtrr updates from Ingo Molnar:
"Two changes: restrict /proc/mtrr to CAP_SYS_ADMIN, plus a cleanup"
* 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mtrr: Require CAP_SYS_ADMIN for all access
x86/mtrr: Get rid of mtrr_seq_show() forward declaration
Pull x86 cpu-features updates from Ingo Molnar:
"The biggest change in this cycle was a large series from Sean
Christopherson to clean up the handling of VMX features. This both
fixes bugs/inconsistencies and makes the code more coherent and
future-proof.
There are also two cleanups and a minor TSX syslog messages
enhancement"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/cpu: Remove redundant cpu_detect_cache_sizes() call
x86/cpu: Print "VMX disabled" error message iff KVM is enabled
KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
KVM: VMX: Check for full VMX support when verifying CPU compatibility
KVM: VMX: Use VMX feature flag to query BIOS enabling
KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
x86/vmx: Introduce VMX_FEATURES_*
x86/cpu: Clear VMX feature flag if VMX is not fully enabled
x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
x86/centaur: Use common IA32_FEAT_CTL MSR initialization
x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
x86/intel: Initialize IA32_FEAT_CTL MSR at boot
tools/x86: Sync msr-index.h from kernel sources
selftests, kvm: Replace manual MSR defs with common msr-index.h
...
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups all around the map"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Remove amd_get_topology_early()
x86/tsc: Remove redundant assignment
x86/crash: Use resource_size()
x86/cpu: Add a missing prototype for arch_smt_update()
x86/nospec: Remove unused RSB_FILL_LOOPS
x86/vdso: Provide missing include file
x86/Kconfig: Correct spelling and punctuation
Documentation/x86/boot: Fix typo
x86/boot: Fix a comment's incorrect file reference
x86/process: Remove set but not used variables prev and next
x86/Kconfig: Fix Kconfig indentation
Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
ABI with a new file that helps tooling to navigate from tasks back to
resctrl groups: /proc/{pid}/cpu_resctrl_groups.
Also fix static key usage for certain feature combinations and
simplify the task exit resctrl case"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Add task resctrl information display
x86/resctrl: Check monitoring static key in the MBM overflow handler
x86/resctrl: Do not reconfigure exiting tasks
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Cleanup of the GOP [graphics output] handling code in the EFI stub
- Complete refactoring of the mixed mode handling in the x86 EFI stub
- Overhaul of the x86 EFI boot/runtime code
- Increase robustness for mixed mode code
- Add the ability to disable DMA at the root port level in the EFI
stub
- Get rid of RWX mappings in the EFI memory map and page tables,
where possible
- Move the support code for the old EFI memory mapping style into its
only user, the SGI UV1+ support code.
- plus misc fixes, updates, smaller cleanups.
... and due to interactions with the RWX changes, another round of PAT
cleanups make a guest appearance via the EFI tree - with no side
effects intended"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
efi/x86: Disable instrumentation in the EFI runtime handling code
efi/libstub/x86: Fix EFI server boot failure
efi/x86: Disallow efi=old_map in mixed mode
x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping
efi: Fix handling of multiple efi_fake_mem= entries
efi: Fix efi_memmap_alloc() leaks
efi: Add tracking for dynamically allocated memmaps
efi: Add a flags parameter to efi_memory_map
efi: Fix comment for efi_mem_type() wrt absent physical addresses
efi/arm: Defer probe of PCIe backed efifb on DT systems
efi/x86: Limit EFI old memory map to SGI UV machines
efi/x86: Avoid RWX mappings for all of DRAM
efi/x86: Don't map the entire kernel text RW for mixed mode
x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
efi/libstub/x86: Fix unused-variable warning
efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
efi/libstub/x86: Use const attribute for efi_is_64bit()
efi: Allow disabling PCI busmastering on bridges during boot
efi/x86: Allow translating 64-bit arguments for mixed mode calls
...
and improvements:
- Update the cached HLE state when the TSX state is changed via the new
control register. This ensures feature bit consistency.
- Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS vulnerabilities.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vc5ITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRujEACeWTy4k+eCprV2tu4O/e15VaGaIdRp
Kdv02kZ1U/3upSubRYW8XsBqesCdbSzfO78fE7yQYZmTbOxczSE1t2XH5Cj7c3ig
3D4gzXaH2XKf+j9cXceba/3cpRsMU8fAhQYr+Immi/dG2ORs8i2C4G0cxKthqvbP
R6qdS+Pg9Mo1a2c2nLRGMDyNbNdaYXiJEiWSGNlc6Den/vog+jR4T2WV5a5IjeKL
yjjukoLoeScJ/3gYK0BQkaGNY3BeU1H5ECtbVM3SBFkp2a9OVGhr1bSxwO1U+T8C
fdzwJZU/p+JTb69kx91Tz8+wd2zLtV1wxbdZY7nyUUfGd80q0eU4sQZQyY13mONw
0LRvXrfHjIsRPSiRY3iyHYq9I0UZmjG7oInqRLimLrJmjrPtALhu/mO5TjZbxjzM
FSVx5fWbZYUgnrQpB+hjqiBcrc3sjjrDwU6DufbSq/AhTBKQ/o0mC+S63OhJ0H7C
Bzp1m0OniDuNRtYD0LZm0ktAkLpCTy5lgGBQ6ffNqs87ivFi0GTErqml5GrUEKQ5
mq7h56mf00jKPBYOVoYs9MuWJ18JVJuVNipdlM+P7027XDRpl46l9g87YuMhIol/
tQsDHB4Dv/1QTZY3K3MuNLFZaQJ4/DI9LA2B1br1PdHbZSDd81NflT6FKE11rqWC
T+DCpmj+smRqVw==
=dOgd
-----END PGP SIGNATURE-----
Merge tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
"The performance deterioration departement provides a few non-scary
fixes and improvements:
- Update the cached HLE state when the TSX state is changed via the
new control register. This ensures feature bit consistency.
- Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS
vulnerabilities"
* tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/swapgs: Exclude Zhaoxin CPUs from SWAPGS vulnerability
x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
Pull RAS updates from Borislav Petkov:
- Misc fixes to the MCE code all over the place, by Jan H. Schönherr.
- Initial support for AMD F19h and other cleanups to amd64_edac, by
Yazen Ghannam.
- Other small cleanups.
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Make fam_ops static global
EDAC/amd64: Drop some family checks for newer systems
EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
x86/amd_nb: Add Family 19h PCI IDs
EDAC/mce_amd: Always load on SMCA systems
x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
x86/mce: Fix use of uninitialized MCE message string
x86/mce: Fix mce=nobootlog
x86/mce: Take action on UCNA/Deferred errors again
x86/mce: Remove mce_inject_log() in favor of mce_log()
x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
... and fold its function body into its single call site.
No functional changes:
# arch/x86/kernel/cpu/amd.o:
text data bss dec hex filename
5994 385 1 6380 18ec amd.o.before
5994 385 1 6380 18ec amd.o.after
md5:
99ec6daa095b502297884e949c520f90 amd.o.before.asm
99ec6daa095b502297884e949c520f90 amd.o.after.asm
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200123165811.5288-1-bp@alien8.de
From: Dave Hansen <dave.hansen@linux.intel.com>
MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).
This removes all the remaining (dead at this point) MPX handling
code remaining in the tree. The only remaining code is the XSAVE
support for MPX state which is currently needd for KVM to handle
VMs which might use MPX.
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Commit
334b0f4e9b ("x86/resctrl: Fix a deadlock due to inaccurate reference")
changed the argument to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
within mkdir_rdt_prepare(). That change resulted in an unused function
parameter to mkdir_rdt_prepare().
Clean up the unused function parameter in mkdir_rdt_prepare() and its
callers rdtgroup_mkdir_mon() and rdtgroup_mkdir_ctrl_mon().
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1578500886-21771-5-git-send-email-xiaochen.shen@intel.com
There is a race condition which results in a deadlock when rmdir and
mkdir execute concurrently:
$ ls /sys/fs/resctrl/c1/mon_groups/m1/
cpus cpus_list mon_data tasks
Thread 1: rmdir /sys/fs/resctrl/c1
Thread 2: mkdir /sys/fs/resctrl/c1/mon_groups/m1
3 locks held by mkdir/48649:
#0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
#1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c13b>] filename_create+0x7b/0x170
#2: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
4 locks held by rmdir/48652:
#0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
#1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c3cf>] do_rmdir+0x13f/0x1e0
#2: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffb4c86d5d>] vfs_rmdir+0x4d/0x120
#3: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
Thread 1 is deleting control group "c1". Holding rdtgroup_mutex,
kernfs_remove() removes all kernfs nodes under directory "c1"
recursively, then waits for sub kernfs node "mon_groups" to drop active
reference.
Thread 2 is trying to create a subdirectory "m1" in the "mon_groups"
directory. The wrapper kernfs_iop_mkdir() takes an active reference to
the "mon_groups" directory but the code drops the active reference to
the parent directory "c1" instead.
As a result, Thread 1 is blocked on waiting for active reference to drop
and never release rdtgroup_mutex, while Thread 2 is also blocked on
trying to get rdtgroup_mutex.
Thread 1 (rdtgroup_rmdir) Thread 2 (rdtgroup_mkdir)
(rmdir /sys/fs/resctrl/c1) (mkdir /sys/fs/resctrl/c1/mon_groups/m1)
------------------------- -------------------------
kernfs_iop_mkdir
/*
* kn: "m1", parent_kn: "mon_groups",
* prgrp_kn: parent_kn->parent: "c1",
*
* "mon_groups", parent_kn->active++: 1
*/
kernfs_get_active(parent_kn)
kernfs_iop_rmdir
/* "c1", kn->active++ */
kernfs_get_active(kn)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
/* "c1", kn->active-- */
kernfs_break_active_protection(kn)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
sentry->flags = RDT_DELETED
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
kernfs_get(kn)
kernfs_remove(rdtgrp->kn)
__kernfs_remove
/* "mon_groups", sub_kn */
atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active)
kernfs_drain(sub_kn)
/*
* sub_kn->active == KN_DEACTIVATED_BIAS + 1,
* waiting on sub_kn->active to drop, but it
* never drops in Thread 2 which is blocked
* on getting rdtgroup_mutex.
*/
Thread 1 hangs here ---->
wait_event(sub_kn->active == KN_DEACTIVATED_BIAS)
...
rdtgroup_mkdir
rdtgroup_mkdir_mon(parent_kn, prgrp_kn)
mkdir_rdt_prepare(parent_kn, prgrp_kn)
rdtgroup_kn_lock_live(prgrp_kn)
atomic_inc(&rdtgrp->waitcount)
/*
* "c1", prgrp_kn->active--
*
* The active reference on "c1" is
* dropped, but not matching the
* actual active reference taken
* on "mon_groups", thus causing
* Thread 1 to wait forever while
* holding rdtgroup_mutex.
*/
kernfs_break_active_protection(
prgrp_kn)
/*
* Trying to get rdtgroup_mutex
* which is held by Thread 1.
*/
Thread 2 hangs here ----> mutex_lock
...
The problem is that the creation of a subdirectory in the "mon_groups"
directory incorrectly releases the active protection of its parent
directory instead of itself before it starts waiting for rdtgroup_mutex.
This is triggered by the rdtgroup_mkdir() flow calling
rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with kernfs node of the
parent control group ("c1") as argument. It should be called with kernfs
node "mon_groups" instead. What is currently missing is that the
kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp.
Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. Then
it could be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
instead. And then it operates on the same rdtgroup structure but handles
the active reference of kernfs node "mon_groups" to prevent deadlock.
The same changes are also made to the "mon_data" directories.
This results in some unused function parameters that will be cleaned up
in follow-up patch as the focus here is on the fix only in support of
backporting efforts.
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-4-git-send-email-xiaochen.shen@intel.com
There is a race condition in the following scenario which results in an
use-after-free issue when reading a monitoring file and deleting the
parent ctrl_mon group concurrently:
Thread 1 calls atomic_inc() to take refcount of rdtgrp and then calls
kernfs_break_active_protection() to drop the active reference of kernfs
node in rdtgroup_kn_lock_live().
In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub
kernfs nodes to drop the active reference when removing all subtree
kernfs nodes recursively. Thread 2 could block on kernfs_remove() until
Thread 1 calls kernfs_break_active_protection(). Only after
kernfs_remove() completes the refcount of rdtgrp could be trusted.
Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(),
Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0
instead of 1 due to the race.
In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_mondata_show) Thread 2 (rdtgroup_rmdir)
-------------------------------- -------------------------
rdtgroup_kn_lock_live
/*
* kn active protection until
* kernfs_break_active_protection(kn)
*/
rdtgrp = kernfs_to_rdtgroup(kn)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
/*
* sentry->waitcount should be 1
* but is 0 now due to the race.
*/
kfree(sentry)*[1]
/*
* Only after kernfs_remove()
* completes, the refcount of
* rdtgrp could be trusted.
*/
atomic_inc(&rdtgrp->waitcount)
/* kn->active-- */
kernfs_break_active_protection(kn)
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
/*
* Blocking routine, wait for
* all sub kernfs nodes to drop
* active reference in
* kernfs_break_active_protection.
*/
kernfs_remove(rdtgrp->kn)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(
&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kernfs_unbreak_active_protection(kn)
kfree(rdtgrp)
mutex_lock
mon_event_read
rdtgroup_kn_unlock
mutex_unlock
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [1].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
/* kn->active++ */
kernfs_unbreak_active_protection(kn)
kfree(rdtgrp)
Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in
rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp.
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-3-git-send-email-xiaochen.shen@intel.com
A resource group (rdtgrp) contains a reference count (rdtgrp->waitcount)
that indicates how many waiters expect this rdtgrp to exist. Waiters
could be waiting on rdtgroup_mutex or some work sitting on a task's
workqueue for when the task returns from kernel mode or exits.
The deletion of a rdtgrp is intended to have two phases:
(1) while holding rdtgroup_mutex the necessary cleanup is done and
rdtgrp->flags is set to RDT_DELETED,
(2) after releasing the rdtgroup_mutex, the rdtgrp structure is freed
only if there are no waiters and its flag is set to RDT_DELETED. Upon
gaining access to rdtgroup_mutex or rdtgrp, a waiter is required to check
for the RDT_DELETED flag.
When unmounting the resctrl file system or deleting ctrl_mon groups,
all of the subdirectories are removed and the data structure of rdtgrp
is forcibly freed without checking rdtgrp->waitcount. If at this point
there was a waiter on rdtgrp then a use-after-free issue occurs when the
waiter starts running and accesses the rdtgrp structure it was waiting
on.
See kfree() calls in [1], [2] and [3] in these two call paths in
following scenarios:
(1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp()
(2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()
There are several scenarios that result in use-after-free issue in
following:
Scenario 1:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdt_kill_sb() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_tasks_write) Thread 2 (rdt_kill_sb)
------------------------------- ----------------------
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_move_task
__rdtgroup_move_task
/*
* Take an extra refcount, so rdtgrp cannot be freed
* before the call back move_myself has been invoked
*/
atomic_inc(&rdtgrp->waitcount)
/* Callback move_myself will be scheduled for later */
task_work_add(move_myself)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
mutex_lock
rmdir_all_sub
/*
* sentry and rdtgrp are freed
* without checking refcount
*/
free_all_child_rdtgrp
kfree(sentry)*[1]
kfree(rdtgrp)*[2]
mutex_unlock
/*
* Callback is scheduled to execute
* after rdt_kill_sb is finished
*/
move_myself
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [1] or [2].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
Scenario 2:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdtgroup_rmdir() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_tasks_write) Thread 2 (rdtgroup_rmdir)
------------------------------- -------------------------
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_move_task
__rdtgroup_move_task
/*
* Take an extra refcount, so rdtgrp cannot be freed
* before the call back move_myself has been invoked
*/
atomic_inc(&rdtgrp->waitcount)
/* Callback move_myself will be scheduled for later */
task_work_add(move_myself)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
/*
* sentry is freed without
* checking refcount
*/
kfree(sentry)*[3]
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(
&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
/*
* Callback is scheduled to execute
* after rdt_kill_sb is finished
*/
move_myself
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [3].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
If CONFIG_DEBUG_SLAB=y, Slab corruption on kmalloc-2k can be observed
like following. Note that "0x6b" is POISON_FREE after kfree(). The
corrupted bits "0x6a", "0x64" at offset 0x424 correspond to
waitcount member of struct rdtgroup which was freed:
Slab corruption (Not tainted): kmalloc-2k start=ffff9504c5b0d000, len=2048
420: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkjkkkkkkkkkkk
Single bit error detected. Probably bad RAM.
Run memtest86+ or a similar memory test tool.
Next obj: start=ffff9504c5b0d800, len=2048
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Slab corruption (Not tainted): kmalloc-2k start=ffff9504c58ab800, len=2048
420: 6b 6b 6b 6b 64 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkdkkkkkkkkkkk
Prev obj: start=ffff9504c58ab000, len=2048
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Fix this by taking reference count (waitcount) of rdtgrp into account in
the two call paths that currently do not do so. Instead of always
freeing the resource group it will only be freed if there are no waiters
on it. If there are waiters, the resource group will have its flags set
to RDT_DELETED.
It will be left to the waiter to free the resource group when it starts
running and finding that it was the last waiter and the resource group
has been removed (rdtgrp->flags & RDT_DELETED) since. (1) rdt_kill_sb()
-> rmdir_all_sub() -> free_all_child_rdtgrp() (2) rdtgroup_rmdir() ->
rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-2-git-send-email-xiaochen.shen@intel.com
Both functions call init_intel_cacheinfo() which computes L2 and L3 cache
sizes from CPUID(4). But then they also call cpu_detect_cache_sizes() a
bit later which computes ->x86_tlbsize and L2 size from CPUID(80000006).
However, the latter call is not needed because
- on these CPUs, CPUID(80000006).EBX for ->x86_tlbsize is reserved
- CPUID(80000006).ECX for the L2 size has the same result as CPUID(4)
Therefore, remove the latter call to simplify the code.
[ bp: Rewrite commit message. ]
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1579075257-6985-1-git-send-email-TonyWWang-oc@zhaoxin.com
Monitoring tools that want to find out which resctrl control and monitor
groups a task belongs to must currently read the "tasks" file in every
group until they locate the process ID.
Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this
information:
1) res:
mon:
resctrl is not available.
2) res:/
mon:
Task is part of the root resctrl control group, and it is not associated
to any monitor group.
3) res:/
mon:mon0
Task is part of the root resctrl control group and monitor group mon0.
4) res:group0
mon:
Task is part of resctrl control group group0, and it is not associated
to any monitor group.
5) res:group0
mon:mon1
Task is part of resctrl control group group0 and monitor group mon1.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jinshi Chen <jinshi.chen@intel.com>
Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- a resctrl fix for uninitialized objects found by debugobjects
- a resctrl memory leak fix
- fix the unintended re-enabling of the of SME and SEV CPU flags if
memory encryption was disabled at bootup via the MSR space"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
x86/resctrl: Fix potential memory leak
x86/resctrl: Fix an imbalance in domain_remove_cpu()
Currently, there are three static keys in the resctrl file system:
rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
feature and the allocation feature are enabled, respectively. The
rdt_enable_key is enabled when either the monitoring feature or the
allocation feature is enabled.
If no monitoring feature is present (either hardware doesn't support a
monitoring feature or the feature is disabled by the kernel command line
option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
is disabled.
MBM is a monitoring feature. The MBM overflow handler intends to
check if the monitoring feature is not enabled for fast return.
So check the rdt_mon_enable_key in it instead of the rdt_enable_key as
former is the more accurate check.
[ bp: Massage commit message. ]
Fixes: e33026831b ("x86/intel_rdt/mbm: Handle counter overflow")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
New Zhaoxin family 7 CPUs are not affected by the SWAPGS vulnerability. So
mark these CPUs in the cpu vulnerability whitelist accordingly.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579227872-26972-3-git-send-email-TonyWWang-oc@zhaoxin.com
New Zhaoxin family 7 CPUs are not affected by SPECTRE_V2. So define a
separate cpu_vuln_whitelist bit NO_SPECTRE_V2 and add these CPUs to the cpu
vulnerability whitelist.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579227872-26972-2-git-send-email-TonyWWang-oc@zhaoxin.com
/proc/cpuinfo currently reports Hardware Lock Elision (HLE) feature to
be present on boot cpu even if it was disabled during the bootup. This
is because cpuinfo_x86->x86_capability HLE bit is not updated after TSX
state is changed via the new MSR IA32_TSX_CTRL.
Update the cached HLE bit also since it is expected to change after an
update to CPUID_CLEAR bit in MSR IA32_TSX_CTRL.
Fixes: 95c5824f75 ("x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default")
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/2529b99546294c893dfa1c89e2b3e46da3369a59.1578685425.git.pawan.kumar.gupta@linux.intel.com
If the SME and SEV features are present via CPUID, but memory encryption
support is not enabled (MSR 0xC001_0010[23]), the feature flags are cleared
using clear_cpu_cap(). However, if get_cpu_cap() is later called, these
feature flags will be reset back to present, which is not desired.
Change from using clear_cpu_cap() to setup_clear_cpu_cap() so that the
clearing of the flags is maintained.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.16.x-
Link: https://lkml.kernel.org/r/226de90a703c3c0be5a49565047905ac4e94e8f3.1579125915.git.thomas.lendacky@amd.com
Add support for a new version of the Load Store unit bank type as
indicated by its McaType value, which will be present in future SMCA
systems.
Add the new (HWID, MCATYPE) tuple. Reuse the same name, since this is
logically the same to the user.
Also, add the new error descriptions to edac_mce_amd.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-2-Yazen.Ghannam@amd.com
Don't print an error message about VMX being disabled by BIOS if KVM,
the sole user of VMX, is disabled. E.g. if KVM is disabled and the MSR
is unlocked, the kernel will intentionally disable VMX when locking
feature control and then complain that "BIOS" disabled VMX.
Fixes: ef4d3bf198 ("x86/cpu: Clear VMX feature flag if VMX is not fully enabled")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200114202545.20296-1-sean.j.christopherson@intel.com
It is relatively easy to trigger the following boot splat on an Ice Lake
client platform. The call stack is like:
kernel BUG at kernel/timer/timer.c:1152!
Call Trace:
__queue_delayed_work
queue_delayed_work_on
therm_throt_process
intel_thermal_interrupt
...
The reason is that a CPU's thermal interrupt is enabled prior to
executing its hotplug onlining callback which will initialize the
throttling workqueues.
Such a race can lead to therm_throt_process() accessing an uninitialized
therm_work, leading to the above BUG at a very early bootup stage.
Therefore, unmask the thermal interrupt vector only after having setup
the workqueues completely.
[ bp: Heavily massage commit message and correct comment formatting. ]
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200107004116.59353-1-chuansheng.liu@intel.com
Add a new feature flag, X86_FEATURE_MSR_IA32_FEAT_CTL, to track whether
IA32_FEAT_CTL has been initialized. This will allow KVM, and any future
subsystems that depend on IA32_FEAT_CTL, to rely purely on cpufeatures
to query platform support, e.g. allows a future patch to remove KVM's
manual IA32_FEAT_CTL MSR checks.
Various features (on platforms that support IA32_FEAT_CTL) are dependent
on IA32_FEAT_CTL being configured and locked, e.g. VMX and LMCE. The
MSR is always configured during boot, but only if the CPU vendor is
recognized by the kernel. Because CPUID doesn't incorporate the current
IA32_FEAT_CTL value in its reporting of relevant features, it's possible
for a feature to be reported as supported in cpufeatures but not truly
enabled, e.g. if the CPU supports VMX but the kernel doesn't recognize
the CPU.
As a result, without the flag, KVM would see VMX as supported even if
IA32_FEAT_CTL hasn't been initialized, and so would need to manually
read the MSR and check the various enabling bits to avoid taking an
unexpected #GP on VMXON.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-14-sean.j.christopherson@intel.com
Set the synthetic VMX cpufeatures, which need to be kept to preserve
/proc/cpuinfo's ABI, in the common IA32_FEAT_CTL initialization code.
Remove the vendor code that manually sets the synthetic flags.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-13-sean.j.christopherson@intel.com
Add support for generating VMX feature names in capflags.c and use the
resulting x86_vmx_flags to print the VMX flags in /proc/cpuinfo. Don't
print VMX flags if no bits are set in word 0, which holds Pin Controls.
Pin Control's INTR and NMI exiting are fundamental pillars of VMX, if
they are not supported then the CPU is broken, it does not actually
support VMX, or the kernel wasn't built with support for the target CPU.
Print the features in a dedicated "vmx flags" line to avoid polluting
the common "flags" and to avoid having to prefix all flags with "vmx_",
which results in horrendously long names.
Keep synthetic VMX flags in cpufeatures to preserve /proc/cpuinfo's ABI
for those flags. This means that "flags" and "vmx flags" will have
duplicate entries for tpr_shadow (virtual_tpr), vnmi, ept, flexpriority,
vpid and ept_ad, but caps the pollution of "flags" at those six VMX
features. The vendor-specific code that populates the synthetic flags
will be consolidated in a future patch to further minimize the lasting
damage.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-12-sean.j.christopherson@intel.com
Add an entry in struct cpuinfo_x86 to track VMX capabilities and fill
the capabilities during IA32_FEAT_CTL MSR initialization.
Make the VMX capabilities dependent on IA32_FEAT_CTL and
X86_FEATURE_NAMES so as to avoid unnecessary overhead on CPUs that can't
possibly support VMX, or when /proc/cpuinfo is not available.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-11-sean.j.christopherson@intel.com
Now that IA32_FEAT_CTL is always configured and locked for CPUs that are
known to support VMX[*], clear the VMX capability flag if the MSR is
unsupported or BIOS disabled VMX, i.e. locked IA32_FEAT_CTL and didn't
set the appropriate VMX enable bit.
[*] Because init_ia32_feat_ctl() is called from vendors ->c_init(), it's
still possible for IA32_FEAT_CTL to be left unlocked when VMX is
supported by the CPU. This is not fatal, and will be addressed in a
future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-9-sean.j.christopherson@intel.com
Use the recently added IA32_FEAT_CTL MSR initialization sequence to
opportunistically enable VMX support when running on a Zhaoxin CPU.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-8-sean.j.christopherson@intel.com
Use the recently added IA32_FEAT_CTL MSR initialization sequence to
opportunistically enable VMX support when running on a Centaur CPU.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-7-sean.j.christopherson@intel.com
Opportunistically initialize IA32_FEAT_CTL to enable VMX when the MSR is
left unlocked by BIOS. Configuring feature control at boot time paves
the way for similar enabling of other features, e.g. Software Guard
Extensions (SGX).
Temporarily leave equivalent KVM code in place in order to avoid
introducing a regression on Centaur and Zhaoxin CPUs, e.g. removing
KVM's code would leave the MSR unlocked on those CPUs and would break
existing functionality if people are loading kvm_intel on Centaur and/or
Zhaoxin. Defer enablement of the boot-time configuration on Centaur and
Zhaoxin to future patches to aid bisection.
Note, Local Machine Check Exceptions (LMCE) are also supported by the
kernel and enabled via feature control, but the kernel currently uses
LMCE if and only if the feature is explicitly enabled by BIOS. Keep
the current behavior to avoid introducing bugs, future patches can opt
in to opportunistic enabling if it's deemed desirable to do so.
Always lock IA32_FEAT_CTL if it exists, even if the CPU doesn't support
VMX, so that other existing and future kernel code that queries the MSR
can assume it's locked.
Start from a clean slate when constructing the value to write to
IA32_FEAT_CTL, i.e. ignore whatever value BIOS left in the MSR so as not
to enable random features or fault on the WRMSR.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-5-sean.j.christopherson@intel.com
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
are quite a mouthful, especially the VMX bits which must differentiate
between enabling VMX inside and outside SMX (TXT) operation. Rename the
MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
make them a little friendlier on the eyes.
Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
to match Intel's SDM, but a future patch will add a dedicated Kconfig,
file and functions for the MSR. Using the full name for those assets is
rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
nomenclature is consistent throughout the kernel.
Opportunistically, fix a few other annoyances with the defines:
- Relocate the bit defines so that they immediately follow the MSR
define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
- Add whitespace around the block of feature control defines to make
it clear they're all related.
- Use BIT() instead of manually encoding the bit shift.
- Use "VMX" instead of "VMXON" to match the SDM.
- Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
be consistent with the kernel's verbiage used for all other feature
control bits. Note, the SDM refers to the LMCE bit as LMCE_ON,
likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore
the (literal) one-off usage of _ON, the SDM is simply "wrong".
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
When writing a pid to file "tasks", a callback function move_myself() is
queued to this task to be called when the task returns from kernel mode
or exits. The purpose of move_myself() is to activate the newly assigned
closid and/or rmid associated with this task. This activation is done by
calling resctrl_sched_in() from move_myself(), the same function that is
called when switching to this task.
If this work is successfully queued but then the task enters PF_EXITING
status (e.g., receiving signal SIGKILL, SIGTERM) prior to the
execution of the callback move_myself(), move_myself() still calls
resctrl_sched_in() since the task status is not currently considered.
When a task is exiting, the data structure of the task itself will
be freed soon. Calling resctrl_sched_in() to write the register that
controls the task's resources is unnecessary and it implies extra
performance overhead.
Add check on task status in move_myself() and return immediately if the
task is PF_EXITING.
[ bp: Massage. ]
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1578500026-21152-1-git-send-email-xiaochen.shen@intel.com
The function mce_severity() is not required to update its msg argument.
In fact, mce_severity_amd() does not, which makes mce_no_way_out()
return uninitialized data, which may be used later for printing.
Assuming that implementations of mce_severity() either always or never
update the msg argument (which is currently the case), it is sufficient
to initialize the temporary variable in mce_no_way_out().
While at it, avoid printing a useless "Unknown".
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200103150722.20313-4-jschoenh@amazon.de
Since commit
8b38937b7a ("x86/mce: Do not enter deferred errors into the generic
pool twice")
the mce=nobootlog option has become mostly ineffective (after being only
slightly ineffective before), as the code is taking actions on MCEs left
over from boot when they have a usable address.
Move the check for MCP_DONTLOG a bit outward to make it effective again.
Also, since commit
011d826111 ("RAS: Add a Corrected Errors Collector")
the two branches of the remaining "if" at the bottom of machine_check_poll()
do same. Unify them.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200103150722.20313-3-jschoenh@amazon.de
Commit
fa92c58694 ("x86, mce: Support memory error recovery for both UCNA
and Deferred error in machine_check_poll")
added handling of UCNA and Deferred errors by adding them to the ring
for SRAO errors.
Later, commit
fd4cf79fcc ("x86/mce: Remove the MCE ring for Action Optional errors")
switched storage from the SRAO ring to the unified pool that is still
in use today. In order to only act on the intended errors, a filter
for MCE_AO_SEVERITY is used -- effectively removing handling of
UCNA/Deferred errors again.
Extend the severity filter to include UCNA/Deferred errors again.
Also, generalize the naming of the notifier from SRAO to UC to capture
the extended scope.
Note, that this change may cause a message like the following to appear,
as the same address may be reported as SRAO and as UCNA:
Memory failure: 0x5fe3284: already hardware poisoned
Technically, this is a return to previous behavior.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200103150722.20313-2-jschoenh@amazon.de
set_cache_qos_cfg() is leaking memory when the given level is not
RDT_RESOURCE_L3 or RDT_RESOURCE_L2. At the moment, this function is
called with only valid levels but move the allocation after the valid
level checks in order to make it more robust and future proof.
[ bp: Massage commit message. ]
Fixes: 99adde9b37 ("x86/intel_rdt: Enable L2 CDP in MSR IA32_L2_QOS_CFG")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20200102165844.133133-1-shakeelb@google.com
A system that supports resource monitoring may have multiple resources
while not all of these resources are capable of monitoring. Monitoring
related state is initialized only for resources that are capable of
monitoring and correspondingly this state should subsequently only be
removed from these resources that are capable of monitoring.
domain_add_cpu() calls domain_setup_mon_state() only when r->mon_capable
is true where it will initialize d->mbm_over. However,
domain_remove_cpu() calls cancel_delayed_work(&d->mbm_over) without
checking r->mon_capable resulting in an attempt to cancel d->mbm_over on
all resources, even those that never initialized d->mbm_over because
they are not capable of monitoring. Hence, it triggers a debugobjects
warning when offlining CPUs because those timer debugobjects are never
initialized:
ODEBUG: assert_init not available (active state 0) object type:
timer_list hint: 0x0
WARNING: CPU: 143 PID: 789 at lib/debugobjects.c:484
debug_print_object
Hardware name: HP Synergy 680 Gen9/Synergy 680 Gen9 Compute Module, BIOS I40 05/23/2018
RIP: 0010:debug_print_object
Call Trace:
debug_object_assert_init
del_timer
try_to_grab_pending
cancel_delayed_work
resctrl_offline_cpu
cpuhp_invoke_callback
cpuhp_thread_fun
smpboot_thread_fn
kthread
ret_from_fork
Fixes: e33026831b ("x86/intel_rdt/mbm: Handle counter overflow")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: tj@kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191211033042.2188-1-cai@lca.pw
The mutex in mce_inject_log() became unnecessary with commit
5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver"),
though the original reason for its presence only vanished with commit
7298f08ea8 ("x86/mcelog: Get rid of RCU remnants").
Drop the mutex. And as that makes mce_inject_log() identical to mce_log(),
get rid of the former in favor of the latter.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210000733.17979-7-jschoenh@amazon.de
In commit
b2f9d678e2 ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
another call to mce_panic() was introduced. Pass the message of the
handled MCE to that instance of mce_panic() as well, as there doesn't
seem to be a reason not to.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210000733.17979-6-jschoenh@amazon.de
throttle_active_work() is only called if CONFIG_SYSFS is set, otherwise
we get a harmless warning:
arch/x86/kernel/cpu/mce/therm_throt.c:238:13: error: 'throttle_active_work' \
defined but not used [-Werror=unused-function]
Mark the function as __maybe_unused to avoid the warning.
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bberg@redhat.com
Cc: ckellner@redhat.com
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: hdegoede@redhat.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210203925.3119091-1-arnd@arndb.de
The function mce_severity_amd_smca() requires m->bank to be initialized
for correct operation. Fix the one case, where mce_severity() is called
without doing so.
Fixes: 6bda529ec4 ("x86/mce: Grade uncorrected errors for SMCA-enabled systems")
Fixes: d28af26faa ("x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20191210000733.17979-4-jschoenh@amazon.de
Each logical CPU in Scalable MCA systems controls a unique set of MCA
banks in the system. These banks are not shared between CPUs. The bank
types and ordering will be the same across CPUs on currently available
systems.
However, some CPUs may see a bank as Reserved/Read-as-Zero (RAZ) while
other CPUs do not. In this case, the bank seen as Reserved on one CPU is
assumed to be the same type as the bank seen as a known type on another
CPU.
In general, this occurs when the hardware represented by the MCA bank
is disabled, e.g. disabled memory controllers on certain models, etc.
The MCA bank is disabled in the hardware, so there is no possibility of
getting an MCA/MCE from it even if it is assumed to have a known type.
For example:
Full system:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | UMC | UMC
2 | CS | CS
System with hardware disabled:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | UMC | RAZ
2 | CS | CS
For this reason, there is a single, global struct smca_banks[] that is
initialized at boot time. This array is initialized on each CPU as it
comes online. However, the array will not be updated if an entry already
exists.
This works as expected when the first CPU (usually CPU0) has all
possible MCA banks enabled. But if the first CPU has a subset, then it
will save a "Reserved" type in smca_banks[]. Successive CPUs will then
not be able to update smca_banks[] even if they encounter a known bank
type.
This may result in unexpected behavior. Depending on the system
configuration, a user may observe issues enumerating the MCA
thresholding sysfs interface. The issues may be as trivial as sysfs
entries not being available, or as severe as system hangs.
For example:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | RAZ | UMC
2 | CS | CS
Extend the smca_banks[] entry check to return if the entry is a
non-reserved type. Otherwise, continue so that CPUs that encounter a
known bank type can update smca_banks[].
Fixes: 68627a697c ("x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191121141508.141273-1-Yazen.Ghannam@amd.com
... because interrupts are disabled that early and sending IPIs can
deadlock:
BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
no locks held by swapper/1/0.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
softirqs last enabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffffff8104703b>] start_secondary+0x3b/0x190
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.5.0-rc2+ #1
Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
Call Trace:
dump_stack
___might_sleep.cold.92
wait_for_completion
? generic_exec_single
rdmsr_safe_on_cpu
? wrmsr_on_cpus
mce_amd_feature_init
mcheck_cpu_init
identify_cpu
identify_secondary_cpu
smp_store_cpu_info
start_secondary
secondary_startup_64
The function smca_configure() is called only on the current CPU anyway,
therefore replace rdmsr_safe_on_cpu() with atomic rdmsr_safe() and avoid
the IPI.
[ bp: Update commit message. ]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/157252708836.3876.4604398213417262402.stgit@buzz
pat.h is a file whose main purpose is to provide the memtype_*() APIs.
PAT is the low level hardware mechanism - but the high level abstraction
is memtype.
So name the header <memtype.h> as well - this goes hand in hand with memtype.c
and memtype_interval.c.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Zhang Xiaoxu noted that physical address locations for MTRR were visible
to non-root users, which could be considered an information leak.
In discussing[1] the options for solving this, it sounded like just
moving the capable check into open() was the first step.
If this breaks userspace, then we will have a test case for the more
conservative approaches discussed in the thread. In summary:
- MTRR should check capabilities at open time (or retain the
checks on the opener's permissions for later checks).
- changing the DAC permissions might break something that expects to
open mtrr when not uid 0.
- if we leave the DAC permissions alone and just move the capable check
to the opener, we should get the desired protection. (i.e. check
against CAP_SYS_ADMIN not just the wider uid 0.)
- if that still breaks things, as in userspace expects to be able to
read other parts of the file as non-uid-0 and non-CAP_SYS_ADMIN, then
we need to censor the contents using the opener's permissions. For
example, as done in other /proc cases, like commit
51d7b12041 ("/proc/iomem: only expose physical resource addresses to privileged users").
[1] https://lore.kernel.org/lkml/201911110934.AC5BA313@keescook/
Reported-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-security-module@vger.kernel.org
Cc: Matthew Garrett <mjg59@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/201911181308.63F06502A1@keescook
Pull x86 fixes from Ingo Molnar:
"Various fixes:
- Fix the PAT performance regression that downgraded write-combining
device memory regions to uncached.
- There's been a number of bugs in 32-bit double fault handling -
hopefully all fixed now.
- Fix an LDT crash
- Fix an FPU over-optimization that broke with GCC9 code
optimizations.
- Misc cleanups"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Fix off-by-one bugs in interval tree search
x86/ioperm: Save an indentation level in tss_update_io_bitmap()
x86/fpu: Don't cache access to fpu_fpregs_owner_ctx
x86/entry/32: Remove unused 'restore_all_notrace' local label
x86/ptrace: Document FSBASE and GSBASE ABI oddities
x86/ptrace: Remove set_segment_reg() implementations for current
x86/traps: die() instead of panicking on a double fault
x86/doublefault/32: Rewrite the x86_32 #DF handler and unify with 64-bit
x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area
x86/doublefault/32: Rename doublefault.c to doublefault_32.c
x86/traps: Disentangle the 32-bit and 64-bit doublefault code
lkdtm: Add a DOUBLE_FAULT crash type on x86
selftests/x86/single_step_syscall: Check SYSENTER directly
x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
There are three problems with the current layout of the doublefault
stack and TSS. First, the TSS is only cacheline-aligned, which is
not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
crosses a page boundary, horrible things happen [0]. Second, the
stack and TSS are global, so simultaneous double faults on different
CPUs will cause massive corruption. Third, the whole mechanism
won't work if user CR3 is loaded, resulting in a triple fault [1].
Let the doublefault stack and TSS share a page (which prevents the
TSS from spanning a page boundary), make it percpu, and move it into
cpu_entry_area. Teach the stack dump code about the doublefault
stack.
[0] Real hardware will read past the end of the page onto the next
*physical* page if a task switch happens. Virtual machines may
have any number of bugs, and I would consider it reasonable for
a VM to summarily kill the guest if it tries to task-switch to
a page-spanning TSS.
[1] Real hardware triple faults. At least some VMs seem to hang.
I'm not sure what's going on.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 iopl updates from Ingo Molnar:
"This implements a nice simplification of the iopl and ioperm code that
Thomas Gleixner discovered: we can implement the IO privilege features
of the iopl system call by using the IO permission bitmap in
permissive mode, while trapping CLI/STI/POPF/PUSHF uses in user-space
if they change the interrupt flag.
This implements that feature, with testing facilities and related
cleanups"
[ "Simplification" may be an over-statement. The main goal is to avoid
the cli/sti of iopl by effectively implementing the IO port access
parts of iopl in terms of ioperm.
This may end up not workign well in case people actually depend on
cli/sti being available, or if there are mixed uses of iopl and
ioperm. We will see.. - Linus ]
* 'x86-iopl-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/ioperm: Fix use of deprecated config option
x86/entry/32: Clarify register saving in __switch_to_asm()
selftests/x86/iopl: Extend test to cover IOPL emulation
x86/ioperm: Extend IOPL config to control ioperm() as well
x86/iopl: Remove legacy IOPL option
x86/iopl: Restrict iopl() permission scope
x86/iopl: Fixup misleading comment
selftests/x86/ioperm: Extend testing so the shared bitmap is exercised
x86/ioperm: Share I/O bitmap if identical
x86/ioperm: Remove bitmap if all permissions dropped
x86/ioperm: Move TSS bitmap update to exit to user work
x86/ioperm: Add bitmap sequence number
x86/ioperm: Move iobitmap data into a struct
x86/tss: Move I/O bitmap data into a seperate struct
x86/io: Speedup schedule out of I/O bitmap user
x86/ioperm: Avoid bitmap allocation if no permissions are set
x86/ioperm: Simplify first ioperm() invocation logic
x86/iopl: Cleanup include maze
x86/tss: Fix and move VMX BUILD_BUG_ON()
x86/cpu: Unify cpu_init()
...
Pull x86 PTI updates from Ingo Molnar:
"Fix reporting bugs of the MDS and TAA mitigation status, if one or
both are set via a boot option.
No change to mitigation behavior intended"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Fix redundant MDS mitigation message
x86/speculation: Fix incorrect MDS/TAA mitigation status
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- A PAT series from Davidlohr Bueso, which simplifies the memtype
rbtree by using the interval tree helpers. (There's more cleanups
in this area queued up, but they didn't make the merge window.)
- Also flip over CONFIG_X86_5LEVEL to default-y. This might draw in a
few more testers, as all the major distros are going to have
5-level paging enabled by default in their next iterations.
- Misc cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Rename pat_rbtree.c to pat_interval.c
x86/mm/pat: Drop the rbt_ prefix from external memtype calls
x86/mm/pat: Do not pass 'rb_root' down the memtype tree helper functions
x86/mm/pat: Convert the PAT tree to a generic interval tree
x86/mm: Clean up the pmd_read_atomic() comments
x86/mm: Fix function name typo in pmd_read_atomic() comment
x86/cpu: Clean up intel_tlb_table[]
x86/mm: Enable 5-level paging support by default
Pull x86 hyperv updates from Ingo Molnar:
"Misc updates to the hyperv guest code:
- Rework clockevents initialization to better support hibernation
- Allow guests to enable InvariantTSC
- Micro-optimize send_ipi_one"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyperv: Initialize clockevents earlier in CPU onlining
x86/hyperv: Allow guests to enable InvariantTSC
x86/hyperv: Micro-optimize send_ipi_one()
Pull x86 cpu and fpu updates from Ingo Molnar:
- math-emu fixes
- CPUID updates
- sanity-check RDRAND output to see whether the CPU at least pretends
to produce random data
- various unaligned-access across cachelines fixes in preparation of
hardware level split-lock detection
- fix MAXSMP constraints to not allow !CPUMASK_OFFSTACK kernels with
larger than 512 NR_CPUS
- misc FPU related cleanups
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Align the x86_capability array to size of unsigned long
x86/cpu: Align cpu_caps_cleared and cpu_caps_set to unsigned long
x86/umip: Make the comments vendor-agnostic
x86/Kconfig: Rename UMIP config parameter
x86/Kconfig: Enforce limit of 512 CPUs with MAXSMP and no CPUMASK_OFFSTACK
x86/cpufeatures: Add feature bit RDPRU on AMD
x86/math-emu: Limit MATH_EMULATION to 486SX compatibles
x86/math-emu: Check __copy_from_user() result
x86/rdrand: Sanity-check RDRAND output
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Use XFEATURE_FP/SSE enum values instead of hardcoded numbers
x86/fpu: Shrink space allocated for xstate_comp_offsets
x86/fpu: Update stale variable name in comment
Pull RAS updates from Borislav Petkov:
- Fully reworked thermal throttling notifications, there should be no
more spamming of dmesg (Srinivas Pandruvada and Benjamin Berg)
- More enablement for the Intel-compatible CPUs Zhaoxin (Tony W
Wang-oc)
- PPIN support for Icelake (Tony Luck)
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/therm_throt: Optimize notifications of thermal throttle
x86/mce: Add Xeon Icelake to list of CPUs that support PPIN
x86/mce: Lower throttling MCE messages' priority to warning
x86/mce: Add Zhaoxin LMCE support
x86/mce: Add Zhaoxin CMCI support
x86/mce: Add Zhaoxin MCE support
x86/mce/amd: Make disable_err_thresholding() static
Pull x86 microcode updates from Borislav Petkov:
"This converts the late loading method to load the microcode in
parallel (vs sequentially currently). The patch remained in linux-next
for the maximum amount of time so that any potential and hard to debug
fallout be minimized.
Now cloud folks have their milliseconds back but all the normal people
should use early loading anyway :-)"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/intel: Issue the revision updated message only on the BSP
x86/microcode: Update late microcode in parallel
x86/microcode/amd: Fix two -Wunused-but-set-variable warnings
Pull x86 fixes from Ingo Molnar:
"Two fixes: disable unreliable HPET on Intel Coffe Lake platforms, and
fix a lockdep splat in the resctrl code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix potential lockdep warning
x86/quirks: Disable HPET on Intel Coffe Lake platforms
This patch enables KCSAN for x86, with updates to build rules to not use
KCSAN for several incompatible compilation units.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Since MDS and TAA mitigations are inter-related for processors that are
affected by both vulnerabilities, the followiing confusing messages can
be printed in the kernel log:
MDS: Vulnerable
MDS: Mitigation: Clear CPU buffers
To avoid the first incorrect message, defer the printing of MDS
mitigation after the TAA mitigation selection has been done. However,
that has the side effect of printing TAA mitigation first before MDS
mitigation.
[ bp: Check box is affected/mitigations are disabled first before
printing and massage. ]
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mark Gross <mgross@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-3-longman@redhat.com
For MDS vulnerable processors with TSX support, enabling either MDS or
TAA mitigations will enable the use of VERW to flush internal processor
buffers at the right code path. IOW, they are either both mitigated
or both not. However, if the command line options are inconsistent,
the vulnerabilites sysfs files may not report the mitigation status
correctly.
For example, with only the "mds=off" option:
vulnerabilities/mds:Vulnerable; SMT vulnerable
vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
The mds vulnerabilities file has wrong status in this case. Similarly,
the taa vulnerability file will be wrong with mds mitigation on, but
taa off.
Change taa_select_mitigation() to sync up the two mitigation status
and have them turned off if both "mds=off" and "tsx_async_abort=off"
are present.
Update documentation to emphasize the fact that both "mds=off" and
"tsx_async_abort=off" have to be specified together for processors that
are affected by both TAA and MDS to be effective.
[ bp: Massage and add kernel-parameters.txt change too. ]
Fixes: 1b42f01741 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: Mark Gross <mgross@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
If iopl() is disabled, then providing ioperm() does not make much sense.
Rename the config option and disable/enable both syscalls with it. Guard
the code with #ifdefs where appropriate.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The access to the full I/O port range can be also provided by the TSS I/O
bitmap, but that would require to copy 8k of data on scheduling in the
task. As shown with the sched out optimization TSS.io_bitmap_base can be
used to switch the incoming task to a preallocated I/O bitmap which has all
bits zero, i.e. allows access to all I/O ports.
Implementing this allows to provide an iopl() emulation mode which restricts
the IOPL level 3 permissions to I/O port access but removes the STI/CLI
permission which is coming with the hardware IOPL mechansim.
Provide a config option to switch IOPL to emulation mode, make it the
default and while at it also provide an option to disable IOPL completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Add a globally unique sequence number which is incremented when ioperm() is
changing the I/O bitmap of a task. Store the new sequence number in the
io_bitmap structure and compare it with the sequence number of the I/O
bitmap which was last loaded on a CPU. Only update the bitmap if the
sequence is different.
That should further reduce the overhead of I/O bitmap scheduling when there
are only a few I/O bitmap users on the system.
The 64bit sequence counter is sufficient. A wraparound of the sequence
counter assuming an ioperm() call every nanosecond would require about 584
years of uptime.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the non hardware portion of I/O bitmap data into a seperate struct for
readability sake.
Originally-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no requirement to update the TSS I/O bitmap when a thread using it is
scheduled out and the incoming thread does not use it.
For the permission check based on the TSS I/O bitmap the CPU calculates the memory
location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
offset to an address outside of the TSS limit.
If an I/O instruction is issued from user space the TSS limit causes #GP to be
raised in the same was as valid I/O bitmap with all bits set to 1 would do.
This removes the extra work when an I/O bitmap using task is scheduled out
and puts the burden on the rare I/O bitmap users when they are scheduled
in.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Similar to copy_thread_tls() the 32bit and 64bit implementations of
cpu_init() are very similar and unification avoids duplicate changes in the
future.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
cpu_caps_cleared[] and cpu_caps_set[] are arrays of type u32 and therefore
naturally aligned to 4 bytes, which is also unsigned long aligned on
32-bit, but not on 64-bit.
The array pointer is handed into atomic bit operations. If the access not
aligned to unsigned long then the atomic bit operations can end up crossing
a cache line boundary, which causes the CPU to do a full bus lock as it
can't lock both cache lines at once. The bus lock operation is heavy weight
and can cause severe performance degradation.
The upcoming #AC split lock detection mechanism will issue warnings for
this kind of access.
Force the alignment of these arrays to unsigned long. This avoids the
massive code changes which would be required when converting the array data
type to unsigned long.
[ tglx: Rewrote changelog ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20190916223958.27048-2-tony.luck@intel.com
rdtgroup_cpus_write() and mkdir_rdt_prepare() call
rdtgroup_kn_lock_live() -> kernfs_to_rdtgroup() to get 'rdtgrp', and
then call the rdt_last_cmd_{clear,puts,...}() functions which will check
if rdtgroup_mutex is held/requires its caller to hold rdtgroup_mutex.
But if 'rdtgrp' returned from kernfs_to_rdtgroup() is NULL,
rdtgroup_mutex is not held and calling rdt_last_cmd_{clear,puts,...}()
will result in a self-incurred, potential lockdep warning.
Remove the rdt_last_cmd_{clear,puts,...}() calls in these two paths.
Just returning error should be sufficient to report to the user that the
entry doesn't exist any more.
[ bp: Massage. ]
Fixes: 94457b36e8 ("x86/intel_rdt: Add diagnostics when writing the cpus file")
Fixes: cfd0f34e4c ("x86/intel_rdt: Add diagnostics when making directories")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: pei.p.jia@intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1573079796-11713-1-git-send-email-xiaochen.shen@intel.com
Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner:
"The performance deterioration departement is not proud at all of
presenting the seventh installment of speculation mitigations and
hardware misfeature workarounds:
1) TSX Async Abort (TAA) - 'The Annoying Affair'
TAA is a hardware vulnerability that allows unprivileged
speculative access to data which is available in various CPU
internal buffers by using asynchronous aborts within an Intel TSX
transactional region.
The mitigation depends on a microcode update providing a new MSR
which allows to disable TSX in the CPU. CPUs which have no
microcode update can be mitigated by disabling TSX in the BIOS if
the BIOS provides a tunable.
Newer CPUs will have a bit set which indicates that the CPU is not
vulnerable, but the MSR to disable TSX will be available
nevertheless as it is an architected MSR. That means the kernel
provides the ability to disable TSX on the kernel command line,
which is useful as TSX is a truly useful mechanism to accelerate
side channel attacks of all sorts.
2) iITLB Multihit (NX) - 'No eXcuses'
iTLB Multihit is an erratum where some Intel processors may incur
a machine check error, possibly resulting in an unrecoverable CPU
lockup, when an instruction fetch hits multiple entries in the
instruction TLB. This can occur when the page size is changed
along with either the physical address or cache type. A malicious
guest running on a virtualized system can exploit this erratum to
perform a denial of service attack.
The workaround is that KVM marks huge pages in the extended page
tables as not executable (NX). If the guest attempts to execute in
such a page, the page is broken down into 4k pages which are
marked executable. The workaround comes with a mechanism to
recover these shattered huge pages over time.
Both issues come with full documentation in the hardware
vulnerabilities section of the Linux kernel user's and administrator's
guide.
Thanks to all patch authors and reviewers who had the extraordinary
priviledge to be exposed to this nuisance.
Special thanks to Borislav Petkov for polishing the final TAA patch
set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and
providing also the backports to stable kernels for those!"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
Documentation: Add ITLB_MULTIHIT documentation
kvm: x86: mmu: Recovery of shattered NX large pages
kvm: Add helper function for creating VM worker threads
kvm: mmu: ITLB_MULTIHIT mitigation
cpu/speculation: Uninline and export CPU mitigations helpers
x86/cpu: Add Tremont to the cpu vulnerability whitelist
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
x86/tsx: Add config options to set tsx=on|off|auto
x86/speculation/taa: Add documentation for TSX Async Abort
x86/tsx: Add "auto" option to the tsx= cmdline parameter
kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
x86/speculation/taa: Add mitigation for TSX Async Abort
x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
x86/cpu: Add a helper function x86_read_arch_cap_msr()
x86/msr: Add the IA32_TSX_CTRL MSR
Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.
Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate-limited, but still may repeat every few minutes.
This issue became worse starting with the Ivy Bridge generation of
CPUs because they include a TCC activation offset in the MSR
IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before
critical temperatures are reached.
A test run on a laptop with Intel 8th Gen i5 core for two hours with a
workload resulted in 20K+ thermal interrupts per CPU for core level and
another 20K+ interrupts at package level. The kernel logs were full of
throttling messages.
The real value of these threshold interrupts, is to debug problems with
the external cooling solutions and performance issues due to excessive
throttling.
So the solution here is the following:
- In the current thermal_throttle folder, show:
- the maximum time for one throttling event and,
- the total amount of time the system was in throttling state.
- Do not log short excursions.
- Log only when, in spite of thermal throttling, the temperature is rising.
On the high threshold interrupt trigger a delayed workqueue that
monitors the threshold violation log bit (THERM_STATUS_PROCHOT_LOG). When
the log bit is set, this workqueue callback calculates three point moving
average and logs a warning message when the temperature trend is rising.
When this log bit is clear and temperature is below threshold
temperature, then the workqueue callback logs a "Normal" message. Once a
high threshold event is logged, the logging is rate-limited.
With this patch on the same test laptop, no warnings are printed in the logs
as the max time the processor could bring the temperature under control is
only 280 ms.
This implementation is done with the inputs from Alan Cox and Tony Luck.
[ bp: Touchups. ]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: bberg@redhat.com
Cc: ckellner@redhat.com
Cc: hdegoede@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191111214312.81365-1-srinivas.pandruvada@linux.intel.com
If the hardware supports TSC scaling, Hyper-V will set bit 15 of the
HV_PARTITION_PRIVILEGE_MASK in guest VMs with a compatible Hyper-V
configuration version. Bit 15 corresponds to the
AccessTscInvariantControls privilege. If this privilege bit is set,
guests can access the HvSyntheticInvariantTscControl MSR: guests can
set bit 0 of this synthetic MSR to enable the InvariantTSC feature.
After setting the synthetic MSR, CPUID will enumerate support for
InvariantTSC.
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lkml.kernel.org/r/20191003155200.22022-1-parri.andrea@gmail.com
For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of
cpu_bugs_smt_update() causes the function to return early, unintentionally
skipping the MDS and TAA logic.
This is not a problem for MDS, because there appears to be no overlap
between IBRS_ALL and MDS-affected CPUs. So the MDS mitigation would be
disabled and nothing would need to be done in this function anyway.
But for TAA, the TAA_MSG_SMT string will never get printed on Cascade
Lake and newer.
The check is superfluous anyway: when 'spectre_v2_enabled' is
SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always
SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement
handles it appropriately by doing nothing. So just remove the check.
Fixes: 1b42f01741 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.
Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.
Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.
This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen. With nested EPT, again the nested guest can cause problems shadow
and direct EPT is treated in the same way.
[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add the new cpu family ATOM_TREMONT_D to the cpu vunerability
whitelist. ATOM_TREMONT_D is not affected by X86_BUG_ITLB_MULTIHIT.
ATOM_TREMONT_D might have mitigations against other issues as well, but
only the ITLB multihit mitigation is confirmed at this point.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some processors may incur a machine check error possibly resulting in an
unrecoverable CPU lockup when an instruction fetch encounters a TLB
multi-hit in the instruction TLB. This can occur when the page size is
changed along with either the physical address or cache type. The relevant
erratum can be found here:
https://bugzilla.kernel.org/show_bug.cgi?id=205195
There are other processors affected for which the erratum does not fully
disclose the impact.
This issue affects both bare-metal x86 page tables and EPT.
It can be mitigated by either eliminating the use of large pages or by
using careful TLB invalidations when changing the page size in the page
tables.
Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in
MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which
are mitigated against this issue.
Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a mon group is being deleted, rdtgrp->flags is set to RDT_DELETED
in rdtgroup_rmdir_mon() firstly. The structure of rdtgrp will be freed
until rdtgrp->waitcount is dropped to 0 in rdtgroup_kn_unlock() later.
During the window of deleting a mon group, if an application calls
rdtgroup_mondata_show() to read mondata under this mon group,
'rdtgrp' returned from rdtgroup_kn_lock_live() is a NULL pointer when
rdtgrp->flags is RDT_DELETED. And then 'rdtgrp' is passed in this path:
rdtgroup_mondata_show() --> mon_event_read() --> mon_event_count().
Thus it results in NULL pointer dereference in mon_event_count().
Check 'rdtgrp' in rdtgroup_mondata_show(), and return -ENOENT
immediately when reading mondata during the window of deleting a mon
group.
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: pei.p.jia@intel.com
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1572326702-27577-1-git-send-email-xiaochen.shen@intel.com
New CPU model, same MSRs to control and read the inventory number.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191028163719.19708-1-tony.luck@intel.com
There is a general consensus that TSX usage is not largely spread while
the history shows there is a non trivial space for side channel attacks
possible. Therefore the tsx is disabled by default even on platforms
that might have a safe implementation of TSX according to the current
knowledge. This is a fair trade off to make.
There are, however, workloads that really do benefit from using TSX and
updating to a newer kernel with TSX disabled might introduce a
noticeable regressions. This would be especially a problem for Linux
distributions which will provide TAA mitigations.
Introduce config options X86_INTEL_TSX_MODE_OFF, X86_INTEL_TSX_MODE_ON
and X86_INTEL_TSX_MODE_AUTO to control the TSX feature. The config
setting can be overridden by the tsx cmdline options.
[ bp: Text cleanups from Josh. ]
Suggested-by: Borislav Petkov <bpetkov@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Platforms which are not affected by X86_BUG_TAA may want the TSX feature
enabled. Add "auto" option to the TSX cmdline parameter. When tsx=auto
disable TSX when X86_BUG_TAA is present, otherwise enable TSX.
More details on X86_BUG_TAA can be found here:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html
[ bp: Extend the arg buffer to accommodate "auto\0". ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Add the sysfs reporting file for TSX Async Abort. It exposes the
vulnerability and the mitigation state similar to the existing files for
the other hardware vulnerabilities.
Sysfs file path is:
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Reviewed-by: Mark Gross <mgross@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
TSX Async Abort (TAA) is a side channel vulnerability to the internal
buffers in some Intel processors similar to Microachitectural Data
Sampling (MDS). In this case, certain loads may speculatively pass
invalid data to dependent operations when an asynchronous abort
condition is pending in a TSX transaction.
This includes loads with no fault or assist condition. Such loads may
speculatively expose stale data from the uarch data structures as in
MDS. Scope of exposure is within the same-thread and cross-thread. This
issue affects all current processors that support TSX, but do not have
ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES.
On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0,
CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers
using VERW or L1D_FLUSH, there is no additional mitigation needed for
TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by
disabling the Transactional Synchronization Extensions (TSX) feature.
A new MSR IA32_TSX_CTRL in future and current processors after a
microcode update can be used to control the TSX feature. There are two
bits in that MSR:
* TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted
Transactional Memory (RTM).
* TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other
TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally
disabled with updated microcode but still enumerated as present by
CPUID(EAX=7).EBX{bit4}.
The second mitigation approach is similar to MDS which is clearing the
affected CPU buffers on return to user space and when entering a guest.
Relevant microcode update is required for the mitigation to work. More
details on this approach can be found here:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
The TSX feature can be controlled by the "tsx" command line parameter.
If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is
deployed. The effective mitigation state can be read from sysfs.
[ bp:
- massage + comments cleanup
- s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh.
- remove partial TAA mitigation in update_mds_branch_idle() - Josh.
- s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g
]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Add a kernel cmdline parameter "tsx" to control the Transactional
Synchronization Extensions (TSX) feature. On CPUs that support TSX
control, use "tsx=on|off" to enable or disable TSX. Not specifying this
option is equivalent to "tsx=off". This is because on certain processors
TSX may be used as a part of a speculative side channel attack.
Carve out the TSX controlling functionality into a separate compilation
unit because TSX is a CPU feature while the TSX async abort control
machinery will go to cpu/bugs.c.
[ bp: - Massage, shorten and clear the arg buffer.
- Clarifications of the tsx= possible options - Josh.
- Expand on TSX_CTRL availability - Pawan. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Michael reported that the x86/hyperv initialization code prints the
following dmesg when running in a VM on Hyper-V:
[ 0.000738] Booting paravirtualized kernel on bare hardware
Let the x86/hyperv initialization code set pv_info.name to "Hyper-V" so
dmesg reports correctly:
[ 0.000172] Booting paravirtualized kernel on Hyper-V
[ tglx: Folded build fix provided by Yue ]
Reported-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Link: https://lkml.kernel.org/r/20191015103502.13156-1-parri.andrea@gmail.com
On modern CPUs it is quite normal that the temperature limits are
reached and the CPU is throttled. In fact, often the thermal design is
not sufficient to cool the CPU at full load and limits can quickly be
reached when a burst in load happens. This will even happen with
technologies like RAPL limitting the long term power consumption of
the package.
Also, these limits are "softer", as Srinivas explains:
"CPU temperature doesn't have to hit max(TjMax) to get these warnings.
OEMs ha[ve] an ability to program a threshold where a thermal interrupt
can be generated. In some systems the offset is 20C+ (Read only value).
In recent systems, there is another offset on top of it which can be
programmed by OS, once some agent can adjust power limits dynamically.
By default this is set to low by the firmware, which I guess the
prime motivation of Benjamin to submit the patch."
So these messages do not usually indicate a hardware issue (e.g.
insufficient cooling). Log them as warnings to avoid confusion about
their severity.
[ bp: Massage commit mesage. ]
Signed-off-by: Benjamin Berg <bberg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Christian Kellner <ckellner@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191009155424.249277-1-bberg@redhat.com
LLVM's assembler doesn't accept the short form INL instruction:
inl (%%dx)
but instead insists on the output register to be explicitly specified:
<inline asm>:1:7: error: invalid operand for instruction
inl (%dx)
^
LLVM ERROR: Error parsing inline asm
Use the full form of the instruction to fix the build.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Hellstrom <thellstrom@vmware.com>
Cc: clang-built-linux@googlegroups.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: "VMware, Inc." <pv-drivers@vmware.com>
Cc: x86-ml <x86@kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/734
Link: https://lkml.kernel.org/r/20191007192129.104336-1-samitolvanen@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It turned out recently that on certain AMD F15h and F16h machines, due
to the BIOS dropping the ball after resume, yet again, RDRAND would not
function anymore:
c49a0a8013 ("x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h")
Add a silly test to the CPU bringup path, to sanity-check the random
data RDRAND returns and scream as loudly as possible if that returned
random data doesn't change.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/CAHk-=wjWPDauemCmLTKbdMYFB0UveMszZpcrwoUkJRRWKrqaTw@mail.gmail.com
... in order to not pollute dmesg with a line for each updated microcode
engine.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jon Grimm <Jon.Grimm@amd.com>
Cc: kanth.ghatraju@oracle.com
Cc: konrad.wilk@oracle.com
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: patrick.colp@oracle.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190824085341.GC16813@zn.tnic
Microcode update was changed to be serialized due to restrictions after
Spectre days. Updating serially on a large multi-socket system can be
painful since it is being done on one CPU at a time.
Cloud customers have expressed discontent as services disappear for
a prolonged time. The restriction is that only one core (or only one
thread of a core in the case of an SMT system) goes through the update
while other cores (or respectively, SMT threads) are quiesced.
Do the microcode update only on the first thread of each core while
other siblings simply wait for this to complete.
[ bp: Simplify, massage, cleanup comments. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jon Grimm <Jon.Grimm@amd.com>
Cc: kanth.ghatraju@oracle.com
Cc: konrad.wilk@oracle.com
Cc: patrick.colp@oracle.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1566506627-16536-2-git-send-email-mihai.carabas@oracle.com
All newer Zhaoxin CPUs support CMCI and are compatible with Intel's
Machine-Check Architecture. Add that support for Zhaoxin CPUs.
[ bp: Massage comments and export intel_init_cmci(). ]
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: CooperYan@zhaoxin.com
Cc: DavidWang@zhaoxin.com
Cc: HerryYang@zhaoxin.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: QiyuanWang@zhaoxin.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1568787573-1297-4-git-send-email-TonyWWang-oc@zhaoxin.com
The dummy variable is the high part of the microcode revision MSR which
is defined as reserved. Mark it unused so that W=1 builds don't trigger
the above warning.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/20190928162559.26294-1-bp@alien8.de
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index
E1H to determines the maximum time in TSC-quanta that the processor can
reside in either C0.1 or C0.2.
This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate
IA32_UMWAIT_CONTROL between host and guest. The variable
mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
Pull x86 apic updates from Thomas Gleixner:
- Cleanup the apic IPI implementation by removing duplicated code and
consolidating the functions into the APIC core.
- Implement a safe variant of the IPI broadcast mode. Contrary to
earlier attempts this uses the core tracking of which CPUs have been
brought online at least once so that a broadcast does not end up in
some dead end in BIOS/SMM code when the CPU is still waiting for
init. Once all CPUs have been brought up once, IPI broadcasting is
enabled. Before that regular one by one IPIs are issued.
- Drop the paravirt CR8 related functions as they have no user anymore
- Initialize the APIC TPR to block interrupt 16-31 as they are reserved
for CPU exceptions and should never be raised by any well behaving
device.
- Emit a warning when vector space exhaustion breaks the admin set
affinity of an interrupt.
- Make sure to use the NMI fallback when shutdown via reboot vector IPI
fails. The original code had conditions which prevent the code path
to be reached.
- Annotate various APIC config variables as RO after init.
[ The ipi broadcase change came in earlier through the cpu hotplug
branch, but I left the explanation in the commit message since it was
shared between the two different branches - Linus ]
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
x86/apic/vector: Warn when vector space exhaustion breaks affinity
x86/apic: Annotate global config variables as "read-only after init"
x86/apic/x2apic: Implement IPI shorthands support
x86/apic/flat64: Remove the IPI shorthand decision logic
x86/apic: Share common IPI helpers
x86/apic: Remove the shorthand decision logic
x86/smp: Enhance native_send_call_func_ipi()
x86/smp: Move smp_function_call implementations into IPI code
x86/apic: Provide and use helper for send_IPI_allbutself()
x86/apic: Add static key to Control IPI shorthands
x86/apic: Move no_ipi_broadcast() out of 32bit
x86/apic: Add NMI_VECTOR wait to IPI shorthand
x86/apic: Remove dest argument from __default_send_IPI_shortcut()
x86/hotplug: Silence APIC and NMI when CPU is dead
x86/cpu: Move arch_smt_update() to a neutral place
x86/apic/uv: Make x2apic_extra_bits static
x86/apic: Consolidate the apic local headers
x86/apic: Move apic_flat_64 header into apic directory
x86/apic: Move ipi header into apic directory
x86/apic: Cleanup the include maze
...
Pull x86 vmware updates from Ingo Molnar:
"This updates the VMWARE guest driver with support for VMCALL/VMMCALL
based hypercalls"
* 'x86-vmware-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
input/vmmouse: Update the backdoor call with support for new instructions
drm/vmwgfx: Update the backdoor call with support for new instructions
x86/vmware: Add a header file for hypercall definitions
x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls
Pull x86 cpu-feature updates from Ingo Molnar:
- Rework the Intel model names symbols/macros, which were decades of
ad-hoc extensions and added random noise. It's now a coherent, easy
to follow nomenclature.
- Add new Intel CPU model IDs:
- "Tiger Lake" desktop and mobile models
- "Elkhart Lake" model ID
- and the "Lightning Mountain" variant of Airmont, plus support code
- Add the new AVX512_VP2INTERSECT instruction to cpufeatures
- Remove Intel MPX user-visible APIs and the self-tests, because the
toolchain (gcc) is not supporting it going forward. This is the
first, lowest-risk phase of MPX removal.
- Remove X86_FEATURE_MFENCE_RDTSC
- Various smaller cleanups and fixes
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/cpu: Update init data for new Airmont CPU model
x86/cpu: Add new Airmont variant to Intel family
x86/cpu: Add Elkhart Lake to Intel family
x86/cpu: Add Tiger Lake to Intel family
x86: Correct misc typos
x86/intel: Add common OPTDIFFs
x86/intel: Aggregate microserver naming
x86/intel: Aggregate big core graphics naming
x86/intel: Aggregate big core mobile naming
x86/intel: Aggregate big core client naming
x86/cpufeature: Explain the macro duplication
x86/ftrace: Remove mcount() declaration
x86/PCI: Remove superfluous returns from void functions
x86/msr-index: Move AMD MSRs where they belong
x86/cpu: Use constant definitions for CPU models
lib: Remove redundant ftrace flag removal
x86/crash: Remove unnecessary comparison
x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
x86: Remove X86_FEATURE_MFENCE_RDTSC
x86/mpx: Remove MPX APIs
...
Pull scheduler updates from Ingo Molnar:
- MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.
As perf and the scheduler is getting bigger and more complex,
document the status quo of current responsibilities and interests,
and spread the review pain^H^H^H^H fun via an increase in the Cc:
linecount generated by scripts/get_maintainer.pl. :-)
- Add another series of patches that brings the -rt (PREEMPT_RT) tree
closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
into a new CONFIG_PREEMPTION category that will allow the eventual
introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
to go though.
- Extend the CPU cgroup controller with uclamp.min and uclamp.max to
allow the finer shaping of CPU bandwidth usage.
- Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
- Improve the behavior of high CPU count, high thread count
applications running under cpu.cfs_quota_us constraints.
- Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
- Improve CPU isolation housekeeping CPU allocation NUMA locality.
- Fix deadline scheduler bandwidth calculations and logic when cpusets
rebuilds the topology, or when it gets deadline-throttled while it's
being offlined.
- Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
setscheduler() system calls without creating global serialization.
Add new synchronization between cpuset topology-changing events and
the deadline acceptance tests in setscheduler(), which were broken
before.
- Rework the active_mm state machine to be less confusing and more
optimal.
- Rework (simplify) the pick_next_task() slowpath.
- Improve load-balancing on AMD EPYC systems.
- ... and misc cleanups, smaller fixes and improvements - please see
the Git log for more details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
sched/psi: Correct overly pessimistic size calculation
sched/fair: Speed-up energy-aware wake-ups
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
sched/uclamp: Update CPU's refcount on TG's clamp changes
sched/uclamp: Use TG's clamps to restrict TASK's clamps
sched/uclamp: Propagate system defaults to the root group
sched/uclamp: Propagate parent clamps
sched/uclamp: Extend CPU's cgroup controller
sched/topology: Improve load balancing on AMD EPYC systems
arch, ia64: Make NUMA select SMP
sched, perf: MAINTAINERS update, add submaintainers and reviewers
sched/fair: Use rq_lock/unlock in online_fair_sched_group
cpufreq: schedutil: fix equation in comment
sched: Rework pick_next_task() slow-path
sched: Allow put_prev_task() to drop rq->lock
sched/fair: Expose newidle_balance()
sched: Add task_struct pointer to sched_class::set_curr_task
sched: Rework CPU hotplug task selection
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Fix kerneldoc comment for ia64_set_curr_task
...
Pull RAS updates from Borislav Petkov:
"The latest meager RAS updates:
- Enable processing of action-optional MCEs which have the Overflow
bit set (Tony Luck)
- -Wmissing-prototypes warning fix and a build fix (Valdis
Klētnieks)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
RAS: Build debugfs.o only when enabled in Kconfig
RAS: Fix prototype warnings
x86/mce: Don't check for the overflow bit on action optional machine checks
- Remove the unneeded backslash at EOL: that's not a macro.
And let's please checkpatch by aligning to open parenthesis.
- For 0x4f descriptor, remove " */" from the info field.
- For 0xc2 descriptor, sync the beginning of info to match the tlb_type.
(The value of info fields could be made more regular, but it's unused by
the code and will be read only by developers, so don't bother.)
Signed-off-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Cc: Alex Shi <alex.shi@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Link: https://lkml.kernel.org/r/20190915090917.GA5086@lilas
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.
However, as is rather unfortunately explained in:
commit 32e45ff43e ("mm: increase RECLAIM_DISTANCE to 30")
the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.
Current AMD EPYC machines have the following NUMA node distances:
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
where 2 hops is 32.
The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.
For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,
$ numactl -C 0-7,32-39 ./spinner 16
causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
Override node_reclaim_distance for AMD Zen.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The new header is intended to be used by drivers using the backdoor.
Follow the KVM example using alternatives self-patching to choose
between vmcall, vmmcall and io instructions.
Also define two new CPU feature flags to indicate hypervisor support
for vmcall- and vmmcall instructions. The new XF86_FEATURE_VMW_VMMCALL
flag is needed because using XF86_FEATURE_VMMCALL might break QEMU/KVM
setups using the vmmouse driver. They rely on XF86_FEATURE_VMMCALL
on AMD to get the kvm_hypercall() right. But they do not yet implement
vmmcall for the VMware hypercall used by the vmmouse driver.
[ bp: reflow hypercall %edx usage explanation comment. ]
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Doug Covelli <dcovelli@vmware.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-graphics-maintainer@vmware.com
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Robert Hoo <robert.hu@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: <pv-drivers@vmware.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190828080353.12658-3-thomas_os@shipmail.org
hv_setup_sched_clock() references pv_ops which is only available when
CONFIG_PARAVIRT=Y.
Wrap it into a #ifdef
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190828080747.204419-1-Tianyu.Lan@microsoft.com
Currently big microservers have _XEON_D while small microservers have
_X, Make it uniformly: _D.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \
-e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org
Currently big core clients with extra graphics on have:
- _G
- _GT3E
Make it uniformly: _G
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_GT3E"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_GT3E/\1_G/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.622802314@infradead.org
Currently big core mobile chips have either:
- _L
- _ULT
- _MOBILE
Make it uniformly: _L.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(MOBILE\|ULT\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(MOBILE\|ULT\)/\1_L/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.568978530@infradead.org
Currently the big core client models either have:
- no OPTDIFF
- _CORE
- _DESKTOP
Make it uniformly: 'no OPTDIFF'.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(CORE\|DESKTOP\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(CORE\|DESKTOP\)/\1/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.513945586@infradead.org
Vmware has historically used an INL instruction for this, but recent
hardware versions support using VMCALL/VMMCALL instead, so use this
method if supported at platform detection time. Explicitly code separate
macro versions since the alternatives self-patching has not been
performed at platform detection time.
Also put tighter constraints on the assembly input parameters.
Co-developed-by: Doug Covelli <dcovelli@vmware.com>
Signed-off-by: Doug Covelli <dcovelli@vmware.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Doug Covelli <dcovelli@vmware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-graphics-maintainer@vmware.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: <pv-drivers@vmware.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190828080353.12658-2-thomas_os@shipmail.org