OpenCloudOS-Kernel/arch/x86
Linus Torvalds 7fef099702 x86/resctl: fix scheduler confusion with 'current'
The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even when a thread
is preempted, or moved to another CPU, or even across an explicit call
to 'schedule()', that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if you then look at the code, and you use 'current' after the
new value has been assigned to the percpu variable, we have explicitly
told the compiler that 'current' is magical and always stable.  So the
compiler is quite free to use an older (or newer) value of 'current',
and the actual assignment to the percpu storage is not relevant even if
it might look that way.
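
To make the stale read concrete, here is a minimal user-space sketch. It is
not kernel code: 'struct task', 'cpu_current_task', 'struct stable_view' and
both functions are hypothetical names invented for illustration. The explicit
cache stands in for what the compiler is permitted to do with a
'this_cpu_read_stable()' access, namely load once and reuse the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; none of these names are kernel APIs. */
struct task {
	int closid;
};

/* Stands in for the per-cpu slot that holds the current thread pointer. */
struct task *cpu_current_task;

/* Models a 'stable' read: the first load is remembered and reused. */
struct stable_view {
	struct task *cached;
};

struct task *stable_current(struct stable_view *v)
{
	if (!v->cached)
		v->cached = cpu_current_task;	/* first use: a real load */
	return v->cached;			/* later uses: the old value */
}

/*
 * Models a context switch that uses 'current' both before and after the
 * per-cpu slot is updated. Returns the closid seen by the second use.
 */
int closid_seen_after_switch(struct task *prev, struct task *next)
{
	struct stable_view view = { NULL };

	assert(prev != NULL && next != NULL);

	cpu_current_task = prev;
	(void)stable_current(&view);	/* early use of 'current' */
	cpu_current_task = next;	/* the scheduler updates the slot */

	return stable_current(&view)->closid;	/* still refers to prev */
}
```

In this model, calling 'closid_seen_after_switch()' with a 'prev' whose closid
is 1 and a 'next' whose closid is 2 yields 1, even though the slot already
points at 'next'. A real compiler may or may not merge the two reads, which is
why the outcome depends on compiler details.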

Which is exactly what happened in the resctrl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctrl state).  And clang would end up just using the old thread pointer
value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctrl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread's resctrl state is explicitly changed), and then we will just pass
in 'current' as that pointer, of course.  There is no ambiguity in that
case.
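
The shape of the fix can be sketched in user space as follows. This is only a
model under stated assumptions, not the actual patch: 'struct task',
'cpu_current_task', the 'programmed_*' variables and every '*_model' function
are hypothetical stand-ins. The point it illustrates is that the helper takes
the task pointer as an argument, so the scheduler path passes 'next' and the
non-scheduler path passes the current task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-task state that gets programmed. */
struct task {
	unsigned int closid;
	unsigned int rmid;
};

/* Stands in for the per-cpu current thread pointer. */
struct task *cpu_current_task;

/* Stand-ins for the hardware state written on a switch. */
unsigned int programmed_closid;
unsigned int programmed_rmid;

/* The helper never consults 'current'; the caller says which task it means. */
void resctrl_sched_in_model(const struct task *tsk)
{
	assert(tsk != NULL);
	programmed_closid = tsk->closid;
	programmed_rmid = tsk->rmid;
}

/* Scheduler context: there is a prev and a next, and next is passed along. */
void switch_to_model(struct task *prev, struct task *next)
{
	(void)prev;			/* unused in this sketch */
	cpu_current_task = next;
	resctrl_sched_in_model(next);
}

/* Outside the scheduler 'current' is unambiguous, so it is passed as-is. */
void set_own_closid_model(unsigned int closid)
{
	cpu_current_task->closid = closid;
	resctrl_sched_in_model(cpu_current_task);
}
```

With this arrangement the helper's result no longer depends on when, or
whether, the compiler re-reads the per-cpu pointer around the switch.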

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Stephane Eranian <eranian@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-08 11:48:11 -08:00
boot Kbuild updates for v6.3 2023-02-26 11:53:25 -08:00
coco - Fixup comment typo 2023-02-25 09:11:30 -08:00
configs x86/defconfig: Enable CONFIG_DEBUG_WX=y 2022-09-02 10:41:42 +02:00
crypto crypto: x86/aria-avx - Do not use avx2 instructions 2023-02-14 13:39:33 +08:00
entry Changes in this cycle were: 2023-03-02 09:45:34 -08:00
events ARM: 2023-02-25 11:30:21 -08:00
hyperv x86/hyperv: Remove unregister syscore call from Hyper-V cleanup 2022-11-29 17:55:29 +00:00
ia32 x86/signal/32: Merge native and compat 32-bit signal code 2022-10-19 09:58:49 +02:00
include x86/resctl: fix scheduler confusion with 'current' 2023-03-08 11:48:11 -08:00
kernel x86/resctl: fix scheduler confusion with 'current' 2023-03-08 11:48:11 -08:00
kvm ARM: 2023-02-25 11:30:21 -08:00
lib - Cache the AMD debug registers in per-CPU variables to avoid MSR writes 2023-02-21 14:51:40 -08:00
math-emu
mm - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
net bpf, x86: Simplify the parsing logic of structure parameters 2023-01-10 15:53:22 -08:00
pci x86/pci/xen: Fixup fallout from the PCI/MSI overhaul 2023-01-16 20:40:44 +01:00
platform A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
power - Add the call depth tracking mitigation for Retbleed which has 2022-12-14 15:03:00 -08:00
purgatory x86/purgatory: disable KMSAN instrumentation 2022-10-28 13:37:23 -07:00
ras
realmode x86/boot: Skip realmode init code when running as Xen PV guest 2022-11-25 12:05:22 +01:00
tools kbuild: allow to combine multiple V= levels 2023-01-22 23:43:32 +09:00
um This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
video
virt/vmx/tdx
xen xen: branch for v6.3-rc1 2023-02-21 17:07:39 -08:00
.gitignore x86/purgatory: Omit use of bin2c 2022-07-25 10:32:32 +02:00
Kbuild
Kconfig x86/Kconfig: Fix spellos & punctuation 2023-01-25 12:21:04 +01:00
Kconfig.assembler crypto: x86/aria-avx - fix build failure with old binutils 2023-01-20 18:29:31 +08:00
Kconfig.cpu
Kconfig.debug arch: make TRACE_IRQFLAGS_NMI_SUPPORT generic 2022-06-23 15:39:21 +01:00
Makefile x86/build: Make 64-bit defconfig the default 2023-02-15 14:20:17 +01:00
Makefile.um This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
Makefile_32.cpu