License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results of that were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally, Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
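For reference, the tag itself is a single comment on the first line of each file; as noted above, headers and .c source files need different comment styles, e.g.:

    /* SPDX-License-Identifier: GPL-2.0 */     <- first line of a header such as this one
    // SPDX-License-Identifier: GPL-2.0        <- first line of a .c source file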
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H

/*
 * Define 'struct task_struct' and provide the main scheduler
 * APIs (schedule(), wakeup variants, etc.)
 */

#include <uapi/linux/sched.h>

#include <asm/current.h>

#include <linux/pid.h>
#include <linux/sem.h>
shm: make exit_shm work proportional to task activity
This is small set of patches our team has had kicking around for a few
versions internally that fixes tasks getting hung on shm_exit when there
are many threads hammering it at once.
Anton wrote a simple test to cause the issue:
http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
Before applying this patchset, this test code will cause either hanging
tracebacks or pthread out of memory errors.
After this patchset, it will still produce output like:
root@somehost:~# ./bust_shm_exit 1024 160
...
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
INFO: Stall ended before state dump start
...
But the task will continue to run along happily, so we consider this an
improvement over hanging, even if it's a bit noisy.
This patch (of 3):
exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
walks every shared memory segment in the namespace. Thus the amount of
work is related to the number of shm segments in the namespace not the
number of segments that might need to be cleaned.
In addition, this occurs after the task has been notified the thread has
exited, so the number of tasks waiting for the ns shm rwsem can grow
without bound until memory is exhausted.
Add a list to the task struct of all shmids allocated by this task. Init
the list head in copy_process. Use the ns->rwsem for locking. Add
segments after id is added, remove before removing from id.
On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
to handling of semaphore undo.
I chose a define for the init sequence since it's a simple list init,
otherwise it would require a function call to avoid include loops between
the semaphore code and the task struct. Converting the list_del to
list_del_init for the unshare cases would remove the exit followed by
init, but I left it to blow up if it is not inited.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jack Miller <millerjo@us.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
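As a rough sketch of the idea (function and field names here are illustrative, not necessarily the exact ones used by the patch), exit_shm() then only walks the segments on the exiting task's own list, under the namespace rwsem:

    /* Illustrative sketch only: walk just this task's segments. */
    void exit_shm(struct task_struct *task)
    {
            struct ipc_namespace *ns = task->nsproxy->ipc_ns;
            struct shmid_kernel *shp, *tmp;

            if (list_empty(&task->sysvshm.shm_clist))
                    return;

            down_write(&shm_ids(ns).rwsem);
            list_for_each_entry_safe(shp, tmp, &task->sysvshm.shm_clist, shm_clist) {
                    list_del(&shp->shm_clist);
                    /* mark the segment orphaned, or destroy it if already unlinked */
            }
            up_write(&shm_ids(ns).rwsem);
    }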
#include <linux/shm.h>
#include <linux/kmsan_types.h>
#include <linux/mutex.h>
#include <linux/plist.h>
#include <linux/hrtimer.h>
#include <linux/irqflags.h>
#include <linux/seccomp.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/resource.h>
#include <linux/latencytop.h>
#include <linux/sched/prio.h>
#include <linux/sched/types.h>
#include <linux/signal_types.h>
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
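A minimal userspace sketch of the interface as described above (the prctl option and constant values follow this proposal; the merged uapi headers are authoritative, and the SIGSYS handler must flip the selector back before returning so that rt_sigreturn itself is not trapped):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SET_SYSCALL_USER_DISPATCH
    # define PR_SET_SYSCALL_USER_DISPATCH  59
    # define PR_SYS_DISPATCH_OFF           0
    # define PR_SYS_DISPATCH_ON            1
    #endif

    static volatile char selector;  /* 0 = let syscalls through, 1 = redirect to SIGSYS */

    static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
    {
            selector = 0;           /* allow again, so returning from the handler works */
    }

    int main(void)
    {
            struct sigaction sa = { .sa_sigaction = sigsys_handler, .sa_flags = SA_SIGINFO };

            sigaction(SIGSYS, &sa, NULL);

            /* No bypass region in this sketch (<off> = <length> = 0); the selector byte decides. */
            if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &selector))
                    return 1;

            selector = 1;           /* start trapping: a plain store, no kernel entry */
            getpid();               /* this raw syscall is delivered as SIGSYS instead */

            printf("syscall was redirected and handled\n");
            return 0;
    }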
#include <linux/syscall_user_dispatch.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
#include <linux/posix-timers.h>
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
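A minimal user-space sketch of the cpu_id fast path described above (assuming the uapi <linux/rseq.h> header is available and that nothing, e.g. a newer glibc, has already registered an rseq area for this thread; the signature value is arbitrary but must be used consistently):

    #include <linux/rseq.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define RSEQ_SIG   0x53053053   /* arbitrary signature, must match abort handlers */

    static __thread struct rseq rs __attribute__((aligned(32)));

    int main(void)
    {
            /* Register the per-thread area; the kernel keeps rs.cpu_id up to date. */
            if (syscall(__NR_rseq, &rs, sizeof(rs), 0, RSEQ_SIG)) {
                    perror("rseq");  /* e.g. EBUSY if the C library already registered one */
                    return 1;
            }

            /* A plain load now replaces a getcpu()/sched_getcpu() call. */
            printf("running on CPU %u\n", rs.cpu_id);
            return 0;
    }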
#include <linux/rseq.h>
#include <linux/seqlock.h>
#include <linux/kcsan.h>
rv: Add Runtime Verification (RV) interface
RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.
RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling the reaction for unexpected
events, avoiding, for example, the propagation of a failure on
safety-critical systems.
The development of this interface roots in the development of the
paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
And:
De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.
The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.
It presents these files:
"available_monitors"
- List the available monitors, one per line.
For example:
# cat available_monitors
wip
wwnr
"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.
For example:
# cat enabled_monitors
# echo wip > enabled_monitors
# echo wwnr >> enabled_monitors
# cat enabled_monitors
wip
wwnr
# echo '!wip' >> enabled_monitors
# cat enabled_monitors
wwnr
# echo > enabled_monitors
# cat enabled_monitors
#
Note that more than one monitor can be enabled concurrently.
"monitoring_on"
- It is an on/off general switcher for monitoring. Note
that it does not disable enabled monitors or detach events,
but stops the per-entity monitors from monitoring the events
received from the system. It resembles the "tracing_on" switcher.
"monitors/"
Each monitor will have its own directory inside "monitors/". There
the monitor specific files will be presented.
The "monitors/" directory resembles the "events" directory on
tracefs.
For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0
For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.
Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
#include <linux/rv.h>
#include <asm/kmap_size.h>

/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
struct backing_dev_info;
struct bio_list;
struct blk_plug;
struct bpf_local_storage;
bpf: Add ambient BPF runtime context stored in current
b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed the problem with cgroup-local storage use in BPF by
pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
possible BPF program preemptions and nested executions.
While this seems to work well in practice, it introduces a new and unnecessary
failure mode in which not all BPF programs might be executed if we fail to
find an unused slot for cgroup storage, however unlikely it is. It might also
not be so unlikely when/if we allow sleepable cgroup BPF programs in the
future.
Further, the way that cgroup storage is implemented as ambiently-available
property during entire BPF program execution is a convenient way to pass extra
information to BPF program and helpers without requiring user code to pass
around extra arguments explicitly. So it would be good to have a generic
solution that can allow implementing this without arbitrary restrictions.
Ideally, such solution would work for both preemptable and sleepable BPF
programs in exactly the same way.
This patch introduces such solution, bpf_run_ctx. It adds one pointer field
(bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
macros in such a way that it always stays valid throughout BPF program
execution. BPF program preemption is handled by remembering previous
current->bpf_ctx value locally while executing nested BPF program and
restoring old value after nested BPF program finishes. This is handled by two
helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
supposed to be used before and after BPF program runs, respectively.
Restoring old value of the pointer handles preemption, while bpf_run_ctx
pointer being a property of current task_struct naturally solves this problem
for sleepable BPF programs by "following" BPF program execution as it is
scheduled in and out of CPU. It would even allow CPU migration of BPF
programs, even though it's not currently allowed by BPF infra.
This patch cleans up cgroup local storage handling as a first application. The
design itself is generic, though, with bpf_run_ctx being an empty struct that
is supposed to be embedded into a specific struct for a given BPF program type
(bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
this mechanism for other uses within tracing BPF programs.
To verify that this change doesn't revert the fix to the original cgroup
storage issue, I ran the same repro as in the original report ([0]) and didn't
get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
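Stripped of the macro plumbing, the save/restore discipline around each (possibly nested) program invocation looks roughly like this sketch (simplified from the BPF_PROG_RUN-style helpers named above):

    struct bpf_cg_run_ctx run_ctx = {};
    struct bpf_run_ctx *old_run_ctx;

    old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);  /* publish ours via current->bpf_ctx */
    /* ... run the BPF program(s); helpers reach run_ctx through current->bpf_ctx ... */
    bpf_reset_run_ctx(old_run_ctx);                   /* restore whatever was interrupted, if anything */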
struct bpf_run_ctx;
struct capture_control;
struct cfs_rq;
struct fs_struct;
struct futex_pi_state;
struct io_context;
struct io_uring_task;
struct mempolicy;
struct nameidata;
struct nsproxy;
struct perf_event_context;
struct pid_namespace;
struct pipe_inode_info;
struct rcu_node;
struct reclaim_state;
struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
struct sched_param;
struct seq_file;
struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;

/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have two separate sets of flags: task->state
 * is about runnability, while task->exit_state are
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */

/* Used in tsk->state: */
#define TASK_RUNNING            0x00000000
#define TASK_INTERRUPTIBLE      0x00000001
#define TASK_UNINTERRUPTIBLE    0x00000002
#define __TASK_STOPPED          0x00000004
#define __TASK_TRACED           0x00000008
/* Used in tsk->exit_state: */
#define EXIT_DEAD               0x00000010
#define EXIT_ZOMBIE             0x00000020
#define EXIT_TRACE              (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_PARKED             0x00000040
#define TASK_DEAD               0x00000080
#define TASK_WAKEKILL           0x00000100
#define TASK_WAKING             0x00000200
#define TASK_NOLOAD             0x00000400
#define TASK_NEW                0x00000800
#define TASK_RTLOCK_WAIT        0x00001000
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count();
schedule();
freezer_count();
And either the task is blocked, or it lands in try_to_freeze()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before bringing SMP back, before userspace,
etc. However, due to the above-mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN
__TASK_STOPPED -> TASK_FROZEN
__TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
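In terms of usage, a freezable sleeper now opts in where it sets its state rather than bracketing the sleep with freezer_do_not_count()/freezer_count(); roughly (a sketch of the new idiom, cf. the freezable wait_event*() helpers):

    set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
    if (!CONDITION)
            schedule();             /* the freezer can park us here as TASK_FROZEN */
    __set_current_state(TASK_RUNNING);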
#define TASK_FREEZABLE          0x00002000
#define __TASK_FREEZABLE_UNSAFE (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
#define TASK_FROZEN             0x00008000
#define TASK_STATE_MAX          0x00010000

#define TASK_ANY                (TASK_STATE_MAX-1)
/*
 * DO NOT ADD ANY NEW USERS !
 */
#define TASK_FREEZABLE_UNSAFE   (TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)

/* Convenience macros for the sake of set_current_state: */
#define TASK_KILLABLE           (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED            (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED             __TASK_TRACED

#define TASK_IDLE               (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Convenience macros for the sake of wake_up(): */
#define TASK_NORMAL             (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* get_task_state(): */
#define TASK_REPORT             (TASK_RUNNING | TASK_INTERRUPTIBLE | \
                                 TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
                                 __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
                                 TASK_PARKED)

#define task_is_running(task)           (READ_ONCE((task)->__state) == TASK_RUNNING)

#define task_is_traced(task)            ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
#define task_is_stopped(task)           ((READ_ONCE(task->jobctl) & JOBCTL_STOPPED) != 0)
#define task_is_stopped_or_traced(task) ((READ_ONCE(task->jobctl) & (JOBCTL_STOPPED | JOBCTL_TRACED)) != 0)

/*
 * Special states are those that do not use the normal wait-loop pattern. See
 * the comment with set_special_state().
 */
#define is_special_task_state(state) \
        ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
        do { \
                WARN_ON_ONCE(is_special_task_state(state_value)); \
                current->task_state_change = _THIS_IP_; \
        } while (0)

# define debug_special_state_change(state_value) \
        do { \
                WARN_ON_ONCE(!is_special_task_state(state_value)); \
                current->task_state_change = _THIS_IP_; \
        } while (0)

# define debug_rtlock_wait_set_state() \
        do { \
                current->saved_state_change = current->task_state_change;\
                current->task_state_change = _THIS_IP_; \
        } while (0)

# define debug_rtlock_wait_restore_state() \
        do { \
                current->task_state_change = current->saved_state_change;\
        } while (0)

#else
# define debug_normal_state_change(cond)        do { } while (0)
# define debug_special_state_change(cond)       do { } while (0)
# define debug_rtlock_wait_set_state()          do { } while (0)
# define debug_rtlock_wait_restore_state()      do { } while (0)
#endif

/*
 * set_current_state() includes a barrier so that the write of current->state
 * is correctly serialised wrt the caller's subsequent test of whether to
 * actually sleep:
 *
 *   for (;;) {
 *      set_current_state(TASK_UNINTERRUPTIBLE);
 *      if (CONDITION)
 *         break;
 *
 *      schedule();
 *   }
 *   __set_current_state(TASK_RUNNING);
 *
 * If the caller does not need such serialisation (because, for instance, the
 * CONDITION test and condition change and wakeup are under the same lock) then
 * use __set_current_state().
 *
 * The above is typically ordered against the wakeup, which does:
 *
 *   CONDITION = 1;
 *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
 *
 * where wake_up_state()/try_to_wake_up() executes a full memory barrier before
 * accessing p->state.
 *
 * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
 * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
 * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
 *
 * However, with slightly different timing the wakeup TASK_RUNNING store can
 * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
 * a problem either because that will result in one extra go around the loop
 * and our @cond test will save the day.
 *
 * Also see the comments of try_to_wake_up().
 */
#define __set_current_state(state_value) \
        do { \
                debug_normal_state_change((state_value)); \
                WRITE_ONCE(current->__state, (state_value)); \
        } while (0)

#define set_current_state(state_value) \
        do { \
                debug_normal_state_change((state_value)); \
                smp_store_mb(current->__state, (state_value)); \
        } while (0)

/*
 * set_special_state() should be used for those states when the blocking task
 * can not use the regular condition based wait-loop. In that case we must
 * serialize against wakeups such that any possible in-flight TASK_RUNNING
 * stores will not collide with our state change.
 */
#define set_special_state(state_value) \
        do { \
                unsigned long flags; /* may shadow */ \
                \
                raw_spin_lock_irqsave(&current->pi_lock, flags); \
                debug_special_state_change((state_value)); \
                WRITE_ONCE(current->__state, (state_value)); \
                raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
        } while (0)

/*
 * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
 *
 * RT's spin/rwlock substitutions are state preserving. The state of the
 * task when blocking on the lock is saved in task_struct::saved_state and
 * restored after the lock has been acquired. These operations are
 * serialized by task_struct::pi_lock against try_to_wake_up(). Any non RT
 * lock related wakeups while the task is blocked on the lock are
 * redirected to operate on task_struct::saved_state to ensure that these
 * are not dropped. On restore task_struct::saved_state is set to
 * TASK_RUNNING so any wakeup attempt redirected to saved_state will fail.
 *
 * The lock operation looks like this:
 *
 *      current_save_and_set_rtlock_wait_state();
 *      for (;;) {
 *              if (try_lock())
 *                      break;
 *              raw_spin_unlock_irq(&lock->wait_lock);
 *              schedule_rtlock();
 *              raw_spin_lock_irq(&lock->wait_lock);
 *              set_current_state(TASK_RTLOCK_WAIT);
 *      }
 *      current_restore_rtlock_saved_state();
 */
#define current_save_and_set_rtlock_wait_state() \
        do { \
                lockdep_assert_irqs_disabled(); \
                raw_spin_lock(&current->pi_lock); \
                current->saved_state = current->__state; \
                debug_rtlock_wait_set_state(); \
                WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \
                raw_spin_unlock(&current->pi_lock); \
        } while (0);

#define current_restore_rtlock_saved_state() \
        do { \
                lockdep_assert_irqs_disabled(); \
                raw_spin_lock(&current->pi_lock); \
                debug_rtlock_wait_restore_state(); \
                WRITE_ONCE(current->__state, current->saved_state); \
                current->saved_state = TASK_RUNNING; \
                raw_spin_unlock(&current->pi_lock); \
        } while (0);

#define get_current_state()     READ_ONCE(current->__state)

/*
 * Define the task command name length as enum, then it can be visible to
 * BPF programs.
 */
enum {
        TASK_COMM_LEN = 16,
};

extern void scheduler_tick(void);

#define MAX_SCHEDULE_TIMEOUT            LONG_MAX

extern long schedule_timeout(long timeout);
extern long schedule_timeout_interruptible(long timeout);
extern long schedule_timeout_killable(long timeout);
extern long schedule_timeout_uninterruptible(long timeout);
extern long schedule_timeout_idle(long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
Preempting from IRQ-return means that the task has its PSTATE saved
on the stack, which will get restored when the task is resumed and does
the actual IRQ return.
However, enabling some CPU features requires modifying the PSTATE. This
means that, if a task was scheduled out during an IRQ-return before all
CPU features are enabled, the task might restore a PSTATE that does not
include the feature enablement changes once scheduled back in.
* Task 1:
PAN == 0 ---| |---------------
| |<- return from IRQ, PSTATE.PAN = 0
| <- IRQ |
+--------+ <- preempt() +--
^
|
reschedule Task 1, PSTATE.PAN == 1
* Init:
--------------------+------------------------
^
|
enable_cpu_features
set PSTATE.PAN on all CPUs
Worse than this, since PSTATE is untouched when task switching is done,
a task missing the new bits in PSTATE might affect another task, if both
do direct calls to schedule() (outside of IRQ/exception contexts).
Fix this by preventing preemption on IRQ-return until features are
enabled on all CPUs.
This way the only PSTATE values that are saved on the stack are from
synchronous exceptions. These are expected to be fatal this early, the
exception is BRK for WARN_ON(), but as this uses do_debug_exception()
which keeps IRQs masked, it shouldn't call schedule().
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[james: Replaced a really cool hack, with an even simpler static key in C.
expanded commit message with Julien's cover-letter ascii art]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
asmlinkage void preempt_schedule_irq(void);
#ifdef CONFIG_PREEMPT_RT
extern void schedule_rtlock(void);
#endif

extern int __must_check io_schedule_prepare(void);
extern void io_schedule_finish(int token);
sched: Prevent recursion in io_schedule()
io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.
Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.
This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().
Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.
So:
- use ->in_iowait to detect recursion. Set it earlier, and restore
it to the old value.
- move the call to "raw_rq" after the call to blk_flush_plug().
As this is some sort of per-cpu thing, we want some chance that
we are on the right CPU
- When io_schedule() is called recursively, use blk_schedule_flush_plug()
which cannot further recurse.
- as this makes io_schedule() a lot more complex and as io_schedule()
must match io_schedule_timeout(), put all the changes in io_schedule_timeout()
and make io_schedule() a simple wrapper for that.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
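The resulting shape, sketched against the prepare/finish helpers declared below (not the literal kernel body):

    long io_schedule_timeout(long timeout)
    {
            int token = io_schedule_prepare();      /* set ->in_iowait, flush the plug once */
            long ret = schedule_timeout(timeout);

            io_schedule_finish(token);              /* restore the previous ->in_iowait */
            return ret;
    }

    void io_schedule(void)
    {
            io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
    }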
extern long io_schedule_timeout(long timeout);
extern void io_schedule(void);
/**
 * struct prev_cputime - snapshot of system and user cputime
 * @utime: time spent in user mode
 * @stime: time spent in system mode
 * @lock:  protects the above two fields
 *
 * Stores previous user/system time values such that we can guarantee
 * monotonicity.
 */
struct prev_cputime {
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
        u64                     utime;
        u64                     stime;
        raw_spinlock_t          lock;
#endif
};

enum vtime_state {
        /* Task is sleeping or running in a CPU with VTIME inactive: */
        VTIME_INACTIVE = 0,
        /* Task is idle */
        VTIME_IDLE,
        /* Task runs in kernelspace in a CPU with VTIME active: */
        VTIME_SYS,
        /* Task runs in userspace in a CPU with VTIME active: */
        VTIME_USER,
        /* Task runs as guests in a CPU with VTIME active: */
        VTIME_GUEST,
};

struct vtime {
        seqcount_t              seqcount;
        unsigned long long      starttime;
        enum vtime_state        state;
        unsigned int            cpu;
        u64                     utime;
        u64                     stime;
        u64                     gtime;
};

/*
 * Utilization clamp constraints.
 * @UCLAMP_MIN: Minimum utilization
 * @UCLAMP_MAX: Maximum utilization
 * @UCLAMP_CNT: Utilization clamp constraints count
 */
enum uclamp_id {
        UCLAMP_MIN = 0,
        UCLAMP_MAX,
        UCLAMP_CNT
};

#ifdef CONFIG_SMP
extern struct root_domain def_root_domain;
extern struct mutex sched_domains_mutex;
#endif

struct sched_info {
#ifdef CONFIG_SCHED_INFO
        /* Cumulative counters: */

        /* # of times we have run on this CPU: */
        unsigned long           pcount;

        /* Time spent waiting on a runqueue: */
        unsigned long long      run_delay;

        /* Timestamps: */

        /* When did we last run on a CPU? */
        unsigned long long      last_arrival;

        /* When were we last queued to run? */
        unsigned long long      last_queued;

#endif /* CONFIG_SCHED_INFO */
};
sched/fair: Generalize the load/util averages resolution definition
Integer metric needs fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.
In order to avoid the errors relating to the fixed point range, we
define a basic fixed point range, and then formalize all metrics based
on that basic range.
The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to have larger range.
Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
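With a shift of 10 the scale is 1024, so, as a worked example, 50% utilization is stored as 512, and a product of two scaled values needs one corrective shift (using the macros defined just below):

    unsigned long half = SCHED_FIXEDPOINT_SCALE / 2;                    /* 50% == 512 */
    unsigned long prod = (half * half) >> SCHED_FIXEDPOINT_SHIFT;       /* 25% == 256 */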
/*
 * Integer metrics need fixed point arithmetic, e.g., sched/fair
 * has a few: load, load_avg, util_avg, freq, and capacity.
 *
 * We define a basic fixed point arithmetic range, and then formalize
 * all these metrics based on that basic range.
 */
# define SCHED_FIXEDPOINT_SHIFT         10
# define SCHED_FIXEDPOINT_SCALE         (1L << SCHED_FIXEDPOINT_SHIFT)
/* Increase resolution of cpu_capacity calculations */
# define SCHED_CAPACITY_SHIFT           SCHED_FIXEDPOINT_SHIFT
# define SCHED_CAPACITY_SCALE           (1L << SCHED_CAPACITY_SHIFT)

struct load_weight {
        unsigned long           weight;
        u32                     inv_weight;
};
sched/fair: Add util_est on top of PELT
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued in a
CPU, this can result in a CPU running a big task as being temporarily
represented as being almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
This allows us to remember how big a task has been, as reported by PELT in its
previous activations, via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.
If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking at the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
where:
cfs_rq::util_est::enqueued = sum(_task_util_est(task))
for each RUNNABLE task on that root cfs_rq
It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support task placement decisions
- root cfs_rqs: to better support both task placement decisions and
frequency selection
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 17:52:42 +08:00
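A plain-C sketch of that max() composition (illustrative only; the function and parameter names here are made up, not the kernel's):

/* util_est(task) = max(task::util_avg, f(task::util_avg@dequeue)),
 * where f() is tracked as the max of the last dequeue sample and its EWMA. */
static unsigned int task_util_est(unsigned int util_avg,
                                  unsigned int util_est_enqueued,
                                  unsigned int util_est_ewma)
{
        unsigned int est = util_est_enqueued > util_est_ewma ?
                           util_est_enqueued : util_est_ewma;

        return util_avg > est ? util_avg : est;
}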
|
|
|
/**
|
|
|
|
* struct util_est - Estimation utilization of FAIR tasks
|
|
|
|
* @enqueued: instantaneous estimated utilization of a task/cpu
|
|
|
|
* @ewma: the Exponential Weighted Moving Average (EWMA)
|
|
|
|
* utilization of a task
|
|
|
|
*
|
|
|
|
* Support data structure to track an Exponential Weighted Moving Average
|
|
|
|
* (EWMA) of a FAIR task's utilization. New samples are added to the moving
|
|
|
|
* average each time a task completes an activation. Sample's weight is chosen
|
|
|
|
* so that the EWMA will be relatively insensitive to transient changes to the
|
|
|
|
* task's workload.
|
|
|
|
*
|
|
|
|
* The enqueued attribute has a slightly different meaning for tasks and cpus:
|
|
|
|
* - task: the task's util_avg at last task dequeue time
|
|
|
|
* - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
|
|
|
|
* Thus, the util_est.enqueued of a task represents the contribution on the
|
|
|
|
* estimated utilization of the CPU where that task is currently enqueued.
|
|
|
|
*
|
|
|
|
* Only for tasks we track a moving average of the past instantaneous
|
|
|
|
* estimated utilization. This allows us to absorb sporadic drops in utilization
|
|
|
|
* of an otherwise almost periodic task.
|
2021-06-02 22:58:08 +08:00
|
|
|
*
|
|
|
|
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
|
|
|
|
* updates. When a task is dequeued, its util_est should not be updated if its
|
|
|
|
* util_avg has not been updated in the meantime.
|
|
|
|
* This information is mapped into the MSB bit of util_est.enqueued at dequeue
|
|
|
|
* time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
|
|
|
|
* for a task) it is safe to use MSB.
|
2018-03-09 17:52:42 +08:00
|
|
|
*/
|
|
|
|
struct util_est {
|
|
|
|
unsigned int enqueued;
|
|
|
|
unsigned int ewma;
|
|
|
|
#define UTIL_EST_WEIGHT_SHIFT 2
|
2021-06-02 22:58:08 +08:00
|
|
|
#define UTIL_AVG_UNCHANGED 0x80000000
|
2018-04-05 16:05:21 +08:00
|
|
|
} __attribute__((__aligned__(sizeof(u64))));
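To make the EWMA weighting and the MSB flag concrete, here is a simplified user-space sketch of a dequeue-time update; it approximates the idea described above and is not the kernel's exact update path (which, among other things, skips the update entirely while the flag is still set):

#include <stdint.h>

#define UTIL_EST_WEIGHT_SHIFT  2            /* new sample weighs 1/4 */
#define UTIL_AVG_UNCHANGED     0x80000000u  /* MSB flag, see comment above */

struct util_est_sketch {
        uint32_t enqueued;
        uint32_t ewma;
};

static void util_est_dequeue_update(struct util_est_sketch *ue, uint32_t util_avg)
{
        uint32_t sample = util_avg & ~UTIL_AVG_UNCHANGED;  /* strip the flag bit */

        /* ewma = 3/4 * ewma + 1/4 * sample */
        ue->ewma = ue->ewma - (ue->ewma >> UTIL_EST_WEIGHT_SHIFT)
                            + (sample >> UTIL_EST_WEIGHT_SHIFT);

        /* Remember the last dequeue sample and mark it as in sync with
         * util_avg; the flag is cleared again once util_avg changes. */
        ue->enqueued = sample | UTIL_AVG_UNCHANGED;
}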
|
2018-03-09 17:52:42 +08:00
|
|
|
|
sched/fair: Rewrite runnable load and utilization average tracking
The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:
1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
updated at the granularity of one entity at a time, which results in the
cfs_rq's load average being stale or partially updated: at any time, only
one entity is up to date, while all other entities are effectively lagging
behind. This is undesirable.
To illustrate, if we have n runnable entities in the cfs_rq, as time
elapses, they certainly become outdated:
t0: cfs_rq { e1_old, e2_old, ..., en_old }
and when we update:
t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
...
We solve this by combining all runnable entities' load averages together
in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
on the fact that if we regard the update as a function, then:
w * update(e) = update(w * e) and
update(e1) + update(e2) = update(e1 + e2), then
w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
therefore, by this rewrite, we have an entirely updated cfs_rq at the
time we update it:
t1: update cfs_rq { e1_new, e2_new, ..., en_new }
t2: update cfs_rq { e1_new, e2_new, ..., en_new }
...
2. cfs_rq's load average is different between top rq->cfs_rq and other
task_group's per CPU cfs_rqs in whether or not blocked_load_average
contributes to the load.
The basic idea behind runnable load average (the same for utilization)
is that the blocked state is taken into account as opposed to only
accounting for the currently runnable state. Therefore, the average
should include both the runnable/running and blocked load averages.
This rewrite does that.
In addition, we also combine runnable/running and blocked averages
of all entities into the cfs_rq's average, and update it together at
once. This is based on the fact that:
update(runnable) + update(blocked) = update(runnable + blocked)
This significantly reduces the code as we don't need to separately
maintain/update runnable/running load and blocked load.
3. How task_group entities' share is calculated is complex and imprecise.
We reduce the complexity in this rewrite to allow a very simple rule:
the task_group's load_avg is aggregated from its per CPU cfs_rqs's
load_avgs. Then group entity's weight is simply proportional to its
own cfs_rq's load_avg / task_group's load_avg. To illustrate,
if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.
As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-15 08:04:37 +08:00
|
|
|
/*
|
2020-02-24 17:52:18 +08:00
|
|
|
* The load/runnable/util_avg accumulates an infinite geometric series
|
2020-02-24 17:52:17 +08:00
|
|
|
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
|
2016-04-05 12:12:28 +08:00
|
|
|
*
|
|
|
|
* [load_avg definition]
|
|
|
|
*
|
|
|
|
* load_avg = runnable% * scale_load_down(load)
|
|
|
|
*
|
2020-02-24 17:52:18 +08:00
|
|
|
* [runnable_avg definition]
|
|
|
|
*
|
|
|
|
* runnable_avg = runnable% * SCHED_CAPACITY_SCALE
|
2016-04-05 12:12:28 +08:00
|
|
|
*
|
|
|
|
* [util_avg definition]
|
|
|
|
*
|
|
|
|
* util_avg = running% * SCHED_CAPACITY_SCALE
|
|
|
|
*
|
2020-02-24 17:52:18 +08:00
|
|
|
* where runnable% is the time ratio that a sched_entity is runnable and
|
|
|
|
* running% the time ratio that a sched_entity is running.
|
|
|
|
*
|
|
|
|
* For cfs_rq, they are the aggregated values of all runnable and blocked
|
|
|
|
* sched_entities.
|
2016-04-05 12:12:28 +08:00
|
|
|
*
|
2020-07-27 21:39:51 +08:00
|
|
|
* The load/runnable/util_avg doesn't directly factor frequency scaling and CPU
|
2020-02-24 17:52:18 +08:00
|
|
|
* capacity scaling. The scaling is done through the rq_clock_pelt that is used
|
|
|
|
* for computing those signals (see update_rq_clock_pelt())
|
2016-04-05 12:12:28 +08:00
|
|
|
*
|
sched/fair: Update scale invariance of PELT
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by the current capacity of the CPU. Another one is that the
load_avg is not invariant because it is not scaled with uarch.
The util_avg of a periodic task that runs r time slots every p time slots
varies in the range:
U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
where U is the max util_avg value = SCHED_CAPACITY_SCALE.
At a lower capacity, the range becomes:
U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
where C reflects the compute capacity ratio between the current capacity and
the max capacity.
So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.
In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.
In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also makes it possible to simplify the PELT algorithm by
removing all references to uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per clock update (update_rq_clock_task()) instead of during
each update of the sched_entities and cfs/rt/dl_rq of the rq as in the current
implementation. This is interesting when cgroups are involved, as shown in
the results below:
On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.
each test runs 16 times:
./perf bench sched pipe
(higher is better)
kernel tip/sched/core + patch
ops/seconds ops/seconds diff
cgroup
root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38%
level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57%
level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86%
hackbench -l 1000
(lower is better)
kernel tip/sched/core + patch
duration(sec) duration(sec) diff
cgroup
root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57%
level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60%
level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66%
Then, the responsiveness of PELT is improved when the CPU is not running at max
capacity with this new algorithm. I have put below some examples of the
duration needed to reach some typical load values according to the capacity of the
CPU with the current implementation and with this patch. These values have been
computed based on the geometric series and the half period value:
Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms not reachable 276ms
486 (47.5%) 30ms 138ms 60ms
256 (25%) 13ms 32ms 26ms
On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-23 23:26:53 +08:00
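The max-capacity column of the table above can be reproduced from the geometric series alone. A small sketch, assuming PELT's 1 ms accounting period and 32 ms half-life (y^32 = 1/2); compile with -lm:

#include <math.h>
#include <stdio.h>

/* For an always-running task, util(t) grows as 1 - y^t (t in ms), so the
 * time to reach a target fraction is t = log(1 - target) / log(y). */
int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);
        const double targets[] = { 0.95, 0.475, 0.25 };

        for (unsigned i = 0; i < sizeof(targets) / sizeof(targets[0]); i++) {
                double t = log(1.0 - targets[i]) / log(y);
                printf("%.1f%% of max utilization after ~%.0f ms\n",
                       targets[i] * 100.0, t);
        }
        return 0;       /* prints ~138 ms, ~30 ms, ~13 ms: the max-capacity column */
}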
|
|
|
* N.B., the above ratios (runnable% and running%) themselves are in the
|
|
|
|
 * range of [0, 1]. To do fixed point arithmetic, we therefore scale them
|
|
|
|
* to as large a range as necessary. This is for example reflected by
|
|
|
|
* util_avg's SCHED_CAPACITY_SCALE.
|
2016-04-05 12:12:28 +08:00
|
|
|
*
|
|
|
|
* [Overflow issue]
|
|
|
|
*
|
|
|
|
* The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
|
|
|
|
* with the highest load (=88761), always runnable on a single cfs_rq,
|
|
|
|
* and should not overflow as the number already hits PID_MAX_LIMIT.
|
|
|
|
*
|
|
|
|
* For all other cases (including 32-bit kernels), struct load_weight's
|
|
|
|
* weight will overflow first before we do, because:
|
|
|
|
*
|
|
|
|
* Max(load_avg) <= Max(load.weight)
|
|
|
|
*
|
|
|
|
* Then it is the load_weight's responsibility to consider overflow
|
|
|
|
* issues.
|
2015-07-15 08:04:37 +08:00
|
|
|
*/
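A quick back-of-the-envelope check of the overflow bound quoted above; the two constants are assumptions of this sketch (the maximum PELT sum and the load weight of a nice -20 task):

#include <stdio.h>

int main(void)
{
        const unsigned long long pelt_max_sum = 47742;  /* max PELT sum */
        const unsigned long long max_weight   = 88761;  /* nice -20 weight */

        /* Number of always-runnable, maximum-weight entities needed on one
         * cfs_rq before a 64-bit load_sum could overflow. */
        unsigned long long entities = (~0ULL) / pelt_max_sum / max_weight;

        /* Prints ~4353082796, far above PID_MAX_LIMIT (4194304 on 64-bit),
         * so the overflow cannot be reached in practice. */
        printf("entities before overflow: %llu\n", entities);
        return 0;
}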
|
2012-10-04 19:18:29 +08:00
|
|
|
struct sched_avg {
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 last_update_time;
|
|
|
|
u64 load_sum;
|
2020-02-24 17:52:18 +08:00
|
|
|
u64 runnable_sum;
|
2017-02-07 05:06:35 +08:00
|
|
|
u32 util_sum;
|
|
|
|
u32 period_contrib;
|
|
|
|
unsigned long load_avg;
|
2020-02-24 17:52:18 +08:00
|
|
|
unsigned long runnable_avg;
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long util_avg;
|
2018-03-09 17:52:42 +08:00
|
|
|
struct util_est util_est;
|
2018-04-05 16:05:21 +08:00
|
|
|
} ____cacheline_aligned;
|
2012-10-04 19:18:29 +08:00
|
|
|
|
2010-03-11 10:37:45 +08:00
|
|
|
struct sched_statistics {
|
2017-02-06 18:44:12 +08:00
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 wait_start;
|
|
|
|
u64 wait_max;
|
|
|
|
u64 wait_count;
|
|
|
|
u64 wait_sum;
|
|
|
|
u64 iowait_count;
|
|
|
|
u64 iowait_sum;
|
|
|
|
|
|
|
|
u64 sleep_start;
|
|
|
|
u64 sleep_max;
|
|
|
|
s64 sum_sleep_runtime;
|
|
|
|
|
|
|
|
u64 block_start;
|
|
|
|
u64 block_max;
|
2021-09-05 22:35:43 +08:00
|
|
|
s64 sum_block_runtime;
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 exec_max;
|
|
|
|
u64 slice_max;
|
|
|
|
|
|
|
|
u64 nr_migrations_cold;
|
|
|
|
u64 nr_failed_migrations_affine;
|
|
|
|
u64 nr_failed_migrations_running;
|
|
|
|
u64 nr_failed_migrations_hot;
|
|
|
|
u64 nr_forced_migrations;
|
|
|
|
|
|
|
|
u64 nr_wakeups;
|
|
|
|
u64 nr_wakeups_sync;
|
|
|
|
u64 nr_wakeups_migrate;
|
|
|
|
u64 nr_wakeups_local;
|
|
|
|
u64 nr_wakeups_remote;
|
|
|
|
u64 nr_wakeups_affine;
|
|
|
|
u64 nr_wakeups_affine_attempts;
|
|
|
|
u64 nr_wakeups_passive;
|
|
|
|
u64 nr_wakeups_idle;
|
2021-10-19 04:34:28 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_SCHED_CORE
|
|
|
|
u64 core_forceidle_sum;
|
2010-03-11 10:37:45 +08:00
|
|
|
#endif
|
2021-10-19 04:34:28 +08:00
|
|
|
#endif /* CONFIG_SCHEDSTATS */
|
sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we
should make it independent of the fair sched class. The struct sched_statistics
holds the scheduler statistics of a task_struct or a task_group. So we can
move it into struct task_struct and struct task_group to achieve the goal.
After the patch, schedstats are organized as follows:
struct task_struct {
...
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
...
struct sched_statistics stats;
...
};
Regarding the task group, schedstats is only supported for fair group
sched, and a new struct sched_entity_stats is introduced, suggested by
Peter -
struct sched_entity_stats {
struct sched_entity se;
struct sched_statistics stats;
} __no_randomize_layout;
Then with the se in a task_group, we can easily get the stats.
The sched_statistics members may be modified frequently when schedstats is
enabled; in order to avoid impacting unrelated data which may sit in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.
As this patch changes a core struct of the scheduler, I verified the
performance impact on the scheduler with 'perf bench sched
pipe', as suggested by Mel. Below is the result, in which all the values
are in usecs/op.
Before After
kernel.sched_schedstats=0 5.2~5.4 5.2~5.4
kernel.sched_schedstats=1 5.3~5.5 5.3~5.5
[These data differ a little from the earlier version; that is
because my old test machine was destroyed, so I had to use a
different test machine.]
Almost no impact on the sched performance.
No functional change.
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-09-05 22:35:41 +08:00
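A user-space flavoured sketch of how the stats can be recovered from a group entity once the two structs live side by side (stub struct bodies and a local container_of(); the kernel's helper differs in detail):

#include <stddef.h>

struct sched_statistics_sketch { unsigned long long wait_sum; /* ... */ };
struct sched_entity_sketch     { unsigned long long vruntime; /* ... */ };

struct sched_entity_stats_sketch {
        struct sched_entity_sketch     se;
        struct sched_statistics_sketch stats;
};

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Given a group entity embedded in the stats wrapper, find its stats. */
static inline struct sched_statistics_sketch *
stats_of(struct sched_entity_sketch *se)
{
        return &container_of(se, struct sched_entity_stats_sketch, se)->stats;
}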
|
|
|
} ____cacheline_aligned;
|
2010-03-11 10:37:45 +08:00
|
|
|
|
|
|
|
struct sched_entity {
|
2017-02-07 05:06:35 +08:00
|
|
|
/* For load-balancing: */
|
|
|
|
struct load_weight load;
|
|
|
|
struct rb_node run_node;
|
|
|
|
struct list_head group_node;
|
|
|
|
unsigned int on_rq;
|
2010-03-11 10:37:45 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 exec_start;
|
|
|
|
u64 sum_exec_runtime;
|
|
|
|
u64 vruntime;
|
|
|
|
u64 prev_sum_exec_runtime;
|
2010-03-11 10:37:45 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 nr_migrations;
|
2010-03-11 10:37:45 +08:00
|
|
|
|
2007-07-10 00:51:58 +08:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2017-02-07 05:06:35 +08:00
|
|
|
int depth;
|
|
|
|
struct sched_entity *parent;
|
2007-07-10 00:51:58 +08:00
|
|
|
/* rq on which this entity is (to be) queued: */
|
2017-02-07 05:06:35 +08:00
|
|
|
struct cfs_rq *cfs_rq;
|
2007-07-10 00:51:58 +08:00
|
|
|
/* rq "owned" by this entity/group: */
|
2017-02-07 05:06:35 +08:00
|
|
|
struct cfs_rq *my_q;
|
2020-02-24 17:52:18 +08:00
|
|
|
/* cached value of my_q->h_nr_running */
|
|
|
|
unsigned long runnable_weight;
|
2007-07-10 00:51:58 +08:00
|
|
|
#endif
|
2013-02-07 23:47:07 +08:00
|
|
|
|
2013-06-26 13:05:39 +08:00
|
|
|
#ifdef CONFIG_SMP
|
sched/core: Move sched_entity::avg into separate cache line
The sched_entity::avg collides with read-mostly sched_entity data.
The perf c2c tool showed many read HITM accesses across
many CPUs for sched_entity's cfs_rq and my_q, while having
at the same time tons of stores for avg.
After placing sched_entity::avg into separate cache line,
the perf bench sched pipe showed around 20 seconds speedup.
NOTE I cut out all perf events except for cycles and
instructions from following output.
Before:
$ perf stat -r 5 perf bench sched pipe -l 10000000
# Running 'sched/pipe' benchmark:
# Executed 10000000 pipe operations between two processes
Total time: 270.348 [sec]
27.034805 usecs/op
36989 ops/sec
...
245,537,074,035 cycles # 1.433 GHz
187,264,548,519 instructions # 0.77 insns per cycle
272.653840535 seconds time elapsed ( +- 1.31% )
After:
$ perf stat -r 5 perf bench sched pipe -l 10000000
# Running 'sched/pipe' benchmark:
# Executed 10000000 pipe operations between two processes
Total time: 251.076 [sec]
25.107678 usecs/op
39828 ops/sec
...
244,573,513,928 cycles # 1.572 GHz
187,409,641,157 instructions # 0.76 insns per cycle
251.679315188 seconds time elapsed ( +- 0.31% )
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449606239-28602-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-09 04:23:59 +08:00
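The layout trick itself is generic: push the write-hot member onto its own cache line so stores to it do not invalidate the read-mostly line. A minimal sketch; the 64-byte line size is an assumption (the kernel uses its per-arch L1_CACHE_BYTES):

#include <stdint.h>

#define CACHELINE 64  /* assumed L1 line size */

struct entity_sketch {
        /* read-mostly fields, shared across CPUs without bouncing */
        uint64_t weight;
        void    *cfs_rq;
        void    *my_q;

        /* write-hot averages, pushed onto their own cache line so frequent
         * stores here do not cause false sharing with the fields above */
        struct {
                uint64_t load_sum;
                uint64_t util_sum;
        } avg __attribute__((aligned(CACHELINE)));
};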
|
|
|
/*
|
|
|
|
* Per entity load average tracking.
|
|
|
|
*
|
|
|
|
* Put into separate cache line so it does not
|
|
|
|
* collide with read-mostly values above.
|
|
|
|
*/
|
2018-04-05 16:05:21 +08:00
|
|
|
struct sched_avg avg;
|
2012-10-04 19:18:29 +08:00
|
|
|
#endif
|
2007-07-10 00:51:58 +08:00
|
|
|
};
|
2006-07-03 15:25:42 +08:00
|
|
|
|
2008-01-26 04:08:27 +08:00
|
|
|
struct sched_rt_entity {
|
2017-02-07 05:06:35 +08:00
|
|
|
struct list_head run_list;
|
|
|
|
unsigned long timeout;
|
|
|
|
unsigned long watchdog_stamp;
|
|
|
|
unsigned int time_slice;
|
|
|
|
unsigned short on_rq;
|
|
|
|
unsigned short on_list;
|
|
|
|
|
|
|
|
struct sched_rt_entity *back;
|
2008-02-13 22:45:40 +08:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2017-02-07 05:06:35 +08:00
|
|
|
struct sched_rt_entity *parent;
|
2008-01-26 04:08:30 +08:00
|
|
|
/* rq on which this entity is (to be) queued: */
|
2017-02-07 05:06:35 +08:00
|
|
|
struct rt_rq *rt_rq;
|
2008-01-26 04:08:30 +08:00
|
|
|
/* rq "owned" by this entity/group: */
|
2017-02-07 05:06:35 +08:00
|
|
|
struct rt_rq *my_q;
|
2008-01-26 04:08:30 +08:00
|
|
|
#endif
|
2016-10-28 16:22:25 +08:00
|
|
|
} __randomize_layout;
|
2008-01-26 04:08:27 +08:00
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, even tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 18:14:43 +08:00
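From user space, such a task would be admitted through the sched_setattr() syscall. A hedged sketch follows; the struct layout and policy number are taken from the sched_setattr(2) man page and should be treated as assumptions of this example (glibc provides no wrapper, hence the raw syscall):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr_sketch {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        /* SCHED_DEADLINE parameters, in nanoseconds */
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr_sketch attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  = 10 * 1000 * 1000;   /* 10 ms */
        attr.sched_deadline = 30 * 1000 * 1000;   /* 30 ms */
        attr.sched_period   = 30 * 1000 * 1000;   /* 30 ms */

        /* SYS_sched_setattr is assumed to be exposed by <sys/syscall.h>. */
        if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0) {
                perror("sched_setattr");
                return 1;
        }
        /* ... periodic work loop would go here ... */
        return 0;
}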
|
|
|
struct sched_dl_entity {
|
2017-02-07 05:06:35 +08:00
|
|
|
struct rb_node rb_node;
|
2013-11-28 18:14:43 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Original scheduling parameters. Copied here from sched_attr
|
2014-05-09 11:21:27 +08:00
|
|
|
* during sched_setattr(), they will remain the same until
|
|
|
|
* the next sched_setattr().
|
2013-11-28 18:14:43 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 dl_runtime; /* Maximum runtime for each instance */
|
|
|
|
u64 dl_deadline; /* Relative deadline of each instance */
|
|
|
|
u64 dl_period; /* Separation of two instances (period) */
|
2017-05-29 22:24:02 +08:00
|
|
|
u64 dl_bw; /* dl_runtime / dl_period */
|
sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.
One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.
However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:
runtime = 5 ms
deadline = 7 ms
[density] = 5 / 7 = 0.71
period = 1000 ms
If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:
remaining runtime = 4
laxity = 5
presenting an absolute density of 4 / 5 = 0.80.
In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.
The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.
Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.
A revised version of the idea is:
At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:
runtime / (deadline - t) <= dl_runtime / dl_deadline
Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus making it possible to define the max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:
runtime = (dl_runtime / dl_deadline) * (deadline - t)
For instance, in our previous example, the task could still run:
runtime = ( 5 / 7 ) * 5
runtime = 3.57 ms
Without causing damage to other deadline tasks. It is noteworthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:
df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
Which throttles a constrained deadline task activated after the
deadline.
Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.
Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-29 22:24:03 +08:00
|
|
|
u64 dl_density; /* dl_runtime / dl_deadline */
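A numeric sketch of that revised rule, using the changelog's 5 ms / 7 ms example and an assumed 20-bit fixed point for the density (the kernel's exact constants may differ):

#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20

/* Fixed-point runtime/period ratio. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
        return (runtime << BW_SHIFT) / period;
}

int main(void)
{
        uint64_t dl_runtime  = 5000000;   /* 5 ms */
        uint64_t dl_deadline = 7000000;   /* 7 ms */
        uint64_t laxity      = 5000000;   /* deadline - t = 5 ms */

        uint64_t dl_density = to_ratio(dl_deadline, dl_runtime);
        /* runtime = (dl_runtime / dl_deadline) * (deadline - t) */
        uint64_t runtime    = (dl_density * laxity) >> BW_SHIFT;

        /* ~3.57 ms, matching the worked example in the changelog above */
        printf("revised remaining runtime: %llu ns\n",
               (unsigned long long)runtime);
        return 0;
}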
|
2013-11-28 18:14:43 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Actual scheduling parameters. Initialized with the values above,
|
2018-12-03 17:05:56 +08:00
|
|
|
* they are continuously updated during task execution. Note that
|
2013-11-28 18:14:43 +08:00
|
|
|
* the remaining runtime could be < 0 in case we are in overrun.
|
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
s64 runtime; /* Remaining runtime for this instance */
|
|
|
|
u64 deadline; /* Absolute deadline for this instance */
|
|
|
|
unsigned int flags; /* Specifying the scheduler behaviour */
|
2013-11-28 18:14:43 +08:00
	/*
	 * Some bool flags:
	 *
	 * @dl_throttled tells if we exhausted the runtime. If so, the
	 * task has to wait for a replenishment to be performed at the
	 * next firing of dl_timer.
	 *
	 * @dl_yielded tells if task gave up the CPU before consuming
	 * all its available runtime during the last job.
	 *
	 * @dl_non_contending tells if the task is inactive while still
	 * contributing to the active utilization. In other words, it
	 * indicates if the inactive timer has been armed and its handler
	 * has not been executed yet. This flag is useful to avoid race
	 * conditions between the inactive timer handler and the wakeup
	 * code.
	 *
	 * @dl_overrun tells if the task asked to be informed about runtime
	 * overruns.
	 */
	unsigned int		dl_throttled      : 1;
	unsigned int		dl_yielded        : 1;
	unsigned int		dl_non_contending : 1;
	unsigned int		dl_overrun        : 1;
	/*
	 * Bandwidth enforcement timer. Each -deadline task has its
	 * own bandwidth to be enforced, thus we need one timer per task.
	 */
	struct hrtimer		dl_timer;

	/*
	 * Inactive timer, responsible for decreasing the active utilization
	 * at the "0-lag time". When a -deadline task blocks, it contributes
	 * to GRUB's active utilization until the "0-lag time", hence a
	 * timer is needed to decrease the active utilization at the correct
	 * time.
	 */
	struct hrtimer		inactive_timer;

#ifdef CONFIG_RT_MUTEXES
	/*
	 * Priority Inheritance. When a DEADLINE scheduling entity is boosted
	 * pi_se points to the donor, otherwise points to the dl_se it belongs
	 * to (the original one/itself).
	 */
	struct sched_dl_entity	*pi_se;
#endif
};

#ifdef CONFIG_UCLAMP_TASK
/* Number of utilization clamp buckets (shorter alias) */
#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT

/*
 * Utilization clamp for a scheduling entity
 * @value:		clamp value "assigned" to a se
 * @bucket_id:		bucket index corresponding to the "assigned" value
 * @active:		the se is currently refcounted in a rq's bucket
 * @user_defined:	the requested clamp value comes from user-space
 *
 * The bucket_id is the index of the clamp bucket matching the clamp value
 * which is pre-computed and stored to avoid expensive integer divisions from
 * the fast path.
 *
 * The active bit is set whenever a task has got an "effective" value assigned,
 * which can be different from the clamp value "requested" from user-space.
 * This allows to know a task is refcounted in the rq's bucket corresponding
 * to the "effective" bucket_id.
 *
 * The user_defined bit is set whenever a task has got a task-specific clamp
 * value requested from userspace, i.e. the system defaults apply to this task
 * just as a restriction. This allows to relax default clamps when a less
 * restrictive task-specific value has been requested, thus allowing to
 * implement a "nice" semantic. For example, a task running with a 20%
 * default boost can still drop its own boosting to 0%.
 */
struct uclamp_se {
	unsigned int		value		: bits_per(SCHED_CAPACITY_SCALE);
	unsigned int		bucket_id	: bits_per(UCLAMP_BUCKETS);
	unsigned int		active		: 1;
	unsigned int		user_defined	: 1;
};
#endif /* CONFIG_UCLAMP_TASK */
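
As a hedged illustration of the value-to-bucket mapping the comment above describes (a reader's sketch, not taken from the kernel sources), the bucket index can be precomputed once per update with a single integer division, here assuming SCHED_CAPACITY_SCALE of 1024 and a hypothetical 5-bucket configuration:

        /* Sketch of precomputing bucket_id so the hot path never divides. */
        #include <stdio.h>

        #define SCHED_CAPACITY_SCALE    1024
        #define UCLAMP_BUCKETS          5
        #define UCLAMP_BUCKET_DELTA     ((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

        static unsigned int uclamp_bucket_id(unsigned int clamp_value)
        {
                unsigned int id = clamp_value / UCLAMP_BUCKET_DELTA;

                return id < UCLAMP_BUCKETS - 1 ? id : UCLAMP_BUCKETS - 1;
        }

        int main(void)
        {
                unsigned int v;

                for (v = 0; v <= SCHED_CAPACITY_SCALE; v += 256)
                        printf("clamp value %4u -> bucket %u\n", v, uclamp_bucket_id(v));
                return 0;
        }
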
union rcu_special {
	struct {
		u8		blocked;
		u8		need_qs;

rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 19:12:58 +08:00

		u8		exp_hint;	/* Hint for performance. */
		u8		need_mb;	/* Readers need smp_mb(). */
	} b;					/* Bits. */
	u32			s;		/* Set of bits. */
};
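
As a hedged illustration of why the union above is laid out this way (an editorial sketch, not kernel code), the individual flags are addressed through .b while .s lets all of them be tested or cleared in a single load or store:

        /* Sketch of the union-of-byte-flags-and-u32 pattern used above. */
        #include <stdio.h>
        #include <stdint.h>

        union rcu_special_demo {
                struct {
                        uint8_t blocked;
                        uint8_t need_qs;
                        uint8_t exp_hint;
                        uint8_t need_mb;
                } b;                    /* Bits. */
                uint32_t s;             /* Set of bits. */
        };

        int main(void)
        {
                union rcu_special_demo rs = { .s = 0 };

                rs.b.exp_hint = 1;      /* e.g. set by the expedited-GP IPI handler */
                rs.b.blocked  = 1;

                if (rs.s)               /* any special handling pending? one test */
                        printf("special = %#x, handled at rcu_read_unlock() time\n", rs.s);

                rs.s = 0;               /* clear everything at once */
                return 0;
        }
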
enum perf_event_task_context {
	perf_invalid_context = -1,
	perf_hw_context = 0,
	perf_sw_context,
	perf_nr_task_contexts,
};

struct wake_q_node {
	struct wake_q_node *next;
};

struct kmap_ctrl {
#ifdef CONFIG_KMAP_LOCAL
	int			idx;
	pte_t			pteval[KM_MAX_IDX];
#endif
};
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info	thread_info;
#endif
	unsigned int		__state;

#ifdef CONFIG_PREEMPT_RT
	/* saved state for "spinlock sleepers" */
	unsigned int		saved_state;
#endif

	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */
	randomized_struct_fields_start

	void			*stack;
	refcount_t		usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int		flags;
	unsigned int		ptrace;

[PATCH] sched: implement smpnice
Problem:
The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.
For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running. The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each. However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU. Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the users
expectations will be met. The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.
Solution:
The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance. Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example. Suppose that (in a slight variation of the
above example) we have a two CPU system with 4 nice==0 and 4 nice==19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2. If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop. If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it finds ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks, resulting in a system with 2 nice==0 and 2 nice==19 on each CPU with
loads of 2.1 for each CPU.
One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.
Notes:
struct task_struct:
has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.
struct runqueue:
has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue. This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens. This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.
int try_to_wake_up():
in this function the value SCHED_LOAD_BALANCE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on. While this would be valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero, it will lead to anomalous load balancing if the
distribution is skewed in either direction. To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).
int move_tasks():
The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists. This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function. The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load. This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.
struct sched_group *find_busiest_group():
Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created. A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required. As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.
A key change to this function is that it is no longer necessary to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.
void set_user_nice():
has been modified to update the task's load_weight field when its nice
value changes and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.
From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups within the same domain. This patch fixes some
of the listed down scenarios by not considering the sched groups which are
lightly loaded.
a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. but with current
smpnice extra normal priority task keeps jumping from one cpu to another
cpu having the normal priority task. This is because of the
busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including the
cpu with high priority task in max_load calculations but including that in
total and avg_load calculations.. leading to max_load < avg_load and load
balance between cpus running normal priority tasks (2 Vs 1) will always show
imbalance as one normal priority and the extra normal priority task will
keep moving from one cpu to another cpu having normal priority task..
b) 4-way system with HT (8 logical processors). Package-P0 T0 has a
highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal
priority task each.. P2 and P3 are idle. With this patch, one of the
normal priority tasks on P1 will be moved to P2 or P3..
c) With the current weighted smp nice calculations, it doesn't always make
sense to look at the highest weighted runqueue in the busy group..
Consider a load balance scenario on a DP with HT system, with Package-0
containing one high priority and one low priority, Package-1 containing one
low priority (with the other thread being idle).. Package-1 thinks that it needs
to take the low priority thread from Package-0. And find_busiest_queue()
returns the cpu thread with highest priority task.. And ultimately(with
help of active load balance) we move high priority task to Package-1. And
same continues with Package-0 now, moving high priority task from package-1
to package-0.. Even without the presence of active load balance, load
balance will fail to balance the above scenario.. Fix find_busiest_queue
to use "imbalance" when it is lightly loaded.
[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:54:34 +08:00
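
As a hedged toy model of the smpnice idea described above (balance weighted load rather than task counts), the sketch below reuses the 95%/5% example from the changelog; the weights 1.0 and 0.05 are illustrative only and are not the kernel's actual table.

        #include <stdio.h>

        static double task_weight(int nice)
        {
                return nice == 0 ? 1.0 : 0.05;  /* illustrative, not the kernel table */
        }

        int main(void)
        {
                /* CPU0 runs four nice==0 spinners, CPU1 runs four nice==19 spinners. */
                int cpu0_nice[] = { 0, 0, 0, 0 };
                int cpu1_nice[] = { 19, 19, 19, 19 };
                double load0 = 0.0, load1 = 0.0;

                for (int i = 0; i < 4; i++) {
                        load0 += task_weight(cpu0_nice[i]);
                        load1 += task_weight(cpu1_nice[i]);
                }

                /* Balancing on task count sees 4 vs 4 and does nothing; balancing on
                 * weighted load sees 4.0 vs 0.2 and moves half the difference. */
                printf("raw_weighted_load: cpu0=%.1f cpu1=%.1f, move %.1f of weight\n",
                       load0, load1, (load0 - load1) / 2);
                return 0;
        }
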
#ifdef CONFIG_SMP
	int			on_cpu;
	struct __call_single_node wake_entry;
	unsigned int		wakee_flips;
	unsigned long		wakee_flip_decay_ts;
	struct task_struct	*last_wakee;

	/*
	 * recent_used_cpu is initially set as the last CPU used by a task
	 * that wakes affine another task. Waker/wakee relationships can
	 * push tasks around a CPU where each wakeup moves to the next one.
	 * Tracking a recently used CPU allows a quick search for a recently
	 * used CPU that may be idle.
	 */
	int			recent_used_cpu;
	int			wake_cpu;
#endif
	int			on_rq;

	int			prio;
	int			static_prio;
	int			normal_prio;
	unsigned int		rt_priority;

	struct sched_entity	se;
	struct sched_rt_entity	rt;
	struct sched_dl_entity	dl;
	const struct sched_class *sched_class;

#ifdef CONFIG_SCHED_CORE
	struct rb_node		core_node;
	unsigned long		core_cookie;
	unsigned int		core_occupation;
#endif

#ifdef CONFIG_CGROUP_SCHED
	struct task_group	*sched_task_group;
#endif

#ifdef CONFIG_UCLAMP_TASK

sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.
To let system admins and device integrators control the default behavior
globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-16 19:03:45 +08:00
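
As an editorial sketch of how such an init script might use the knob (not part of the patch), the snippet below lowers the default RT boost at runtime; it assumes the sysctl is exposed at /proc/sys/kernel/sched_util_clamp_min_rt_default on a 0..1024 scale.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/sys/kernel/sched_util_clamp_min_rt_default", "w");

                if (!f) {
                        perror("open sysctl");
                        return 1;
                }
                /* 128 of 1024: RT tasks still get a mild boost instead of max capacity. */
                fprintf(f, "%d\n", 128);
                fclose(f);
                return 0;
        }
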
	/*
	 * Clamp values requested for a scheduling entity.
	 * Must be updated with task_rq_lock() held.
	 */
	struct uclamp_se	uclamp_req[UCLAMP_CNT];
	/*
	 * Effective clamp values used for a scheduling entity.
	 * Must be updated with task_rq_lock() held.
	 */
	struct uclamp_se	uclamp[UCLAMP_CNT];
#endif

sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we
should make it independent of fair sched class. The struct sched_statistics
is the scheduler statistics of a task_struct or a task_group. So we can
move it into struct task_struct and struct task_group to achieve the goal.
After the patch, schedstats are organized as follows,
struct task_struct {
...
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
...
struct sched_statistics stats;
...
};
Regarding the task group, schedstats is only supported for fair group
sched, and a new struct sched_entity_stats is introduced, suggested by
Peter -
struct sched_entity_stats {
struct sched_entity se;
struct sched_statistics stats;
} __no_randomize_layout;
Then with the se in a task_group, we can easily get the stats.
The sched_statistics members may be frequently modified when schedstats is
enabled; in order to avoid impacting random data which may be in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.
As this patch changes a core struct of the scheduler, I verified its
performance impact on the scheduler with 'perf bench sched
pipe', as suggested by Mel. Below is the result, in which all the values
are in usecs/op.
                             Before      After
kernel.sched_schedstats=0    5.2~5.4     5.2~5.4
kernel.sched_schedstats=1    5.3~5.5     5.3~5.5
[These data differ a little from the earlier version, because my old test
machine was destroyed so I had to use a different test machine.]
Almost no impact on the sched performance.
No functional change.
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-09-05 22:35:41 +08:00
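
As a hedged user-space model of the lookup implied by the changelog above (the stats of a group se are recovered from the enclosing sched_entity_stats), the demo_* types below are made up for illustration and simply exercise the usual offsetof/container_of trick:

        #include <stdio.h>
        #include <stddef.h>

        #define container_of(ptr, type, member) \
                ((type *)((char *)(ptr) - offsetof(type, member)))

        struct sched_entity_demo     { int load; };
        struct sched_statistics_demo { unsigned long wait_sum; };

        struct sched_entity_stats_demo {
                struct sched_entity_demo     se;
                struct sched_statistics_demo stats;
        };

        int main(void)
        {
                struct sched_entity_stats_demo group = { .stats = { .wait_sum = 42 } };
                struct sched_entity_demo *se = &group.se;   /* all we are handed */

                struct sched_statistics_demo *stats =
                        &container_of(se, struct sched_entity_stats_demo, se)->stats;

                printf("wait_sum = %lu\n", stats->wait_sum);
                return 0;
        }
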
	struct sched_statistics	stats;

#ifdef CONFIG_PREEMPT_NOTIFIERS
	/* List of struct preempt_notifier: */
	struct hlist_head	preempt_notifiers;
#endif

#ifdef CONFIG_BLK_DEV_IO_TRACE
	unsigned int		btrace_seq;
#endif

	unsigned int		policy;
	int			nr_cpus_allowed;
	const cpumask_t		*cpus_ptr;
	cpumask_t		*user_cpus_ptr;
	cpumask_t		cpus_mask;
	void			*migration_pending;
#ifdef CONFIG_SMP
	unsigned short		migration_disabled;
#endif
	unsigned short		migration_flags;

#ifdef CONFIG_PREEMPT_RCU
	int			rcu_read_lock_nesting;
	union rcu_special	rcu_read_unlock_special;
	struct list_head	rcu_node_entry;
	struct rcu_node		*rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

#ifdef CONFIG_TASKS_RCU
	unsigned long		rcu_tasks_nvcsw;
	u8			rcu_tasks_holdout;
	u8			rcu_tasks_idx;
	int			rcu_tasks_idle_cpu;
	struct list_head	rcu_tasks_holdout_list;
#endif /* #ifdef CONFIG_TASKS_RCU */

rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated. This commit therefore adds a variant of
Tasks RCU that:
o Has explicit read-side markers to allow finite grace periods in
the face of in-kernel loops for PREEMPT=n builds. These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace().
o Protects code in the idle loop, exception entry/exit, and
CPU-hotplug code paths. In this respect, RCU-tasks trace is
similar to SRCU, but with lighter-weight readers.
o Avoids expensive read-side instructions, having overhead similar
to that of Preemptible RCU.
There are of course downsides:
o The grace-period code can send IPIs to CPUs, even when those
CPUs are in the idle loop or in nohz_full userspace. This is
mitigated by later commits.
o It is necessary to scan the full tasklist, much as for Tasks RCU.
o There is a single callback queue guarded by a single lock,
again, much as for Tasks RCU. However, those early use cases
that request multiple grace periods in quick succession are
expected to do so from a single task, which makes the single
lock almost irrelevant. If needed, multiple callback queues
can be provided using any number of schemes.
Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.
The memory ordering was outlined here:
https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko. At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.
Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
[ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
[ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
[ paulmck: Fix locking issue reported by rcutorture. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-10 10:56:53 +08:00
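
As a hedged kernel-context sketch of how such readers and updaters pair up (not taken from this patch; the demo_* hook names are made up, and CONFIG_TASKS_TRACE_RCU is assumed), a tracing-style hook could be protected like this:

        #include <linux/rcupdate.h>
        #include <linux/rcupdate_trace.h>
        #include <linux/slab.h>

        struct demo_hook {
                void (*func)(void);
        };

        static struct demo_hook __rcu *demo_hook_ptr;

        /* Reader: may run from contexts plain RCU readers must avoid,
         * such as the idle loop or exception entry/exit. */
        static void demo_call_hook(void)
        {
                struct demo_hook *h;

                rcu_read_lock_trace();
                h = rcu_dereference_check(demo_hook_ptr, rcu_read_lock_trace_held());
                if (h)
                        h->func();
                rcu_read_unlock_trace();
        }

        /* Updater: unpublish the hook, wait for all RCU Tasks Trace readers,
         * then free the old structure. */
        static void demo_remove_hook(void)
        {
                struct demo_hook *old = rcu_replace_pointer(demo_hook_ptr, NULL, true);

                synchronize_rcu_tasks_trace();
                kfree(old);
        }
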
#ifdef CONFIG_TASKS_TRACE_RCU
	int			trc_reader_nesting;
	int			trc_ipi_to_cpu;
	union rcu_special	trc_reader_special;
	struct list_head	trc_holdout_list;
	struct list_head	trc_blkd_node;
	int			trc_blkd_cpu;
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */

	struct sched_info	sched_info;

	struct list_head	tasks;
#ifdef CONFIG_SMP
	struct plist_node	pushable_tasks;
	struct rb_node		pushable_dl_tasks;
#endif

	struct mm_struct	*mm;
	struct mm_struct	*active_mm;

#ifdef SPLIT_RSS_COUNTING
	struct task_rss_stat	rss_stat;
#endif
	int			exit_state;
	int			exit_code;
	int			exit_signal;
	/* The signal sent when the parent dies: */
	int			pdeath_signal;
	/* JOBCTL_*, siglock protected: */
	unsigned long		jobctl;

	/* Used for emulating ABI behavior of previous Linux versions: */
	unsigned int		personality;

	/* Scheduler bits, serialized by scheduler locks: */
	unsigned		sched_reset_on_fork:1;
	unsigned		sched_contributes_to_load:1;
	unsigned		sched_migrated:1;

psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risking complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the load average).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
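As a quick illustration of the interface described above, the following
user-space sketch just dumps the memory pressure file. It assumes a kernel
built with CONFIG_PSI=y; the "some/full avg10= avg60= avg300= total=" line
format is what current kernels print, but treat the exact format as an
assumption here.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/memory", "r");
	if (!f) {
		perror("fopen /proc/pressure/memory");
		return EXIT_FAILURE;
	}
	/* Print the "some"/"full" stall averages as-is. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return EXIT_SUCCESS;
}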
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
unsigned sched_psi_wake_requeue:1;
|
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Force alignment to the next boundary: */
|
|
|
|
unsigned :0;
|
|
|
|
|
|
|
|
/* Unserialized, strictly 'current' */
|
|
|
|
|
2020-11-17 16:08:41 +08:00
|
|
|
/*
|
|
|
|
* This field must not be in the scheduler word above due to wakelist
|
|
|
|
* queueing no longer being serialized by p->on_cpu. However:
|
|
|
|
*
|
|
|
|
* p->XXX = X; ttwu()
|
|
|
|
* schedule() if (p->on_rq && ..) // false
|
|
|
|
* smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
|
|
|
|
* deactivate_task() ttwu_queue_wakelist())
|
|
|
|
* p->on_rq = 0; p->sched_remote_wakeup = Y;
|
|
|
|
*
|
|
|
|
* guarantees all stores of 'current' are visible before
|
|
|
|
* ->sched_remote_wakeup gets used, so it can be in this word.
|
|
|
|
*/
|
|
|
|
unsigned sched_remote_wakeup:1;
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Bit to tell LSMs we're in execve(): */
|
|
|
|
unsigned in_execve:1;
|
|
|
|
unsigned in_iowait:1;
|
|
|
|
#ifndef TIF_RESTORE_SIGMASK
|
|
|
|
unsigned restore_sigmask:1;
|
signal: consolidate {TS,TLF}_RESTORE_SIGMASK code
In general, there's no need for the "restore sigmask" flag to live in
ti->flags. alpha, ia64, microblaze, powerpc, sh, sparc (64-bit only),
tile, and x86 use essentially identical alternative implementations,
placing the flag in ti->status.
Replace those optimized implementations with an equally good common
implementation that stores it in a bitfield in struct task_struct and
drop the custom implementations.
Additional architectures can opt in by removing their
TIF_RESTORE_SIGMASK defines.
Link: http://lkml.kernel.org/r/8a14321d64a28e40adfddc90e18a96c086a6d6f9.1468522723.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-03 05:05:36 +08:00
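The common implementation boils down to a pair of trivial helpers on the
task_struct bitfield. The sketch below follows the kernel's naming for these
helpers, but it is only an illustration of the idea: it relies on the
kernel's 'current' macro and the restore_sigmask bit declared above, and is
only meaningful when the architecture does not define TIF_RESTORE_SIGMASK.
#ifndef TIF_RESTORE_SIGMASK
/* Kernel-style sketch, not standalone code. */
static inline void set_restore_sigmask(void)
{
	current->restore_sigmask = true;
}
static inline bool test_and_clear_restore_sigmask(void)
{
	if (!current->restore_sigmask)
		return false;
	current->restore_sigmask = false;
	return true;
}
#endif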
|
|
|
#endif
|
2015-11-06 10:46:09 +08:00
|
|
|
#ifdef CONFIG_MEMCG
|
2018-08-18 06:47:11 +08:00
|
|
|
unsigned in_user_fault:1;
|
2016-01-21 07:02:32 +08:00
|
|
|
#endif
|
mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:02 +08:00
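To make the truncation concrete, here is a small user-space model of the
seq-to-gen mapping. MAX_NR_GENS = 4 matches the kernel's default, and the
modulo mapping stands in for the real helper; take it as an illustration
only, with all of the surrounding bookkeeping omitted.
#include <stdio.h>

#define MAX_NR_GENS 4UL	/* kernel default; order_base_2(4+1) = 3 bits in folio->flags */

static unsigned long lru_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;	/* index into lrugen->lists[] */
}

int main(void)
{
	unsigned long seq;

	/* max_seq/min_seq only ever grow; their truncated forms wrap around. */
	for (seq = 0; seq < 10; seq++)
		printf("seq=%lu -> gen index %lu\n", seq, lru_gen_from_seq(seq));
	return 0;
}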
|
|
|
#ifdef CONFIG_LRU_GEN
|
|
|
|
/* whether the LRU algorithm may apply to this access */
|
|
|
|
unsigned in_lru_fault:1;
|
|
|
|
#endif
|
2015-04-18 02:05:30 +08:00
|
|
|
#ifdef CONFIG_COMPAT_BRK
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned brk_randomized:1;
|
2015-04-18 02:05:30 +08:00
|
|
|
#endif
|
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from the waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-17 04:54:24 +08:00
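The effect of the new bit can be summarized by a tiny kernel-style sketch;
the helper name below is invented for illustration, and only the test of
no_cgroup_migration reflects what the patch adds.
/* Illustrative only -- not the kernel's actual function. */
static bool userland_may_migrate(const struct task_struct *p)
{
	/* kthreads between creation and kthread_bind() keep this bit set */
	return !p->no_cgroup_migration;
}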
|
|
|
#ifdef CONFIG_CGROUPS
|
|
|
|
/* disallow userland-initiated cgroup migration */
|
|
|
|
unsigned no_cgroup_migration:1;
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.
This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
mostly tried to imitate the system-wide freezer. While uninterruptible
sleep is fine when all tasks are going to be frozen (the hibernation case),
it is not an acceptable state when only a subset of the system is frozen.
The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH does not work on frozen tasks because non-fatal signal
delivery is blocked in the frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-20 01:03:04 +08:00
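A minimal user-space sketch of driving this interface is shown below. The
cgroup directory is a made-up example; it assumes a cgroup v2 mount at
/sys/fs/cgroup with the cgroup.freeze and cgroup.events files described
above.
#include <stdio.h>
#include <string.h>

#define CG "/sys/fs/cgroup/mygroup"	/* hypothetical cgroup directory */

int main(void)
{
	char line[128];
	FILE *f = fopen(CG "/cgroup.freeze", "w");

	if (!f) { perror("cgroup.freeze"); return 1; }
	fputs("1\n", f);	/* request the frozen state */
	fclose(f);

	/* The interface is asynchronous: the actual state shows up in
	 * cgroup.events as a "frozen 0|1" field. */
	f = fopen(CG "/cgroup.events", "r");
	if (!f) { perror("cgroup.events"); return 1; }
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "frozen", 6))
			fputs(line, stdout);
	fclose(f);
	return 0;
}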
|
|
|
/* task is frozen/stopped (used by the cgroup freezer) */
|
|
|
|
unsigned frozen:1;
|
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from the waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-17 04:54:24 +08:00
|
|
|
#endif
|
2018-07-03 23:14:55 +08:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
|
|
unsigned use_memdelay:1;
|
|
|
|
#endif
|
2020-03-17 09:28:05 +08:00
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
/* Stalled due to lack of memory */
|
|
|
|
unsigned in_memstall:1;
|
|
|
|
#endif
|
2021-04-30 13:55:08 +08:00
|
|
|
#ifdef CONFIG_PAGE_OWNER
|
|
|
|
/* Used by page_owner=on to detect recursion in page tracking. */
|
|
|
|
unsigned in_page_owner:1;
|
|
|
|
#endif
|
2021-07-29 19:01:59 +08:00
|
|
|
#ifdef CONFIG_EVENTFD
|
|
|
|
/* Recursion prevention for eventfd_signal() */
|
eventfd: guard wake_up in eventfd fs calls as well
Guard wakeups that the user can trigger, and that may end up triggering a
call back into eventfd_signal. This is in addition to the current approach
that only guards in eventfd_signal.
Rename in_eventfd_signal -> in_eventfd at the same time to reflect this.
Without this there would be a deadlock in the following code using libaio:
#include <assert.h>
#include <libaio.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>
/* build with -laio */
int main()
{
	struct io_context *ctx = NULL;
	struct iocb iocb;
	struct iocb *iocbs[] = { &iocb };
	int evfd;
	uint64_t val = 1;
	evfd = eventfd(0, EFD_CLOEXEC);
	assert(!io_setup(2, &ctx));
	io_prep_poll(&iocb, evfd, POLLIN);
	io_set_eventfd(&iocb, evfd);
	assert(1 == io_submit(ctx, 1, iocbs));
	write(evfd, &val, 8);
}
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220816135959.1490641-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-16 21:59:59 +08:00
|
|
|
unsigned in_eventfd:1;
|
2021-07-29 19:01:59 +08:00
|
|
|
#endif
|
2022-02-08 07:02:50 +08:00
|
|
|
#ifdef CONFIG_IOMMU_SVA
|
|
|
|
unsigned pasid_activated:1;
|
|
|
|
#endif
|
2022-03-11 04:48:53 +08:00
|
|
|
#ifdef CONFIG_CPU_SUP_INTEL
|
|
|
|
unsigned reported_split_lock:1;
|
|
|
|
#endif
|
2022-08-15 15:11:35 +08:00
|
|
|
#ifdef CONFIG_TASK_DELAY_ACCT
|
|
|
|
/* delay due to memory thrashing */
|
|
|
|
unsigned in_thrashing:1;
|
|
|
|
#endif
|
2014-12-13 08:55:15 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long atomic_flags; /* Flags requiring atomic access. */
|
2014-05-22 06:23:46 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct restart_block restart_block;
|
2015-02-13 07:01:14 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
pid_t pid;
|
|
|
|
pid_t tgid;
|
2006-09-26 16:52:38 +08:00
|
|
|
|
Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables
The changes to automatically test for working stack protector compiler
support in the Kconfig files removed the special STACKPROTECTOR_AUTO
option that picked the strongest stack protector that the compiler
supported.
That was all a nice cleanup - it makes no sense to have the AUTO case
now that the Kconfig phase can just determine the compiler support
directly.
HOWEVER.
It also meant that doing "make oldconfig" would now _disable_ the strong
stackprotector if you had AUTO enabled, because in a legacy config file,
the sane stack protector configuration would look like
CONFIG_HAVE_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_STACKPROTECTOR_AUTO=y
and when you ran this through "make oldconfig" with the Kbuild changes,
it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
used to be disabled (because it was really enabled by AUTO), and would
disable it in the new config, resulting in:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
That's dangerously subtle - people could suddenly find themselves with
the weaker stack protector setup without even realizing.
The solution here is to rename not just the old REGULAR stack
protector option, but also the strong one. This does that by just
removing the CC_ prefix entirely for the user choices, because it really
is not about the compiler support (the compiler support now instead
automatically impacts _visibility_ of the options to users).
This results in "make oldconfig" actually asking the user for their
choice, so that we don't have any silent subtle security model changes.
The end result would generally look like this:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
where the "CC_" versions really are about internal compiler
infrastructure, not the user selections.
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-14 11:21:18 +08:00
|
|
|
#ifdef CONFIG_STACKPROTECTOR
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Canary value for the -fstack-protector GCC feature: */
|
|
|
|
unsigned long stack_canary;
|
2009-08-18 14:06:02 +08:00
|
|
|
#endif
|
2012-05-11 08:59:08 +08:00
|
|
|
/*
|
2017-02-07 05:06:35 +08:00
|
|
|
* Pointers to the (original) parent process, youngest child, younger sibling,
|
2012-05-11 08:59:08 +08:00
|
|
|
* older sibling, respectively. (p->father can be replaced with
|
2008-03-25 09:36:23 +08:00
|
|
|
* p->real_parent->pid)
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
/* Real parent process: */
|
|
|
|
struct task_struct __rcu *real_parent;
|
|
|
|
|
|
|
|
/* Recipient of SIGCHLD, wait4() reports: */
|
|
|
|
struct task_struct __rcu *parent;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2017-02-07 05:06:35 +08:00
|
|
|
* Children/sibling form the list of natural children:
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
struct list_head children;
|
|
|
|
struct list_head sibling;
|
|
|
|
struct task_struct *group_leader;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-03-25 09:36:23 +08:00
|
|
|
/*
|
2017-02-07 05:06:35 +08:00
|
|
|
* 'ptraced' is the list of tasks this task is using ptrace() on.
|
|
|
|
*
|
2008-03-25 09:36:23 +08:00
|
|
|
* This includes both natural children and PTRACE_ATTACH targets.
|
2017-02-07 05:06:35 +08:00
|
|
|
* 'ptrace_entry' is this task's link on the p->parent->ptraced list.
|
2008-03-25 09:36:23 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
struct list_head ptraced;
|
|
|
|
struct list_head ptrace_entry;
|
2008-03-25 09:36:23 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* PID/PID hash table linkage. */
|
2017-09-27 02:06:43 +08:00
|
|
|
struct pid *thread_pid;
|
|
|
|
struct hlist_node pid_links[PIDTYPE_MAX];
|
2017-02-07 05:06:35 +08:00
|
|
|
struct list_head thread_group;
|
|
|
|
struct list_head thread_node;
|
|
|
|
|
|
|
|
struct completion *vfork_done;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* CLONE_CHILD_SETTID: */
|
|
|
|
int __user *set_child_tid;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* CLONE_CHILD_CLEARTID: */
|
|
|
|
int __user *clear_child_tid;
|
|
|
|
|
2021-12-23 12:10:09 +08:00
|
|
|
/* PF_KTHREAD | PF_IO_WORKER */
|
|
|
|
void *worker_private;
|
2021-02-17 05:15:30 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 utime;
|
|
|
|
u64 stime;
|
2016-11-15 10:06:51 +08:00
|
|
|
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 utimescaled;
|
|
|
|
u64 stimescaled;
|
2016-11-15 10:06:51 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 gtime;
|
|
|
|
struct prev_cputime prev_cputime;
|
2012-12-17 03:00:34 +08:00
|
|
|
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
|
2017-06-30 01:15:10 +08:00
|
|
|
struct vtime vtime;
|
2009-12-02 16:26:47 +08:00
|
|
|
#endif
|
2015-06-07 21:54:30 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
2017-02-07 05:06:35 +08:00
|
|
|
atomic_t tick_dep_mask;
|
2015-06-07 21:54:30 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Context switch counts: */
|
|
|
|
unsigned long nvcsw;
|
|
|
|
unsigned long nivcsw;
|
|
|
|
|
|
|
|
/* Monotonic time in nsecs: */
|
|
|
|
u64 start_time;
|
|
|
|
|
|
|
|
/* Boot based time in nsecs: */
|
2019-11-07 18:07:58 +08:00
|
|
|
u64 start_boottime;
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */
|
|
|
|
unsigned long min_flt;
|
|
|
|
unsigned long maj_flt;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2019-08-22 03:09:04 +08:00
|
|
|
/* Empty if CONFIG_POSIX_CPUTIMERS=n */
|
|
|
|
struct posix_cputimers posix_cputimers;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-07-30 18:14:06 +08:00
|
|
|
#ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
|
|
|
|
struct posix_cputimers_work posix_cputimers_work;
|
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Process credentials: */
|
|
|
|
|
|
|
|
/* Tracer's credentials at attach: */
|
|
|
|
const struct cred __rcu *ptracer_cred;
|
|
|
|
|
|
|
|
/* Objective and real subjective task credentials (COW): */
|
|
|
|
const struct cred __rcu *real_cred;
|
|
|
|
|
|
|
|
/* Effective (overridable) subjective task credentials (COW): */
|
|
|
|
const struct cred __rcu *cred;
|
|
|
|
|
keys: Cache result of request_key*() temporarily in task_struct
If a filesystem uses keys to hold authentication tokens, then it needs a
token for each VFS operation that might perform an authentication check -
either by passing it to the server, or using to perform a check based on
authentication data cached locally.
For open files this isn't a problem, since the key should be cached in the
file struct since it represents the subject performing operations on that
file descriptor.
During pathwalk, however, there isn't anywhere to cache the key, except
perhaps in the nameidata struct - but that isn't exposed to the
filesystems. Further, a pathwalk can incur a lot of operations, calling
one or more of the following, for instance:
->lookup()
->permission()
->d_revalidate()
->d_automount()
->get_acl()
->getxattr()
on each dentry/inode it encounters - and each one may need to call
request_key(). And then, at the end of pathwalk, it will call the actual
operation:
->mkdir()
->mknod()
->getattr()
->open()
...
which may need to go and get the token again.
However, it is very likely that all of the operations on a single
dentry/inode - and quite possibly a sequence of them - will all want to use
the same authentication token, which suggests that caching it would be a
good idea.
To this end:
(1) Make it so that a positive result of request_key() and co. that didn't
require upcalling to userspace is cached temporarily in task_struct.
(2) The cache is 1 deep, so a new result displaces the old one.
(3) The key is released by exit and by notify-resume.
(4) The cache is cleared in a newly forked process.
Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-19 23:10:15 +08:00
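The caching policy itself is simple enough to model in a few lines of user
space. The names below are illustrative (they are not the kernel's key API);
the point is only that the cache is one slot deep and a new result displaces
the old one.
#include <stdio.h>
#include <string.h>

struct task_model {
	char cached_key[64];	/* "" means empty; stands in for a struct key * */
};

static const char *request_key_model(struct task_model *t, const char *desc)
{
	if (!strcmp(t->cached_key, desc))
		return t->cached_key;		/* hit: no fresh lookup needed */
	snprintf(t->cached_key, sizeof(t->cached_key), "%s", desc);
	return t->cached_key;			/* miss: new result displaces the old */
}

int main(void)
{
	struct task_model t = { "" };

	request_key_model(&t, "afs;example.org");	/* hypothetical description */
	request_key_model(&t, "afs;example.org");	/* second call hits the cache */
	printf("cached: %s\n", t.cached_key);
	return 0;
}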
|
|
|
#ifdef CONFIG_KEYS
|
|
|
|
/* Cached requested key. */
|
|
|
|
struct key *cached_requested_key;
|
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/*
|
|
|
|
* executable name, excluding path.
|
|
|
|
*
|
|
|
|
* - normally initialized by setup_new_exec()
|
|
|
|
* - access it with [gs]et_task_comm()
|
|
|
|
* - lock it with task_lock()
|
|
|
|
*/
|
|
|
|
char comm[TASK_COMM_LEN];
|
|
|
|
|
|
|
|
struct nameidata *nameidata;
|
|
|
|
|
2006-09-29 16:59:40 +08:00
|
|
|
#ifdef CONFIG_SYSVIPC
|
2017-02-07 05:06:35 +08:00
|
|
|
struct sysv_sem sysvsem;
|
|
|
|
struct sysv_shm sysvshm;
|
2006-09-29 16:59:40 +08:00
|
|
|
#endif
|
2009-01-16 03:08:40 +08:00
|
|
|
#ifdef CONFIG_DETECT_HUNG_TASK
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long last_switch_count;
|
2018-08-22 12:55:52 +08:00
|
|
|
unsigned long last_switch_time;
|
2008-01-26 04:08:02 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Filesystem information: */
|
|
|
|
struct fs_struct *fs;
|
|
|
|
|
|
|
|
/* Open file information: */
|
|
|
|
struct files_struct *files;
|
|
|
|
|
2020-09-14 03:09:39 +08:00
|
|
|
#ifdef CONFIG_IO_URING
|
|
|
|
struct io_uring_task *io_uring;
|
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Namespaces: */
|
|
|
|
struct nsproxy *nsproxy;
|
|
|
|
|
|
|
|
/* Signal handlers: */
|
|
|
|
struct signal_struct *signal;
|
2020-01-24 12:59:08 +08:00
|
|
|
struct sighand_struct __rcu *sighand;
|
2017-02-07 05:06:35 +08:00
|
|
|
sigset_t blocked;
|
|
|
|
sigset_t real_blocked;
|
|
|
|
/* Restored if set_restore_sigmask() was used: */
|
|
|
|
sigset_t saved_sigmask;
|
|
|
|
struct sigpending pending;
|
|
|
|
unsigned long sas_ss_sp;
|
|
|
|
size_t sas_ss_size;
|
|
|
|
unsigned int sas_ss_flags;
|
|
|
|
|
|
|
|
struct callback_head *task_works;
|
|
|
|
|
2019-01-23 06:06:39 +08:00
|
|
|
#ifdef CONFIG_AUDIT
|
2008-01-10 17:53:18 +08:00
|
|
|
#ifdef CONFIG_AUDITSYSCALL
|
2019-02-02 11:45:17 +08:00
|
|
|
struct audit_context *audit_context;
|
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
kuid_t loginuid;
|
|
|
|
unsigned int sessionid;
|
2008-01-10 17:53:18 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
struct seccomp seccomp;
|
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap, nor does the kernel have to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-11-28 03:32:34 +08:00
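For reference, a hedged user-space sketch of the prctl() call looks like the
following. The numeric fallback constants mirror the UAPI values at the time
of writing but are defined here only as assumptions for older libc headers,
and the selector is left at "allow" so nothing is actually redirected.
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH	59
# define PR_SYS_DISPATCH_OFF		0
# define PR_SYS_DISPATCH_ON		1
# define SYSCALL_DISPATCH_FILTER_ALLOW	0
# define SYSCALL_DISPATCH_FILTER_BLOCK	1
#endif

static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;

int main(void)
{
	/* No bypass region (<off>=0, <length>=0); the selector alone decides. */
	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		  0, 0, &selector)) {
		perror("PR_SET_SYSCALL_USER_DISPATCH");
		return 1;
	}
	/* Flipping the selector to ..._BLOCK would now turn syscalls outside
	 * the bypass region into SIGSYS. */
	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0, 0, 0);
	puts("syscall user dispatch toggled on and off");
	return 0;
}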
|
|
|
struct syscall_user_dispatch syscall_dispatch;
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
/* Thread group tracking: */
|
2020-03-31 08:01:04 +08:00
|
|
|
u64 parent_exec_id;
|
|
|
|
u64 self_exec_id;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
|
|
|
|
spinlock_t alloc_lock;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-06-27 17:54:51 +08:00
|
|
|
/* Protection of the PI data structures: */
|
2017-02-07 05:06:35 +08:00
|
|
|
raw_spinlock_t pi_lock;
|
2006-06-27 17:54:51 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct wake_q_node wake_q;
|
2015-05-01 23:27:50 +08:00
|
|
|
|
2006-06-27 17:54:53 +08:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
2017-02-07 05:06:35 +08:00
|
|
|
/* PI waiters blocked on a rt_mutex held by this task: */
|
2017-09-09 07:15:01 +08:00
|
|
|
struct rb_root_cached pi_waiters;
|
2017-03-23 22:56:08 +08:00
|
|
|
/* Updated under owner's pi_lock and rq lock */
|
|
|
|
struct task_struct *pi_top_task;
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Deadlock detection and priority inheritance handling: */
|
|
|
|
struct rt_mutex_waiter *pi_blocked_on;
|
2006-06-27 17:54:53 +08:00
|
|
|
#endif
|
|
|
|
|
2006-01-10 07:59:20 +08:00
|
|
|
#ifdef CONFIG_DEBUG_MUTEXES
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Mutex deadlock detection: */
|
|
|
|
struct mutex_waiter *blocked_on;
|
2006-01-10 07:59:20 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2019-08-27 04:14:23 +08:00
|
|
|
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
|
|
|
|
int non_block_count;
|
|
|
|
#endif
|
|
|
|
|
2006-07-03 15:24:42 +08:00
|
|
|
#ifdef CONFIG_TRACE_IRQFLAGS
|
2020-07-29 19:09:15 +08:00
|
|
|
struct irqtrace_events irqtrace;
|
lockdep: Introduce wait-type checks
Extend lockdep to validate lock wait-type context.
The current wait-types are:
LD_WAIT_FREE, /* wait free, rcu etc.. */
LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */
LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */
Where lockdep validates that the current lock (the one being acquired)
fits in the current wait-context (as generated by the held stack).
This ensures that there is no attempt to acquire mutexes while holding
spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
other words, it's a fancier might_sleep().
Obviously RCU made the entire ordeal more complex than a simple single
value test because RCU can be acquired in (pretty much) any context, and
while it presents a context to nested locks it is not the same as the
context it got acquired in.
Therefore it's necessary to split the wait_type into two values, one
representing the acquire (outer) and one representing the nested context
(inner). For most 'normal' locks these two are the same.
[ To make static initialization easier we have the rule that:
.outer == INV means .outer == .inner; because INV == 0. ]
It further means that it's required to find the minimal .inner of the held
stack to compare against the outer of the new lock; because while 'normal'
RCU presents a CONFIG type to nested locks, if it is taken while already
holding a SPIN type it obviously doesn't relax the rules.
Below is an example output generated by the trivial test code:
raw_spin_lock(&foo);
spin_lock(&bar);
spin_unlock(&bar);
raw_spin_unlock(&foo);
[ BUG: Invalid wait context ]
-----------------------------
swapper/0/1 is trying to lock:
ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
other info that might help us debug this:
1 lock held by swapper/0/1:
#0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
The way to read it is to look at the new -{n,m} part in the lock
description; -{3:3} for the attempted lock, and try and match that up to
the held locks, which in this case is the one: -{2,2}.
This tells that the acquiring lock requires a more relaxed environment than
presented by the lock stack.
Currently only the normal locks and RCU are converted, the rest of the
lockdep users defaults to .inner = INV which is ignored. More conversions
can be done when desired.
The check for spinlock_t nesting is not enabled by default. It's a separate
config option for now as there are known problems which are currently
addressed. The config option allows to identify these problems and to
verify that the solutions found are indeed solving them.
The config switch will be removed and the checks will permanently enabled
once the vast majority of issues has been addressed.
[ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
[ tglx: Add the config option ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21 19:26:01 +08:00
|
|
|
unsigned int hardirq_threaded;
|
2020-03-18 21:22:03 +08:00
|
|
|
u64 hardirq_chain_key;
|
2017-02-07 05:06:35 +08:00
|
|
|
int softirqs_enabled;
|
|
|
|
int softirq_context;
|
2020-03-21 19:26:02 +08:00
|
|
|
int irq_config;
|
2006-07-03 15:24:42 +08:00
|
|
|
#endif
|
2021-03-09 16:55:53 +08:00
|
|
|
#ifdef CONFIG_PREEMPT_RT
|
|
|
|
int softirq_disable_cnt;
|
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persists during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
#ifdef CONFIG_LOCKDEP
|
2017-02-07 05:06:35 +08:00
|
|
|
# define MAX_LOCK_DEPTH 48UL
|
|
|
|
u64 curr_chain_key;
|
|
|
|
int lockdep_depth;
|
|
|
|
unsigned int lockdep_recursion;
|
|
|
|
struct held_lock held_locks[MAX_LOCK_DEPTH];
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persists during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2020-10-16 11:13:35 +08:00
|
|
|
#if defined(CONFIG_UBSAN) && !defined(CONFIG_UBSAN_TRAP)
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned int in_ubsan;
|
2016-01-21 07:00:55 +08:00
|
|
|
#endif
|
2006-01-10 07:59:20 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Journalling filesystem info: */
|
|
|
|
void *journal_info;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Stacked block device info: */
|
|
|
|
struct bio_list *bio_list;
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 15:53:42 +08:00
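The deferral scheme reads roughly like the kernel-style sketch below. It is
a simplification (the per-bio dispatch helper name is invented), intended
only to show how current->bio_list turns recursion into iteration.
/* Simplified kernel-style sketch -- not the exact upstream function. */
void make_request_sketch(struct bio *bio)
{
	struct bio_list pending;

	if (current->bio_list) {
		/* Recursive call: park the bio; the active call handles it. */
		bio_list_add(current->bio_list, bio);
		return;
	}

	bio_list_init(&pending);
	current->bio_list = &pending;
	do {
		dispatch_one_bio(bio);	/* stand-in for the real per-bio dispatch */
	} while ((bio = bio_list_pop(&pending)) != NULL);
	current->bio_list = NULL;
}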
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Stack plugging: */
|
|
|
|
struct blk_plug *plug;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* VM state: */
|
|
|
|
struct reclaim_state *reclaim_state;
|
|
|
|
|
|
|
|
struct backing_dev_info *backing_dev_info;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct io_context *io_context;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2019-03-06 07:45:41 +08:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
struct capture_control *capture_control;
|
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Ptrace state: */
|
|
|
|
unsigned long ptrace_message;
|
2018-09-25 17:27:20 +08:00
|
|
|
kernel_siginfo_t *last_siginfo;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct task_io_accounting ioac;
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly what impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risking complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
/* Pressure stall state */
|
|
|
|
unsigned int psi_flags;
|
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
#ifdef CONFIG_TASK_XACCT
|
|
|
|
/* Accumulated RSS usage: */
|
|
|
|
u64 acct_rss_mem1;
|
|
|
|
/* Accumulated virtual memory usage: */
|
|
|
|
u64 acct_vm_mem1;
|
|
|
|
/* stime + utime since last update: */
|
|
|
|
u64 acct_timexpd;
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_CPUSETS
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Protected by ->alloc_lock: */
|
|
|
|
nodemask_t mems_allowed;
|
2021-03-18 20:38:50 +08:00
|
|
|
/* Sequence number to catch updates: */
|
2020-07-20 23:55:19 +08:00
|
|
|
seqcount_spinlock_t mems_allowed_seq;
|
2017-02-07 05:06:35 +08:00
|
|
|
int cpuset_mem_spread_rotor;
|
|
|
|
int cpuset_slab_spread_rotor;
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
|
|
|
#ifdef CONFIG_CGROUPS
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Control Group info protected by css_set_lock: */
|
|
|
|
struct css_set __rcu *cgroups;
|
|
|
|
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
|
|
|
|
struct list_head cg_list;
|
|
|
|
#endif
|
2019-01-30 06:44:36 +08:00
|
|
|
#ifdef CONFIG_X86_CPU_RESCTRL
|
2017-07-26 05:14:33 +08:00
|
|
|
u32 closid;
|
2017-07-26 05:14:34 +08:00
|
|
|
u32 rmid;
|
2016-10-29 06:04:46 +08:00
|
|
|
#endif
|
2007-10-17 14:27:30 +08:00
|
|
|
#ifdef CONFIG_FUTEX
|
2017-02-07 05:06:35 +08:00
|
|
|
struct robust_list_head __user *robust_list;
|
2006-03-27 17:16:24 +08:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
struct compat_robust_list_head __user *compat_robust_list;
|
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
struct list_head pi_state_list;
|
|
|
|
struct futex_pi_state *pi_state_cache;
|
2019-11-07 05:55:44 +08:00
|
|
|
struct mutex futex_exit_mutex;
|
2019-11-07 05:55:37 +08:00
|
|
|
unsigned int futex_state;
|
2008-05-15 19:09:15 +08:00
|
|
|
#endif
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 18:02:48 +08:00
|
|
|
#ifdef CONFIG_PERF_EVENTS
|
2017-02-07 05:06:35 +08:00
|
|
|
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
|
|
|
|
struct mutex perf_event_mutex;
|
|
|
|
struct list_head perf_event_list;
|
perf_counter: Dynamically allocate tasks' perf_counter_context struct
This replaces the struct perf_counter_context in the task_struct with
a pointer to a dynamically allocated perf_counter_context struct. The
main reason for doing this is to allow us to transfer a
perf_counter_context from one task to another when we do lazy PMU
switching in a later patch.
This has a few side-benefits: the task_struct becomes a little smaller,
we save some memory because only tasks that have perf_counters attached
get a perf_counter_context allocated for them, and we can remove the
inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
end up recompiling nearly everything whenever perf_counter.h changes.
The perf_counter_context structures are reference-counted and freed
when the last reference is dropped. A context can have references
from its task and the counters on its task. Counters can outlive the
task so it is possible that a context will be freed well after its
task has exited.
Contexts are allocated on fork if the parent had a context, or
otherwise the first time that a per-task counter is created on a task.
In the latter case, we set the context pointer in the task struct
locklessly using an atomic compare-and-exchange operation in case we
raced with some other task in creating a context for the subject task.
This also removes the task pointer from the perf_counter struct. The
task pointer was not used anywhere and would make it harder to move a
context from one task to another. Anything that needed to know which
task a counter was attached to was already using counter->ctx->task.
The __perf_counter_init_context function moves up in perf_counter.c
so that it can be called from find_get_context, and now initializes
the refcount, but is otherwise unchanged.
We were potentially calling list_del_counter twice: once from
__perf_counter_exit_task when the task exits and once from
__perf_counter_remove_from_context when the counter's fd gets closed.
This adds a check in list_del_counter so it doesn't do anything if
the counter has already been removed from the lists.
Since perf_counter_task_sched_in doesn't do anything if the task doesn't
have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
__perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
the case where the current task adds the first counter to itself and
thus creates a context for itself.
This also adds similar code to __perf_counter_enable to handle a
similar situation which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.
[ Impact: refactor counter context management to prepare for new feature ]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-22 12:17:31 +08:00
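The lockless context install described above reduces to an allocate-then-compare-and-exchange pattern. A minimal user-space sketch of that pattern follows, using C11 atomics in place of the kernel's cmpxchg(); the struct and function names are illustrative, not the perf code's.

#include <stdatomic.h>
#include <stdlib.h>

struct ctx { int refcount; };

/* Publish a freshly allocated context unless another thread raced us
 * and already installed one; the loser frees its copy and adopts the
 * winner's, which is the outcome the commit message describes. */
static struct ctx *install_ctx(_Atomic(struct ctx *) *slot)
{
	struct ctx *fresh = calloc(1, sizeof(*fresh));
	struct ctx *expected = NULL;

	if (!fresh)
		return NULL;
	fresh->refcount = 1;
	if (!atomic_compare_exchange_strong(slot, &expected, fresh)) {
		free(fresh);		/* raced and lost */
		return expected;	/* context installed by the winner */
	}
	return fresh;
}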
|
|
|
#endif
|
2014-02-08 03:58:39 +08:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long preempt_disable_ip;
|
2014-02-08 03:58:39 +08:00
|
|
|
#endif
|
2008-05-15 19:09:15 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Protected by alloc_lock: */
|
|
|
|
struct mempolicy *mempolicy;
|
2017-07-07 06:39:59 +08:00
|
|
|
short il_prev;
|
2017-02-07 05:06:35 +08:00
|
|
|
short pref_node_fork;
|
2007-10-17 14:27:30 +08:00
|
|
|
#endif
|
2012-10-25 20:16:43 +08:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2017-02-07 05:06:35 +08:00
|
|
|
int numa_scan_seq;
|
|
|
|
unsigned int numa_scan_period;
|
|
|
|
unsigned int numa_scan_period_max;
|
|
|
|
int numa_preferred_nid;
|
|
|
|
unsigned long numa_migrate_retry;
|
|
|
|
/* Migration stamp: */
|
|
|
|
u64 node_stamp;
|
|
|
|
u64 last_task_numa_placement;
|
|
|
|
u64 last_sum_exec_runtime;
|
|
|
|
struct callback_head numa_work;
|
|
|
|
|
2019-07-16 23:20:47 +08:00
|
|
|
/*
|
|
|
|
* This pointer is only modified for current in syscall and
|
|
|
|
* pagefault context (and for tasks being destroyed), so it can be read
|
|
|
|
* from any of the following contexts:
|
|
|
|
* - RCU read-side critical section
|
|
|
|
* - current->numa_group from everywhere
|
|
|
|
* - task's runqueue locked, task not running
|
|
|
|
*/
|
|
|
|
struct numa_group __rcu *numa_group;
|
2013-10-07 18:29:21 +08:00
|
|
|
|
2013-10-07 18:28:59 +08:00
|
|
|
/*
|
2014-10-31 08:13:31 +08:00
|
|
|
* numa_faults is an array split into four regions:
|
|
|
|
* faults_memory, faults_cpu, faults_memory_buffer, faults_cpu_buffer
|
|
|
|
* in this precise order.
|
|
|
|
*
|
|
|
|
* faults_memory: Exponential decaying average of faults on a per-node
|
|
|
|
* basis. Scheduling placement decisions are made based on these
|
|
|
|
* counts. The values remain static for the duration of a PTE scan.
|
|
|
|
* faults_cpu: Track the nodes the process was running on when a NUMA
|
|
|
|
* hinting fault was incurred.
|
|
|
|
* faults_memory_buffer and faults_cpu_buffer: Record faults per node
|
|
|
|
* during the current scan window. When the scan completes, the counts
|
|
|
|
* in faults_memory and faults_cpu decay and these values are copied.
|
2013-10-07 18:28:59 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long *numa_faults;
|
|
|
|
unsigned long total_numa_faults;
|
2013-10-07 18:28:59 +08:00
|
|
|
|
2013-10-07 18:29:36 +08:00
|
|
|
/*
|
|
|
|
* numa_faults_locality tracks if faults recorded during the last
|
2015-03-26 06:55:42 +08:00
|
|
|
* scan window were remote/local or failed to migrate. The task scan
|
|
|
|
* period is adapted based on the locality of the faults with different
|
|
|
|
* weights depending on whether they were shared or private faults
|
2013-10-07 18:29:36 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long numa_faults_locality[3];
|
2013-10-07 18:29:36 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long numa_pages_migrated;
|
2012-10-25 20:16:43 +08:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
|
|
|
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload's average response-time improves by 1-2%, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 20:43:54 +08:00
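As a rough illustration of the fast path this enables, here is a minimal sketch that registers a struct rseq area and reads the kernel-maintained cpu_id field directly. It assumes the uapi layout from <linux/rseq.h> and registers with a zero signature; note that recent glibc versions register their own area, in which case this explicit registration fails and the glibc-provided area should be used instead.

#include <linux/rseq.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static __thread struct rseq rs __attribute__((aligned(32)));

int main(void)
{
	/* rseq(rseq, rseq_len, flags, sig): register the per-thread area. */
	if (syscall(__NR_rseq, &rs, sizeof(rs), 0, 0))
		return 1;
	/* The kernel refreshes rs.cpu_id via TIF_NOTIFY_RESUME on return
	 * to user-space, so a plain load is all the fast path needs. */
	printf("running on cpu %u\n", rs.cpu_id);
	return 0;
}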
|
|
|
#ifdef CONFIG_RSEQ
|
|
|
|
struct rseq __user *rseq;
|
|
|
|
u32 rseq_sig;
|
|
|
|
/*
|
|
|
|
* RmW on rseq_event_mask must be performed atomically
|
|
|
|
* with respect to preemption.
|
|
|
|
*/
|
|
|
|
unsigned long rseq_event_mask;
|
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct tlbflush_unmap_batch tlb_ubc;
|
2015-09-05 06:47:32 +08:00
|
|
|
|
2019-09-14 20:33:34 +08:00
|
|
|
union {
|
|
|
|
refcount_t rcu_users;
|
|
|
|
struct rcu_head rcu;
|
|
|
|
};
|
2006-04-11 19:52:07 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Cache last used pipe for splice(): */
|
|
|
|
struct pipe_inode_info *splice_pipe;
|
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this by splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 07:04:42 +08:00
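A minimal sketch of the 'bigger pages if available, fall back under pressure' refill policy described above, written as plain user-space C with illustrative names (the kernel's real mechanism is the page_frag allocator, not this):

#include <stdlib.h>
#include <stddef.h>

struct frag_cache { void *buf; size_t size; size_t offset; };

/* Prefer an order-3-sized block (32 KiB) and halve the request on
 * failure until a single 4 KiB page is reached. */
static int frag_refill(struct frag_cache *c)
{
	for (size_t want = 32768; want >= 4096; want /= 2) {
		void *p = malloc(want);
		if (p) {
			c->buf = p;
			c->size = want;
			c->offset = 0;
			return 0;
		}
	}
	return -1;
}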
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
struct page_frag task_frag;
|
|
|
|
|
2017-02-02 01:00:26 +08:00
|
|
|
#ifdef CONFIG_TASK_DELAY_ACCT
|
|
|
|
struct task_delay_info *delays;
|
2006-12-08 18:39:47 +08:00
|
|
|
#endif
|
2017-02-02 01:00:26 +08:00
|
|
|
|
2006-12-08 18:39:47 +08:00
|
|
|
#ifdef CONFIG_FAULT_INJECTION
|
2017-02-07 05:06:35 +08:00
|
|
|
int make_it_fail;
|
2017-07-15 05:49:52 +08:00
|
|
|
unsigned int fail_nth;
|
2006-07-14 15:24:36 +08:00
|
|
|
#endif
|
writeback: per task dirty rate limit
Add two fields to task_struct.
1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility
The balance_dirty_pages() call interval (i.e. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.
The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so that the dirty threshold will be exceeded
long before each task reaches its nr_dirtied_pause and hence calls
balance_dirty_pages().
The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.
On the sqrt in dirty_poll_interval():
It will serve as an initial guess when dirty pages are still in the
freerun area.
When dirty pages are floating inside the dirty control scope [freerun,
limit], a followup patch will use some refined dirty poll interval to
get the desired pause time.
thresh-dirty (MB) sqrt
1 16
2 22
4 32
8 45
16 64
32 90
64 128
128 181
256 256
512 362
1024 512
The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
be exceeded as long as there are less than 16 (or 512) concurrent dd's.
So sqrt naturally leads to less overhead and allows more safe concurrent
tasks for large memory servers, which have large (thresh-freerun) gaps.
peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-12 08:10:12 +08:00
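The table above is simply the integer square root of the gap expressed in 4 KiB pages. A small stand-alone C check of that relationship (an illustration of the scaling only, not the kernel's dirty_poll_interval() code; link with -lm):

#include <math.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
	const unsigned long gaps_mb[] = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 };

	for (size_t i = 0; i < sizeof(gaps_mb) / sizeof(gaps_mb[0]); i++) {
		unsigned long pages = gaps_mb[i] * 256;	/* 256 x 4 KiB pages per MiB */
		printf("%4lu MB gap -> poll every ~%lu dirtied pages\n",
		       gaps_mb[i], (unsigned long)sqrt((double)pages));
	}
	return 0;
}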
|
|
|
/*
|
2017-02-07 05:06:35 +08:00
|
|
|
* When (nr_dirtied >= nr_dirtied_pause), it's time to call
|
|
|
|
* balance_dirty_pages() for a dirty throttling pause:
|
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
int nr_dirtied;
|
|
|
|
int nr_dirtied_pause;
|
|
|
|
/* Start of a write-and-pause period: */
|
|
|
|
unsigned long dirty_paused_when;
|
|
|
|
|
2008-01-26 04:08:34 +08:00
|
|
|
#ifdef CONFIG_LATENCYTOP
|
2017-02-07 05:06:35 +08:00
|
|
|
int latency_record_count;
|
|
|
|
struct latency_record latency_record[LT_SAVECOUNT];
|
2008-01-26 04:08:34 +08:00
|
|
|
#endif
|
2008-09-02 06:52:40 +08:00
|
|
|
/*
|
2017-02-07 05:06:35 +08:00
|
|
|
* Time slack values; these are used to round up poll() and
|
2008-09-02 06:52:40 +08:00
|
|
|
* select() etc timeout values. These are in nanoseconds.
|
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
u64 timer_slack_ns;
|
|
|
|
u64 default_timer_slack_ns;
|
2008-11-06 16:37:40 +08:00
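For reference, the per-task slack can be adjusted from user space with prctl(); a minimal sketch (the 50 us value is arbitrary, and passing 0 restores the default slack):

#include <sys/prctl.h>

int main(void)
{
	/* Allow poll()/select()-style timeouts to be rounded up by as
	 * much as 50 us for this task. */
	return prctl(PR_SET_TIMERSLACK, 50000UL, 0, 0, 0);
}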
|
|
|
|
2020-12-23 04:00:56 +08:00
|
|
|
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned int kasan_depth;
|
kasan: add kernel address sanitizer infrastructure
Kernel Address sanitizer (KASan) is a dynamic memory error detector. It
provides fast and comprehensive solution for finding use-after-free and
out-of-bounds bugs.
KASAN uses compile-time instrumentation for checking every memory access,
therefore GCC > v4.9.2 is required. v4.9.2 almost works, but has issues with
putting symbol aliases into the wrong section, which breaks kasan
instrumentation of globals.
This patch only adds infrastructure for kernel address sanitizer. It's
not available for use yet. The idea and some code was borrowed from [1].
Basic idea:
The main idea of KASAN is to use shadow memory to record whether each byte
of memory is safe to access or not, and use compiler's instrumentation to
check the shadow memory on each memory access.
Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
memory and uses direct mapping with a scale and offset to translate a
memory address to its corresponding shadow address.
Here is function to translate address to corresponding shadow address:
unsigned long kasan_mem_to_shadow(unsigned long addr)
{
return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
where KASAN_SHADOW_SCALE_SHIFT = 3.
So for every 8 bytes there is one corresponding byte of shadow memory.
The following encoding is used for each shadow byte: 0 means that all 8 bytes
of the corresponding memory region are valid for access; k (1 <= k <= 7)
means that the first k bytes are valid for access, and the other (8 - k) bytes
are not; any negative value indicates that the entire 8 bytes are
inaccessible. Different negative values are used to distinguish between
different kinds of inaccessible memory (redzones, freed memory) (see
mm/kasan/kasan.h).
To be able to detect accesses to bad memory we need a special compiler.
Such a compiler inserts specific function calls (__asan_load*(addr),
__asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.
These functions check whether the memory region is valid to access by
checking the corresponding shadow memory. If the access is not valid, an
error is printed.
Historical background of the address sanitizer from Dmitry Vyukov:
"We've developed the set of tools, AddressSanitizer (Asan),
ThreadSanitizer and MemorySanitizer, for user space. We actively use
them for testing inside of Google (continuous testing, fuzzing,
running prod services). To date the tools have found more than 10'000
scary bugs in Chromium, Google internal codebase and various
open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
lots of others): [2] [3] [4].
The tools are part of both gcc and clang compilers.
We have not yet done massive testing under the Kernel AddressSanitizer
(it's kind of chicken and egg problem, you need it to be upstream to
start applying it extensively). To date it has found about 50 bugs.
Bugs that we've found in upstream kernel are listed in [5].
We've also found ~20 bugs in our internal version of the kernel. Also
people from Samsung and Oracle have found some.
[...]
As others noted, the main feature of AddressSanitizer is its
performance due to inline compiler instrumentation and simple linear
shadow memory. User-space Asan has ~2x slowdown on computational
programs and ~2x memory consumption increase. Taking into account that
kernel usually consumes only small fraction of CPU and memory when
running real user-space programs, I would expect that kernel Asan will
have ~10-30% slowdown and similar memory consumption increase (when we
finish all tuning).
I agree that Asan can well replace kmemcheck. We have plans to start
working on Kernel MemorySanitizer that finds uses of uninitialized
memory. Asan+Msan will provide feature-parity with kmemcheck. As
others noted, Asan will unlikely replace debug slab and pagealloc that
can be enabled at runtime. Asan uses compiler instrumentation, so even
if it is disabled, it still incurs visible overheads.
Asan technology is easily portable to other architectures. Compiler
instrumentation is fully portable. Runtime has some arch-dependent
parts like shadow mapping and atomic operation interception. They are
relatively easy to port."
Comparison with other debugging features:
========================================
KMEMCHECK:
- KASan can do almost everything that kmemcheck can. KASan uses
compile-time instrumentation, which makes it significantly faster than
kmemcheck. The only advantage of kmemcheck over KASan is detection of
uninitialized memory reads.
Some brief performance testing showed that kasan could be
500-600 times faster than kmemcheck:
$ netperf -l 30
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
no debug: 87380 16384 16384 30.00 41624.72
kasan inline: 87380 16384 16384 30.00 12870.54
kasan outline: 87380 16384 16384 30.00 10586.39
kmemcheck: 87380 16384 16384 30.03 20.23
- Also, kmemcheck couldn't work with several CPUs; it always sets the
number of CPUs to 1. KASan doesn't have such a limitation.
DEBUG_PAGEALLOC:
- KASan is slower than DEBUG_PAGEALLOC, but KASan works at sub-page
granularity, so it is able to find more bugs.
SLUB_DEBUG (poisoning, redzones):
- SLUB_DEBUG has lower overhead than KASan.
- SLUB_DEBUG is in most cases not able to detect bad reads;
KASan is able to detect both reads and writes.
- In some cases (e.g. a redzone being overwritten) SLUB_DEBUG detects
bugs only on allocation/freeing of an object. KASan catches
bugs right before they happen, so we always know the exact
place of the first bad read/write.
[1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
[2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
[3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
[4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
[5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies
Based on work by Andrey Konovalov.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-14 06:39:17 +08:00
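Putting the shadow encoding above into code, here is a tiny stand-alone sketch of the validity check for a 1-byte access (illustrative only; the real checks live in mm/kasan/):

#include <stdbool.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3

/* shadow == 0: whole 8-byte granule valid; 1..7: only the first
 * `shadow' bytes are valid; negative: poisoned (redzone, freed, ...). */
static bool access_ok_1byte(uintptr_t addr, int8_t shadow)
{
	if (shadow == 0)
		return true;
	if (shadow < 0)
		return false;
	return (addr & ((1UL << KASAN_SHADOW_SCALE_SHIFT) - 1)) < (uintptr_t)shadow;
}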
|
|
|
#endif
|
2020-07-29 19:09:16 +08:00
|
|
|
|
2019-11-15 02:02:54 +08:00
|
|
|
#ifdef CONFIG_KCSAN
|
|
|
|
struct kcsan_ctx kcsan_ctx;
|
2020-07-29 19:09:16 +08:00
|
|
|
#ifdef CONFIG_TRACE_IRQFLAGS
|
|
|
|
struct irqtrace_events kcsan_save_irqtrace;
|
|
|
|
#endif
|
2021-08-05 20:57:45 +08:00
|
|
|
#ifdef CONFIG_KCSAN_WEAK_MEMORY
|
|
|
|
int kcsan_stack_depth;
|
|
|
|
#endif
|
2019-11-15 02:02:54 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2022-09-15 23:03:45 +08:00
|
|
|
#ifdef CONFIG_KMSAN
|
|
|
|
struct kmsan_ctx kmsan_ctx;
|
|
|
|
#endif
|
|
|
|
|
2020-10-14 07:54:58 +08:00
|
|
|
#if IS_ENABLED(CONFIG_KUNIT)
|
|
|
|
struct kunit *kunit_test;
|
|
|
|
#endif
|
|
|
|
|
2008-11-26 04:07:04 +08:00
|
|
|
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Index of current stored address in ret_stack: */
|
|
|
|
int curr_ret_stack;
|
function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
Currently, the depth of the ret_stack is determined by curr_ret_stack index.
The issue is that there's a race between setting of the curr_ret_stack and
calling of the callback attached to the return of the function.
Commit 03274a3ffb44 ("tracing/fgraph: Adjust fgraph depth before calling
trace return callback") moved the calling of the callback to after the
setting of the curr_ret_stack, even stating that it was safe to do so, when
in fact, it was the reason there was a barrier() there (yes, I should have
commented that barrier()).
Not only does the curr_ret_stack keep track of the current call graph depth,
it also keeps the ret_stack content from being overwritten by new data.
The function profiler uses the "subtime" variable of the ret_stack structure
and by moving the curr_ret_stack, it allows for interrupts to use the same
structure it was using, corrupting the data, and breaking the profiler.
To fix this, there needs to be two variables to handle the call stack depth
and the pointer to where the ret_stack is being used, as they need to change
at two different locations.
Cc: stable@kernel.org
Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-11-19 21:07:12 +08:00
|
|
|
int curr_ret_depth;
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
/* Stack of return addresses for return function tracing: */
|
|
|
|
struct ftrace_ret_stack *ret_stack;
|
|
|
|
|
|
|
|
/* Timestamp for last schedule: */
|
|
|
|
unsigned long long ftrace_timestamp;
|
|
|
|
|
2008-11-23 13:22:56 +08:00
|
|
|
/*
|
|
|
|
* Number of functions that haven't been traced
|
2017-02-07 05:06:35 +08:00
|
|
|
* because of depth overrun:
|
2008-11-23 13:22:56 +08:00
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
atomic_t trace_overrun;
|
|
|
|
|
|
|
|
/* Pause tracing: */
|
|
|
|
atomic_t tracing_graph_pause;
|
2008-11-23 13:22:56 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2008-12-04 04:36:57 +08:00
|
|
|
#ifdef CONFIG_TRACING
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Bitmask and counter of trace recursion: */
|
|
|
|
unsigned long trace_recursion;
|
tracing: add same level recursion detection
The tracing infrastructure allows for recursion. That is, an interrupt
may interrupt the act of tracing an event, and that interrupt may very well
perform its own trace. This is a recursive trace, and is fine to do.
The problem arises when there is a bug, and the utility doing the trace
calls something that recurses back into the tracer. This recursion is not
caused by an external event like an interrupt, but by code that is not
expected to recurse. The result could be a lockup.
This patch adds a bitmask to the task structure that keeps track
of the trace recursion. To find the interrupt depth, the following
algorithm is used:
level = hardirq_count() + softirq_count() + in_nmi;
Here, level will be the depth of interrupts and softirqs, and even handles
the nmi. Then the corresponding bit is set in the recursion bitmask.
If the bit was already set, we know we had a recursion at the same level
and we warn about it and fail the writing to the buffer.
After the data has been committed to the buffer, we clear the bit.
No atomics are needed. The only races are with interrupts and they reset
the bitmask before returning anyway.
[ Impact: detect same irq level trace recursion ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 09:41:52 +08:00
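A minimal sketch of that per-level guard, with illustrative helper names (the kernel's real implementation lives in kernel/trace/ and uses its own macros):

/* Refuse to write if this interrupt level is already tracing. */
static inline int trace_recursion_enter(unsigned long *mask, int level)
{
	if (*mask & (1UL << level))
		return 0;		/* same-level recursion: fail the write */
	*mask |= (1UL << level);	/* mark this level busy */
	return 1;
}

/* Clear the bit once the data has been committed to the buffer. */
static inline void trace_recursion_exit(unsigned long *mask, int level)
{
	*mask &= ~(1UL << level);
}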
|
|
|
#endif /* CONFIG_TRACING */
|
2017-02-07 05:06:35 +08:00
|
|
|
|
kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is a function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts, and instrumentation of some inherently non-deterministic or
non-interesting parts of the kernel is disabled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on the kernel side. The complementary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov? A typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-23 05:27:30 +08:00
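As a rough sketch of that reset/execute/collect loop from user space, assuming the debugfs ioctl interface documented in Documentation/dev-tools/kcov.rst (constants from <linux/kcov.h>); error handling is minimal and the traced read() call is only a placeholder:

#include <linux/kcov.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define COVER_SIZE (64 << 10)	/* number of unsigned long slots */

int main(void)
{
	int fd = open("/sys/kernel/debug/kcov", O_RDWR);
	unsigned long *cover, n, i;

	if (fd < 0)
		return 1;
	if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
		return 1;
	cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
		     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (cover == MAP_FAILED)
		return 1;
	if (ioctl(fd, KCOV_ENABLE, KCOV_TRACE_PC))
		return 1;
	__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);	/* (1) reset */
	read(-1, NULL, 0);					/* (2) code under test */
	n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);	/* (3) collect */
	for (i = 0; i < n; i++)
		printf("0x%lx\n", cover[i + 1]);
	ioctl(fd, KCOV_DISABLE, 0);
	return 0;
}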
|
|
|
#ifdef CONFIG_KCOV
|
kcov: remote coverage support
Patch series " kcov: collect coverage from usb and vhost", v3.
This patchset extends kcov to allow collecting coverage from backgound
kernel threads. This extension requires custom annotations for each of
the places where coverage collection is desired. This patchset
implements this for hub events in the USB subsystem and for vhost
workers. See the first patch description for details about the kcov
extension. The other two patches apply this kcov extension to USB and
vhost.
Examples of other subsystems that might potentially benefit from this
when custom annotations are added (the list is based on
process_one_work() callers for bugs recently reported by syzbot):
1. fs: writeback wb_workfn() worker,
2. net: addrconf_dad_work()/addrconf_verify_work() workers,
3. net: neigh_periodic_work() worker,
4. net/p9: p9_write_work()/p9_read_work() workers,
5. block: blk_mq_run_work_fn() worker.
These patches have been used to enable coverage-guided USB fuzzing with
syzkaller for the last few years, see the details here:
https://github.com/google/syzkaller/blob/master/docs/linux/external_fuzzing_usb.md
This patchset has been pushed to the public Linux kernel Gerrit
instance:
https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/1524
This patch (of 3):
Add background thread coverage collection ability to kcov.
With KCOV_ENABLE coverage is collected only for syscalls that are issued
from the current process. With KCOV_REMOTE_ENABLE it's possible to
collect coverage for arbitrary parts of the kernel code, provided that
those parts are annotated with kcov_remote_start()/kcov_remote_stop().
This allows collecting coverage from two types of kernel background
threads: the global ones, that are spawned during kernel boot in a
limited number of instances (e.g. one USB hub_event() worker thread is
spawned per USB HCD); and the local ones, that are spawned when a user
interacts with some kernel interface (e.g. vhost workers).
To enable collecting coverage from a global background thread, a unique
global handle must be assigned and passed to the corresponding
kcov_remote_start() call. Then a userspace process can pass a list of
such handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field
of the kcov_remote_arg struct. This will attach the used kcov device to
the code sections, that are referenced by those handles.
Since there might be many local background threads spawned from
different userspace processes, we can't use a single global handle per
annotation. Instead, the userspace process passes a non-zero handle
through the common_handle field of the kcov_remote_arg struct. This
common handle gets saved to the kcov_handle field in the current
task_struct and needs to be passed to the newly spawned threads via
custom annotations. Those threads should in turn be annotated with
kcov_remote_start()/kcov_remote_stop().
Internally kcov stores handles as u64 integers. The top byte of a
handle is used to denote the id of a subsystem that this handle belongs
to, and the lower 4 bytes are used to denote the id of a thread instance
within that subsystem. A reserved value 0 is used as a subsystem id for
common handles as they don't belong to a particular subsystem. The
bytes 4-6 are currently reserved and must be zero. In the future the
number of bytes used for the subsystem or handle ids might be increased.
When a particular userspace process collects coverage via a common
handle, kcov will collect coverage for each code section that is
annotated to use the common handle obtained as kcov_handle from the
current task_struct. However, non-common handles allow collecting
coverage selectively from different subsystems.
Link: http://lkml.kernel.org/r/e90e315426a384207edbec1d6aa89e43008e4caf.1572366574.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 08:52:43 +08:00
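A tiny sketch of that handle layout (subsystem id in the top byte, instance id in the low 32 bits, the bytes in between left zero); the helper name is illustrative:

#include <stdint.h>

static inline uint64_t kcov_make_handle(uint8_t subsys, uint32_t instance)
{
	/* top byte: subsystem id; low 4 bytes: instance id; rest zero */
	return ((uint64_t)subsys << 56) | (uint64_t)instance;
}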
|
|
|
/* See kernel/kcov.c for more details. */
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Coverage collection mode enabled for this task (0 if disabled): */
|
sched/core / kcov: avoid kcov_area during task switch
During a context switch, we first switch_mm() to the next task's mm,
then switch_to() that new task. This means that vmalloc'd regions which
had previously been faulted in can transiently disappear in the context
of the prev task.
Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
during this window, and as the fault handling code is instrumented, this
results in a recursive fault.
We must avoid accessing any kcov_area during this window. We can do so
with a new flag in kcov_mode, set prior to switching the mm, and cleared
once the new task is live. Since task_struct::kcov_mode isn't always a
specific enum kcov_mode value, this is made an unsigned int.
The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
which are empty for !CONFIG_KCOV kernels.
The code uses macros because I can't use static inline functions without a
circular include dependency between <linux/sched.h> and <linux/kcov.h>,
since the definition of task_struct uses things defined in <linux/kcov.h>
Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 06:27:41 +08:00
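A minimal sketch of that guard-flag idea, with an illustrative bit value and names (the real helpers are the kcov_prepare_switch()/kcov_finish_switch() macros mentioned above):

#define KCOV_IN_CTXSW_SKETCH	(1u << 30)	/* illustrative flag bit */

/* Set before switching the mm so coverage callbacks skip kcov_area. */
static inline void kcov_prepare_switch_sketch(unsigned int *kcov_mode)
{
	*kcov_mode |= KCOV_IN_CTXSW_SKETCH;
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
}

/* Clear once the new task is live and its mappings are usable again. */
static inline void kcov_finish_switch_sketch(unsigned int *kcov_mode)
{
	__asm__ __volatile__("" ::: "memory");
	*kcov_mode &= ~KCOV_IN_CTXSW_SKETCH;
}

/* The instrumentation callback then returns early if the flag is set
 * instead of touching a possibly-unmapped kcov_area. */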
|
|
|
unsigned int kcov_mode;
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
/* Size of the kcov_area: */
|
|
|
|
unsigned int kcov_size;
|
|
|
|
|
|
|
|
/* Buffer for coverage collection: */
|
|
|
|
void *kcov_area;
|
|
|
|
|
|
|
|
/* KCOV descriptor wired with this task or NULL: */
|
|
|
|
struct kcov *kcov;
|
|
|
|
|
|
|
|
/* KCOV common handle for remote coverage collection: */
|
|
|
|
u64 kcov_handle;
|
|
|
|
|
|
|
|
/* KCOV sequence number: */
|
|
|
|
int kcov_sequence;
|
2020-06-05 07:46:04 +08:00
|
|
|
|
|
|
|
/* Collect coverage from softirq context: */
|
|
|
|
unsigned int kcov_softirq;
|
kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is a function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of the kernel is disabled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on the kernel side. The complementary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov? A typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen basic blocks (e.g. an invalid
input). In such a context gcov becomes prohibitively expensive, as the
reset/collect coverage steps depend on the total number of basic
blocks/edges in the program (in the case of the kernel it is about 2M).
The cost of kcov depends only on the number of executed basic
blocks/edges. On top of that, the kernel requires per-thread coverage
because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user space, which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
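To illustrate the reset/execute/collect loop, a minimal userspace
harness looks roughly like the sketch below (along the lines of the
usage example in the accompanying documentation; error handling is
omitted for brevity):

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
  #define KCOV_ENABLE     _IO('c', 100)
  #define KCOV_DISABLE    _IO('c', 101)
  #define COVER_SIZE      (64 << 10)

  int main(void)
  {
          int fd = open("/sys/kernel/debug/kcov", O_RDWR);
          unsigned long *cover, n, i;

          ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
          cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          ioctl(fd, KCOV_ENABLE, 0);
          __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);  /* (1) reset coverage */
          read(-1, NULL, 0);                                 /* (2) execute a bit of code */
          n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);  /* (3) collect coverage */
          for (i = 0; i < n; i++)
                  printf("0x%lx\n", cover[i + 1]);
          ioctl(fd, KCOV_DISABLE, 0);
          return 0;
  }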
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-23 05:27:30 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2014-12-13 08:55:15 +08:00
|
|
|
#ifdef CONFIG_MEMCG
|
2017-02-07 05:06:35 +08:00
|
|
|
struct mem_cgroup *memcg_in_oom;
|
|
|
|
gfp_t memcg_oom_gfp_mask;
|
|
|
|
int memcg_oom_order;
|
memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped. If a process performs only
speculative allocations, it can blow way past the high limit. This is
actually easily reproducible by simply doing "find /". The slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.
This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path. If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.
As long as the kernel doesn't have a runaway allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.
- All over-high reclaims can use GFP_KERNEL regardless of the specific
gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
- It copes with priority inversion. Previously, a low-priority task with
small memory.high might perform over-high reclaim with a bunch of
locks held. If a higher prio task needed any of these locks, it
would have to wait until the low prio task finished reclaim and
released the locks. By handing over-high reclaim to the task exit
path this issue can be avoided.
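In rough, simplified form (a sketch of the idea, not the exact
mm/memcontrol.c code), the punt looks like this:

  /* In try_charge(), after the charge has succeeded: */
  do {
          if (page_counter_read(&memcg->memory) > memcg->high) {
                  /* Remember the overage, ask to be called on the way out. */
                  current->memcg_nr_pages_over_high += nr_pages;
                  set_notify_resume(current);
                  break;
          }
  } while ((memcg = parent_mem_cgroup(memcg)));

  /* Invoked from the return-to-userland path: */
  void mem_cgroup_handle_over_high(void)
  {
          unsigned int nr_pages = current->memcg_nr_pages_over_high;
          struct mem_cgroup *memcg;

          if (likely(!nr_pages))
                  return;

          memcg = get_mem_cgroup_from_mm(current->mm);
          /* GFP_KERNEL here, regardless of the gfp mask used at charge time. */
          try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL, true);
          css_put(&memcg->css);
          current->memcg_nr_pages_over_high = 0;
  }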
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:46:11 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Number of pages to reclaim on returning to userland: */
|
|
|
|
unsigned int memcg_nr_pages_over_high;
|
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no listener or a slow one. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no listener or a slow one. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account the connector to, since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch also rearranges the members of fsnotify_group, filling the
holes, to keep its size the same, at least for a 64-bit build, even with
the additional member.
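As a sketch of how an event producer is expected to use the scope API
(the function and cache names below are invented for illustration; only
memalloc_use_memcg()/memalloc_unuse_memcg() and group->memcg come from
this series):

  struct example_event *example_alloc_event(struct fsnotify_group *group)
  {
          struct example_event *event;

          /* Charge the allocation to the listener's memcg, not the producer's. */
          memalloc_use_memcg(group->memcg);
          event = kmem_cache_alloc(example_event_cachep, GFP_KERNEL_ACCOUNT);
          memalloc_unuse_memcg();

          return event;
  }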
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
|
|
|
|
|
|
|
/* Used by memcontrol for targeted memcg charge: */
|
|
|
|
struct mem_cgroup *active_memcg;
|
2009-12-16 08:47:03 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2018-07-03 23:14:55 +08:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
|
|
struct request_queue *throttle_queue;
|
|
|
|
#endif
|
|
|
|
|
uprobes/core: Handle breakpoint and singlestep exceptions
Uprobes uses exception notifiers to get to know if a thread hit
a breakpoint or a singlestep exception.
When a thread hits a uprobe or is singlestepping post a uprobe
hit, the uprobe exception notifier sets its TIF_UPROBE bit,
which will then be checked on its return to userspace path
(do_notify_resume() ->uprobe_notify_resume()), where the
consumers' handlers are run (in task context) based on the
defined filters.
Uprobe hits are thread specific and hence we need to maintain
information about whether a task hit a uprobe, which uprobe was hit, and
the slot where the original instruction was copied for xol so
that it can be singlestepped with appropriate fixups.
In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific
artefacts, such as handling RIP relative instructions on x86_64.
Since the instruction at which the uprobe was inserted is
executed out of line, architecture specific fixups are added so
that the thread continues normal execution in the presence of a
uprobe.
Postpone the signals until we execute the probed insn.
post_xol() path does a recalc_sigpending() before return to
user-mode, this ensures the signal can't be lost.
Uprobes relies on DIE_DEBUG notification to notify if a
singlestep is complete.
Adds x86 specific uprobe exception notifiers and appropriate
hooks needed to determine a uprobe hit and subsequent post
processing.
Add requisite x86 fixups for xol for uprobes. Specific cases
needing fixups include relative jumps (x86_64), calls, etc.
Where possible, we check and skip singlestepping the
breakpointed instructions. For now we skip single byte as well
as a few multibyte nop instructions. However this can be extended
to other instructions too.
Credits to Oleg Nesterov for suggestions/patches related to
signal, breakpoint, singlestep handling code.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
[ Performed various cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 02:00:11 +08:00
|
|
|
#ifdef CONFIG_UPROBES
|
2017-02-07 05:06:35 +08:00
|
|
|
struct uprobe_task *utask;
|
uprobes/core: Handle breakpoint and singlestep exceptions
Uprobes uses exception notifiers to get to know if a thread hit
a breakpoint or a singlestep exception.
When a thread hits a uprobe or is singlestepping post a uprobe
hit, the uprobe exception notifier sets its TIF_UPROBE bit,
which will then be checked on its return to userspace path
(do_notify_resume() ->uprobe_notify_resume()), where the
consumers' handlers are run (in task context) based on the
defined filters.
Uprobe hits are thread specific and hence we need to maintain
information about whether a task hit a uprobe, which uprobe was hit, and
the slot where the original instruction was copied for xol so
that it can be singlestepped with appropriate fixups.
In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific
artefacts, such as handling RIP relative instructions on x86_64.
Since the instruction at which the uprobe was inserted is
executed out of line, architecture specific fixups are added so
that the thread continues normal execution in the presence of a
uprobe.
Postpone the signals until we execute the probed insn.
post_xol() path does a recalc_sigpending() before return to
user-mode, this ensures the signal can't be lost.
Uprobes relies on DIE_DEBUG notification to notify if a
singlestep is complete.
Adds x86 specific uprobe exception notifiers and appropriate
hooks needed to determine a uprobe hit and subsequent post
processing.
Add requisite x86 fixups for xol for uprobes. Specific cases
needing fixups include relative jumps (x86_64), calls, etc.
Where possible, we check and skip singlestepping the
breakpointed instructions. For now we skip single byte as well
as a few multibyte nop instructions. However this can be extended
to other instructions too.
Credits to Oleg Nesterov for suggestions/patches related to
signal, breakpoint, singlestep handling code.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
[ Performed various cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 02:00:11 +08:00
|
|
|
#endif
|
2013-03-24 07:11:31 +08:00
|
|
|
#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned int sequential_io;
|
|
|
|
unsigned int sequential_io_avg;
|
2013-03-24 07:11:31 +08:00
|
|
|
#endif
|
2020-11-19 03:48:43 +08:00
|
|
|
struct kmap_ctrl kmap_ctrl;
|
2014-09-24 16:18:55 +08:00
|
|
|
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
|
2017-02-07 05:06:35 +08:00
|
|
|
unsigned long task_state_change;
|
2021-08-16 05:27:44 +08:00
|
|
|
# ifdef CONFIG_PREEMPT_RT
|
|
|
|
unsigned long saved_state_change;
|
|
|
|
# endif
|
2014-09-24 16:18:55 +08:00
|
|
|
#endif
|
2017-02-07 05:06:35 +08:00
|
|
|
int pagefault_disabled;
|
2016-03-26 05:20:33 +08:00
|
|
|
#ifdef CONFIG_MMU
|
2017-02-07 05:06:35 +08:00
|
|
|
struct task_struct *oom_reaper_list;
|
2022-04-22 07:36:01 +08:00
|
|
|
struct timer_list oom_reaper_timer;
|
2016-03-26 05:20:33 +08:00
|
|
|
#endif
|
2016-08-11 17:35:21 +08:00
|
|
|
#ifdef CONFIG_VMAP_STACK
|
2017-02-07 05:06:35 +08:00
|
|
|
struct vm_struct *stack_vm_area;
|
2016-08-11 17:35:21 +08:00
|
|
|
#endif
|
2016-09-16 13:45:48 +08:00
|
|
|
#ifdef CONFIG_THREAD_INFO_IN_TASK
|
2017-02-07 05:06:35 +08:00
|
|
|
/* A live task holds one reference: */
|
2019-01-18 20:27:30 +08:00
|
|
|
refcount_t stack_refcount;
|
livepatch: change to a per-task consistency model
Change livepatch to use a basic per-task consistency model. This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics. This is the
biggest remaining piece needed to make livepatch more generally useful.
This code stems from the design proposal made by Vojtech [1] in November
2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching. There are also a number of fallback options which make
it quite flexible.
Patches are applied on a per-task basis, when the task is deemed safe to
switch over. When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds. The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.
An interrupt handler inherits the patched state of the task it
interrupts. The same is true for forked tasks: the child inherits the
patched state of the parent.
Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:
1. The first and most effective approach is stack checking of sleeping
tasks. If no affected functions are on the stack of a given task,
the task is patched. In most cases this will patch most or all of
the tasks on the first try. Otherwise it'll keep trying
periodically. This option is only available if the architecture has
reliable stacks (HAVE_RELIABLE_STACKTRACE).
2. The second approach, if needed, is kernel exit switching. A
task is switched when it returns to user space from a system call, a
user space IRQ, or a signal. It's useful in the following cases:
a) Patching I/O-bound user tasks which are sleeping on an affected
function. In this case you have to send SIGSTOP and SIGCONT to
force it to exit the kernel and be patched.
b) Patching CPU-bound user tasks. If the task is highly CPU-bound
then it will get patched the next time it gets interrupted by an
IRQ.
c) In the future it could be useful for applying patches for
architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
this case you would have to signal most of the tasks on the
system. However this isn't supported yet because there's
currently no way to patch kthreads without
HAVE_RELIABLE_STACKTRACE.
3. For idle "swapper" tasks, since they don't ever exit the kernel, they
instead have a klp_update_patch_state() call in the idle loop which
allows them to be patched before the CPU enters the idle state.
(Note there's not yet such an approach for kthreads.)
All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately. This can be useful if the patch doesn't
change any function or data semantics. Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.
There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency. This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.
For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately. This option should be used with care, only when the patch
doesn't change any function or data semantics.
In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.
The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition. Only a single patch (the topmost patch on the stack)
can be in transition at a given time. A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.
A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress. Then all the tasks will attempt to
converge back to the original patch state.
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org> # for the scheduler changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-02-14 09:42:40 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_LIVEPATCH
|
|
|
|
int patch_state;
|
2017-05-03 23:50:52 +08:00
|
|
|
#endif
|
2017-03-24 19:46:33 +08:00
|
|
|
#ifdef CONFIG_SECURITY
|
|
|
|
/* Used by LSM modules for access restriction: */
|
|
|
|
void *security;
|
2016-09-16 13:45:48 +08:00
|
|
|
#endif
|
2021-02-26 07:43:14 +08:00
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
|
|
|
/* Used by BPF task local storage */
|
|
|
|
struct bpf_local_storage __rcu *bpf_storage;
|
bpf: Add ambient BPF runtime context stored in current
b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed the problem with cgroup-local storage use in BPF by
pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
possible BPF program preemptions and nested executions.
While this seems to work well in practice, it introduces a new and unnecessary
failure mode in which not all BPF programs might be executed if we fail to
find an unused slot for cgroup storage, however unlikely it is. It might also
not be so unlikely when/if we allow sleepable cgroup BPF programs in the
future.
Further, the way that cgroup storage is implemented as ambiently-available
property during entire BPF program execution is a convenient way to pass extra
information to BPF program and helpers without requiring user code to pass
around extra arguments explicitly. So it would be good to have a generic
solution that can allow implementing this without arbitrary restrictions.
Ideally, such solution would work for both preemptable and sleepable BPF
programs in exactly the same way.
This patch introduces such solution, bpf_run_ctx. It adds one pointer field
(bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
macros in such a way that it always stays valid throughout BPF program
execution. BPF program preemption is handled by remembering previous
current->bpf_ctx value locally while executing nested BPF program and
restoring old value after nested BPF program finishes. This is handled by two
helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
supposed to be used before and after BPF program runs, respectively.
Restoring the old value of the pointer handles preemption, while the bpf_run_ctx
pointer being a property of current task_struct naturally solves this problem
for sleepable BPF programs by "following" BPF program execution as it is
scheduled in and out of CPU. It would even allow CPU migration of BPF
programs, even though it's not currently allowed by BPF infra.
This patch cleans up cgroup local storage handling as a first application. The
design itself is generic, though, with bpf_run_ctx being an empty struct that
is supposed to be embedded into a specific struct for a given BPF program type
(bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
this mechanism for other uses within tracing BPF programs.
To verify that this change doesn't revert the fix to the original cgroup
storage issue, I ran the same repro as in the original report ([0]) and didn't
get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
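Schematically, the save/restore around a (possibly nested) program run
looks like the sketch below; the real BPF_PROG_RUN_ARRAY-style macros in
include/linux/bpf.h carry more state, so treat this as an outline only:

  struct bpf_cg_run_ctx run_ctx;
  struct bpf_run_ctx *old_run_ctx;
  u32 ret;

  run_ctx.prog_item = item;                /* program-type specific payload */
  old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
  ret = BPF_PROG_RUN(prog, ctx);           /* helpers may consult current->bpf_ctx */
  bpf_reset_run_ctx(old_run_ctx);          /* restore the outer context on nesting */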
Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
2021-07-13 07:06:15 +08:00
|
|
|
/* Used for BPF run context */
|
|
|
|
struct bpf_run_ctx *bpf_ctx;
|
2021-02-26 07:43:14 +08:00
|
|
|
#endif
|
2017-04-06 13:43:33 +08:00
|
|
|
|
2018-08-17 06:16:58 +08:00
|
|
|
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
|
|
|
|
unsigned long lowest_stack;
|
2018-08-17 06:17:01 +08:00
|
|
|
unsigned long prev_lowest_stack;
|
2018-08-17 06:16:58 +08:00
|
|
|
#endif
|
|
|
|
|
2020-02-19 17:22:06 +08:00
|
|
|
#ifdef CONFIG_X86_MCE
|
2020-10-07 05:09:09 +08:00
|
|
|
void __user *mce_vaddr;
|
|
|
|
__u64 mce_kflags;
|
2020-02-19 17:22:06 +08:00
|
|
|
u64 mce_addr;
|
2020-05-21 00:35:46 +08:00
|
|
|
__u64 mce_ripv : 1,
|
|
|
|
mce_whole_page : 1,
|
|
|
|
__mce_reserved : 62;
|
2020-02-19 17:22:06 +08:00
|
|
|
struct callback_head mce_kill_me;
|
2021-09-14 05:52:39 +08:00
|
|
|
int mce_count;
|
2020-02-19 17:22:06 +08:00
|
|
|
#endif
|
|
|
|
|
2020-08-29 21:03:24 +08:00
|
|
|
#ifdef CONFIG_KRETPROBES
|
|
|
|
struct llist_head kretprobe_instances;
|
|
|
|
#endif
|
2022-03-15 22:00:50 +08:00
|
|
|
#ifdef CONFIG_RETHOOK
|
|
|
|
struct llist_head rethooks;
|
|
|
|
#endif
|
2020-08-29 21:03:24 +08:00
|
|
|
|
2021-04-27 03:59:11 +08:00
|
|
|
#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
|
|
|
|
/*
|
|
|
|
* If L1D flush is supported on mm context switch
|
|
|
|
* then we use this callback head to queue kill work
|
|
|
|
* to kill tasks that are not running on SMT disabled
|
|
|
|
* cores
|
|
|
|
*/
|
|
|
|
struct callback_head l1d_flush_kill;
|
|
|
|
#endif
|
|
|
|
|
rv: Add Runtime Verification (RV) interface
RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.
RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling the reaction for unexpected
events, avoiding, for example, the propagation of a failure on
safety-critical systems.
The development of this interface is rooted in the development of the
paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
And:
De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.
The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.
It presents these files:
"available_monitors"
- List the available monitors, one per line.
For example:
# cat available_monitors
wip
wwnr
"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.
For example:
# cat enabled_monitors
# echo wip > enabled_monitors
# echo wwnr >> enabled_monitors
# cat enabled_monitors
wip
wwnr
# echo '!wip' >> enabled_monitors
# cat enabled_monitors
wwnr
# echo > enabled_monitors
# cat enabled_monitors
#
Note that more than one monitor can be enabled concurrently.
"monitoring_on"
- It is an on/off general switcher for monitoring. Note
that it does not disable enabled monitors or detach events,
but stops the per-entity monitors from monitoring the events
received from the system. It resembles the "tracing_on" switcher.
"monitors/"
Each monitor will have its own directory inside "monitors/". There
the monitor specific files will be presented.
The "monitors/" directory resembles the "events" directory on
tracefs.
For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0
For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.
Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-07-29 17:38:40 +08:00
|
|
|
#ifdef CONFIG_RV
|
|
|
|
/*
|
|
|
|
* Per-task RV monitor. Nowadays fixed in RV_PER_TASK_MONITORS.
|
|
|
|
* If we find justification for more monitors, we can think
|
|
|
|
* about adding more or developing a dynamic method. So far,
|
|
|
|
* none of these are justified.
|
|
|
|
*/
|
|
|
|
union rv_task_monitor rv[RV_PER_TASK_MONITORS];
|
|
|
|
#endif
|
|
|
|
|
2017-04-06 13:43:33 +08:00
|
|
|
/*
|
|
|
|
* New fields for task_struct should be added above here, so that
|
|
|
|
* they are included in the randomized portion of task_struct.
|
|
|
|
*/
|
|
|
|
randomized_struct_fields_end
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* CPU-specific state of this task: */
|
|
|
|
struct thread_struct thread;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* WARNING: on x86, 'thread_struct' contains a variable-sized
|
|
|
|
* structure. It *MUST* be at the end of 'task_struct'.
|
|
|
|
*
|
|
|
|
* Do not put anything below here!
|
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2007-10-26 16:17:22 +08:00
|
|
|
static inline struct pid *task_pid(struct task_struct *task)
|
2006-10-02 17:17:09 +08:00
|
|
|
{
|
2017-09-27 02:06:43 +08:00
|
|
|
return task->thread_pid;
|
2006-10-02 17:17:09 +08:00
|
|
|
}
|
|
|
|
|
2007-10-19 14:40:06 +08:00
|
|
|
/*
|
|
|
|
* the helpers to get the task's different pids as they are seen
|
|
|
|
* from various namespaces
|
|
|
|
*
|
|
|
|
* task_xid_nr() : global id, i.e. the id seen from the init namespace;
|
2008-02-08 20:19:15 +08:00
|
|
|
* task_xid_vnr() : virtual id, i.e. the id seen from the pid namespace of
|
|
|
|
* current.
|
2007-10-19 14:40:06 +08:00
|
|
|
* task_xid_nr_ns() : id seen from the ns specified;
|
|
|
|
*
|
|
|
|
* see also pid_nr() etc in include/linux/pid.h
|
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type, struct pid_namespace *ns);
|
2007-10-19 14:40:06 +08:00
|
|
|
|
2007-10-26 16:17:22 +08:00
|
|
|
static inline pid_t task_pid_nr(struct task_struct *tsk)
|
2007-10-19 14:40:06 +08:00
|
|
|
{
|
|
|
|
return tsk->pid;
|
|
|
|
}
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline pid_t task_pid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
{
|
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_PID, ns);
|
|
|
|
}
|
2007-10-19 14:40:06 +08:00
|
|
|
|
|
|
|
static inline pid_t task_pid_vnr(struct task_struct *tsk)
|
|
|
|
{
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_PID, NULL);
|
2007-10-19 14:40:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2007-10-26 16:17:22 +08:00
|
|
|
static inline pid_t task_tgid_nr(struct task_struct *tsk)
|
2007-10-19 14:40:06 +08:00
|
|
|
{
|
|
|
|
return tsk->tgid;
|
|
|
|
}
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/**
|
|
|
|
* pid_alive - check that a task structure is not stale
|
|
|
|
* @p: Task structure to be checked.
|
|
|
|
*
|
|
|
|
* Test if a process is not yet dead (at most zombie state)
|
|
|
|
* If pid_alive fails, then pointers within the task structure
|
|
|
|
* can be stale and must not be dereferenced.
|
|
|
|
*
|
|
|
|
* Return: 1 if the process is alive. 0 otherwise.
|
|
|
|
*/
|
|
|
|
static inline int pid_alive(const struct task_struct *p)
|
|
|
|
{
|
2017-09-27 02:06:43 +08:00
|
|
|
return p->thread_pid != NULL;
|
2017-02-07 05:06:35 +08:00
|
|
|
}
|
2007-10-19 14:40:06 +08:00
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline pid_t task_pgrp_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
|
2007-10-19 14:40:06 +08:00
|
|
|
{
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_PGID, ns);
|
2007-10-19 14:40:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline pid_t task_pgrp_vnr(struct task_struct *tsk)
|
|
|
|
{
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_PGID, NULL);
|
2007-10-19 14:40:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline pid_t task_session_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
|
2007-10-19 14:40:06 +08:00
|
|
|
{
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_SID, ns);
|
2007-10-19 14:40:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline pid_t task_session_vnr(struct task_struct *tsk)
|
|
|
|
{
|
pids: refactor vnr/nr_ns helpers to make them safe
Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.
And almost every helper has a callsite which needs a fix.
Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.
After this I'll send another patch which converts task_tgid_xxx() as well,
they are a bit special.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 07:58:38 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_SID, NULL);
|
2007-10-19 14:40:06 +08:00
|
|
|
}
|
|
|
|
|
2017-08-21 23:35:02 +08:00
|
|
|
static inline pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
|
|
|
|
{
|
2017-06-04 17:32:13 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
|
2017-08-21 23:35:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline pid_t task_tgid_vnr(struct task_struct *tsk)
|
|
|
|
{
|
2017-06-04 17:32:13 +08:00
|
|
|
return __task_pid_nr_ns(tsk, PIDTYPE_TGID, NULL);
|
2017-08-21 23:35:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline pid_t task_ppid_nr_ns(const struct task_struct *tsk, struct pid_namespace *ns)
|
|
|
|
{
|
|
|
|
pid_t pid = 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
if (pid_alive(tsk))
|
|
|
|
pid = task_tgid_nr_ns(rcu_dereference(tsk->real_parent), ns);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return pid;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pid_t task_ppid_nr(const struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return task_ppid_nr_ns(tsk, &init_pid_ns);
|
|
|
|
}
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/* Obsolete, do not use: */
|
2009-04-03 07:58:39 +08:00
|
|
|
static inline pid_t task_pgrp_nr(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return task_pgrp_nr_ns(tsk, &init_pid_ns);
|
|
|
|
}
|
2007-10-19 14:40:06 +08:00
|
|
|
|
2017-09-23 00:30:40 +08:00
|
|
|
#define TASK_REPORT_IDLE (TASK_REPORT + 1)
|
|
|
|
#define TASK_REPORT_MAX (TASK_REPORT_IDLE << 1)
|
|
|
|
|
2022-01-21 00:25:19 +08:00
|
|
|
static inline unsigned int __task_state_index(unsigned int tsk_state,
|
|
|
|
unsigned int tsk_exit_state)
|
2017-08-07 16:44:23 +08:00
|
|
|
{
|
2022-01-21 00:25:19 +08:00
|
|
|
unsigned int state = (tsk_state | tsk_exit_state) & TASK_REPORT;
|
2017-08-07 16:44:23 +08:00
|
|
|
|
2017-09-23 00:30:40 +08:00
|
|
|
BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
|
|
|
|
|
|
|
|
if (tsk_state == TASK_IDLE)
|
|
|
|
state = TASK_REPORT_IDLE;
|
|
|
|
|
2022-01-21 00:25:20 +08:00
|
|
|
/*
|
|
|
|
* We're lying here, but rather than expose a completely new task state
|
|
|
|
* to userspace, we can make this appear as if the task has gone through
|
|
|
|
* a regular rt_mutex_lock() call.
|
|
|
|
*/
|
|
|
|
if (tsk_state == TASK_RTLOCK_WAIT)
|
|
|
|
state = TASK_UNINTERRUPTIBLE;
|
|
|
|
|
2017-09-23 00:09:26 +08:00
|
|
|
return fls(state);
|
|
|
|
}
|
|
|
|
|
2022-01-21 00:25:19 +08:00
|
|
|
static inline unsigned int task_state_index(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return __task_state_index(READ_ONCE(tsk->__state), tsk->exit_state);
|
|
|
|
}
|
|
|
|
|
2017-09-29 19:50:16 +08:00
|
|
|
static inline char task_index_to_char(unsigned int state)
|
2017-09-23 00:09:26 +08:00
|
|
|
{
|
2017-09-23 00:37:28 +08:00
|
|
|
static const char state_char[] = "RSDTtXZPI";
|
2017-09-23 00:09:26 +08:00
|
|
|
|
2017-09-23 00:30:40 +08:00
|
|
|
BUILD_BUG_ON(1 + ilog2(TASK_REPORT_MAX) != sizeof(state_char) - 1);
|
2017-08-07 16:44:23 +08:00
|
|
|
|
2017-09-23 00:09:26 +08:00
|
|
|
return state_char[state];
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline char task_state_to_char(struct task_struct *tsk)
|
|
|
|
{
|
2017-09-29 19:50:16 +08:00
|
|
|
return task_index_to_char(task_state_index(tsk));
|
2017-08-07 16:44:23 +08:00
|
|
|
}
|
|
|
|
|
2006-09-29 17:00:07 +08:00
|
|
|
/**
|
2016-01-01 22:03:01 +08:00
|
|
|
* is_global_init - check if a task structure is init. Since init
|
|
|
|
* is free to have sub-threads we need to check tgid.
|
2006-10-06 15:44:01 +08:00
|
|
|
* @tsk: Task structure to be checked.
|
|
|
|
*
|
|
|
|
* Check if a task structure is the first user space task the kernel created.
|
2013-07-13 02:45:47 +08:00
|
|
|
*
|
|
|
|
* Return: 1 if the task structure is init. 0 otherwise.
|
2007-10-19 14:39:52 +08:00
|
|
|
*/
|
2007-10-26 16:17:22 +08:00
|
|
|
static inline int is_global_init(struct task_struct *tsk)
|
2007-10-19 14:40:09 +08:00
|
|
|
{
|
2016-01-01 22:03:01 +08:00
|
|
|
return task_tgid_nr(tsk) == 1;
|
2007-10-19 14:40:09 +08:00
|
|
|
}
|
2007-10-19 14:39:52 +08:00
|
|
|
|
2006-10-02 17:19:00 +08:00
|
|
|
extern struct pid *cad_pid;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Per process flags
|
|
|
|
*/
|
2020-08-20 03:55:05 +08:00
|
|
|
#define PF_VCPU 0x00000001 /* I'm a virtual CPU */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_IDLE 0x00000002 /* I am an IDLE thread */
|
|
|
|
#define PF_EXITING 0x00000004 /* Getting shut down */
|
2021-09-02 00:33:50 +08:00
|
|
|
#define PF_POSTCOREDUMP 0x00000008 /* Coredumps should ignore this task */
|
2020-08-20 03:55:05 +08:00
|
|
|
#define PF_IO_WORKER 0x00000010 /* Task is an IO worker */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */
|
|
|
|
#define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */
|
|
|
|
#define PF_MCE_PROCESS 0x00000080 /* Process policy on mce errors */
|
|
|
|
#define PF_SUPERPRIV 0x00000100 /* Used super-user privileges */
|
|
|
|
#define PF_DUMPCORE 0x00000200 /* Dumped core */
|
|
|
|
#define PF_SIGNALED 0x00000400 /* Killed by a signal */
|
|
|
|
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
|
|
|
|
#define PF_NPROC_EXCEEDED 0x00001000 /* set_user() noticed that RLIMIT_NPROC was exceeded */
|
|
|
|
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
|
2022-09-06 19:27:32 +08:00
|
|
|
#define PF__HOLE__00004000 0x00004000
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
|
2022-09-06 19:27:32 +08:00
|
|
|
#define PF__HOLE__00010000 0x00010000
|
mm: introduce memalloc_nofs_{save,restore} API
GFP_NOFS context is used for the following 5 reasons currently:
- to prevent deadlocks when the lock held by the allocation
context would be needed during the memory reclaim
- to prevent stack overflows during the reclaim because the
allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on other
reclaimers to make forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives
Unfortunately overuse of this allocation context brings some problems to
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.
In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.
Not only is this easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where the FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.
Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.
This patch shouldn't introduce any functional changes.
Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
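Typical scope usage then looks like the sketch below; the
filesystem-side function is invented for illustration, while the
memalloc_nofs_save()/memalloc_nofs_restore() pair is the API introduced
here:

  unsigned int nofs_flags;

  nofs_flags = memalloc_nofs_save();
  /*
   * Every allocation in this scope, including ones issued deep inside
   * other layers, implicitly behaves as if __GFP_FS was cleared.
   */
  err = example_fs_commit_transaction(sb);
  memalloc_nofs_restore(nofs_flags);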
[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:15 +08:00
|
|
|
#define PF_KSWAPD 0x00020000 /* I am kswapd */
|
|
|
|
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
|
|
|
|
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
|
mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).
The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processes have been
throttled. The purpose of this is to avoid deadlocks that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being throttled and cannot write.
This approach was designed when all threads were blocked equally,
independently of which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.
The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.
This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below (or equal
to) the free-run threshold for that bdi. This ensures it will always be
able to have some pages in flight, and so will not deadlock.
In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.
This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
it causes attention to be focussed only on the target bdi.
So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless the
global and the local free-run thresholds are exceeded.
Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behaviour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks. I don't know what is wanted for realtime.
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 12:48:18 +08:00
|
|
|
#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only against the bdi I write to,
|
|
|
|
* I am cleaning dirty pages from some other bdi. */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
|
|
|
|
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
|
2022-09-06 19:27:32 +08:00
|
|
|
#define PF__HOLE__00800000 0x00800000
|
|
|
|
#define PF__HOLE__01000000 0x01000000
|
|
|
|
#define PF__HOLE__02000000 0x02000000
|
2019-04-23 22:26:36 +08:00
|
|
|
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
|
2021-05-05 09:38:53 +08:00
|
|
|
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
|
2022-09-06 19:27:32 +08:00
|
|
|
#define PF__HOLE__20000000 0x20000000
|
|
|
|
#define PF__HOLE__40000000 0x40000000
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Only the _current_ task can read/write to tsk->flags, but other
|
|
|
|
* tasks can access tsk->flags in readonly mode for example
|
|
|
|
* with tsk_used_math (like during threaded core dumping).
|
|
|
|
* There is however an exception to this rule during ptrace
|
|
|
|
* or during fork: the ptracer task is allowed to write to the
|
|
|
|
* child->flags of its traced child (same goes for fork, the parent
|
|
|
|
* can write to the child->flags), because we're guaranteed the
|
|
|
|
* child is not running and in turn not changing child->flags
|
|
|
|
* at the same time the parent does it.
|
|
|
|
*/
|
2017-02-07 05:06:35 +08:00
|
|
|
#define clear_stopped_child_used_math(child) do { (child)->flags &= ~PF_USED_MATH; } while (0)
|
|
|
|
#define set_stopped_child_used_math(child) do { (child)->flags |= PF_USED_MATH; } while (0)
|
|
|
|
#define clear_used_math() clear_stopped_child_used_math(current)
|
|
|
|
#define set_used_math() set_stopped_child_used_math(current)
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#define conditional_stopped_child_used_math(condition, child) \
|
|
|
|
do { (child)->flags &= ~PF_USED_MATH, (child)->flags |= (condition) ? PF_USED_MATH : 0; } while (0)
|
2017-02-07 05:06:35 +08:00
|
|
|
|
|
|
|
#define conditional_used_math(condition) conditional_stopped_child_used_math(condition, current)
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#define copy_to_stopped_child_used_math(child) \
|
|
|
|
do { (child)->flags &= ~PF_USED_MATH, (child)->flags |= current->flags & PF_USED_MATH; } while (0)
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* NOTE: this will return 0 or PF_USED_MATH, it will never return 1 */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
|
|
|
|
#define used_math() tsk_used_math(current)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-09-20 21:31:11 +08:00
|
|
|
static __always_inline bool is_percpu_thread(void)
|
2017-05-24 16:15:41 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
return (current->flags & PF_NO_SETAFFINITY) &&
|
|
|
|
(current->nr_cpus_allowed == 1);
|
|
|
|
#else
|
|
|
|
return true;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2014-05-22 06:23:46 +08:00
|
|
|
/* Per-process atomic flags. */
|
2017-02-07 05:06:35 +08:00
|
|
|
#define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
|
|
|
|
#define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
|
|
|
|
#define PFA_SPREAD_SLAB 2 /* Spread some slab caches over cpuset */
|
2018-05-04 04:09:15 +08:00
|
|
|
#define PFA_SPEC_SSB_DISABLE 3 /* Speculative Store Bypass disabled */
|
|
|
|
#define PFA_SPEC_SSB_FORCE_DISABLE 4 /* Speculative Store Bypass force disabled */
|
x86/speculation: Add prctl() control for indirect branch speculation
Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
indirect branch speculation via STIBP and IBPB.
Invocations:
Check indirect branch speculation status with
- prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
Enable indirect branch speculation with
- prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
Disable indirect branch speculation with
- prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
Force disable indirect branch speculation with
- prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
See Documentation/userspace-api/spec_ctrl.rst.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
2018-11-26 02:33:53 +08:00
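To illustrate the userspace interface described in the commit message above, a minimal sketch of the documented prctl() calls might look as follows (hedged: the constants come from <linux/prctl.h>, and error handling is kept to a bare perror()):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

int main(void)
{
	/* Query the current indirect branch speculation state of this task. */
	int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);

	printf("indirect branch speculation state: %d\n", state);

	/* Restrict indirect branch speculation (reflected in PFA_SPEC_IB_DISABLE below). */
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0) != 0)
		perror("prctl(PR_SET_SPECULATION_CTRL)");

	return 0;
}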
|
|
|
#define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */
|
|
|
|
#define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */
|
2019-01-17 06:01:36 +08:00
|
|
|
#define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */
|
2014-05-22 06:23:46 +08:00
|
|
|
|
2014-09-25 09:40:40 +08:00
|
|
|
#define TASK_PFA_TEST(name, func) \
|
|
|
|
static inline bool task_##func(struct task_struct *p) \
|
|
|
|
{ return test_bit(PFA_##name, &p->atomic_flags); }
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2014-09-25 09:40:40 +08:00
|
|
|
#define TASK_PFA_SET(name, func) \
|
|
|
|
static inline void task_set_##func(struct task_struct *p) \
|
|
|
|
{ set_bit(PFA_##name, &p->atomic_flags); }
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2014-09-25 09:40:40 +08:00
|
|
|
#define TASK_PFA_CLEAR(name, func) \
|
|
|
|
static inline void task_clear_##func(struct task_struct *p) \
|
|
|
|
{ clear_bit(PFA_##name, &p->atomic_flags); }
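/*
 * For reference, each TASK_PFA_* invocation below expands to a small inline
 * helper. A hedged example: TASK_PFA_TEST(NO_NEW_PRIVS, no_new_privs)
 * produces roughly
 *
 *	static inline bool task_no_new_privs(struct task_struct *p)
 *	{
 *		return test_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
 *	}
 *
 * Note that only the TEST and SET variants are instantiated for NO_NEW_PRIVS
 * below, so no helper exists to clear that flag.
 */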
|
|
|
|
|
|
|
|
TASK_PFA_TEST(NO_NEW_PRIVS, no_new_privs)
|
|
|
|
TASK_PFA_SET(NO_NEW_PRIVS, no_new_privs)
|
2014-05-22 06:23:46 +08:00
|
|
|
|
2014-09-25 09:41:02 +08:00
|
|
|
TASK_PFA_TEST(SPREAD_PAGE, spread_page)
|
|
|
|
TASK_PFA_SET(SPREAD_PAGE, spread_page)
|
|
|
|
TASK_PFA_CLEAR(SPREAD_PAGE, spread_page)
|
|
|
|
|
|
|
|
TASK_PFA_TEST(SPREAD_SLAB, spread_slab)
|
|
|
|
TASK_PFA_SET(SPREAD_SLAB, spread_slab)
|
|
|
|
TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
|
2014-05-22 06:23:46 +08:00
|
|
|
|
2018-05-04 04:09:15 +08:00
|
|
|
TASK_PFA_TEST(SPEC_SSB_DISABLE, spec_ssb_disable)
|
|
|
|
TASK_PFA_SET(SPEC_SSB_DISABLE, spec_ssb_disable)
|
|
|
|
TASK_PFA_CLEAR(SPEC_SSB_DISABLE, spec_ssb_disable)
|
|
|
|
|
2019-01-17 06:01:36 +08:00
|
|
|
TASK_PFA_TEST(SPEC_SSB_NOEXEC, spec_ssb_noexec)
|
|
|
|
TASK_PFA_SET(SPEC_SSB_NOEXEC, spec_ssb_noexec)
|
|
|
|
TASK_PFA_CLEAR(SPEC_SSB_NOEXEC, spec_ssb_noexec)
|
|
|
|
|
2018-05-04 04:09:15 +08:00
|
|
|
TASK_PFA_TEST(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable)
|
|
|
|
TASK_PFA_SET(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable)
|
|
|
|
|
2018-11-26 02:33:53 +08:00
|
|
|
TASK_PFA_TEST(SPEC_IB_DISABLE, spec_ib_disable)
|
|
|
|
TASK_PFA_SET(SPEC_IB_DISABLE, spec_ib_disable)
|
|
|
|
TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
|
|
|
|
|
|
|
|
TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
|
|
|
|
TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline void
|
2017-04-07 08:03:26 +08:00
|
|
|
current_restore_flags(unsigned long orig_flags, unsigned long flags)
|
2012-08-01 07:44:07 +08:00
|
|
|
{
|
2017-04-07 08:03:26 +08:00
|
|
|
current->flags &= ~flags;
|
|
|
|
current->flags |= orig_flags & flags;
|
2012-08-01 07:44:07 +08:00
|
|
|
}
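/*
 * A hedged sketch of the intended save/restore pattern, using PF_MEMALLOC
 * purely as an example flag:
 *
 *	unsigned long pflags = current->flags & PF_MEMALLOC;
 *
 *	current->flags |= PF_MEMALLOC;
 *	... do work with the flag set ...
 *	current_restore_flags(pflags, PF_MEMALLOC);
 */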
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
|
2022-08-03 09:54:51 +08:00
|
|
|
extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2017-02-07 05:06:35 +08:00
|
|
|
extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
|
|
|
|
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
|
2021-07-30 19:24:33 +08:00
|
|
|
extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
|
|
|
|
extern void release_user_cpus_ptr(struct task_struct *p);
|
2021-07-30 19:24:36 +08:00
|
|
|
extern int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask);
|
sched: Allow task CPU affinity to be restricted on asymmetric systems
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask. Similarly, if a subsequent execve() reverts to
a compatible image, then the old affinity is restored if it is still
valid.
In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
{force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
and restore the affinity mask for a task based on the compatible CPUs.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
2021-07-30 19:24:35 +08:00
|
|
|
extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
|
|
|
|
extern void relax_compatible_cpus_allowed_ptr(struct task_struct *p);
|
2005-04-17 06:20:36 +08:00
|
|
|
#else
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
|
2011-05-19 14:08:58 +08:00
|
|
|
{
|
|
|
|
}
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-11-25 00:05:14 +08:00
|
|
|
if (!cpumask_test_cpu(0, new_mask))
|
2005-04-17 06:20:36 +08:00
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
2021-07-30 19:24:33 +08:00
|
|
|
static inline int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node)
|
|
|
|
{
|
|
|
|
if (src->user_cpus_ptr)
|
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline void release_user_cpus_ptr(struct task_struct *p)
|
|
|
|
{
|
|
|
|
WARN_ON(p->user_cpus_ptr);
|
|
|
|
}
|
2021-07-30 19:24:36 +08:00
|
|
|
|
|
|
|
static inline int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
2009-09-24 23:34:38 +08:00
|
|
|
|
2014-05-23 18:20:42 +08:00
|
|
|
extern int yield_to(struct task_struct *p, bool preempt);
|
2006-07-03 15:25:41 +08:00
|
|
|
extern void set_user_nice(struct task_struct *p, long nice);
|
|
|
|
extern int task_prio(const struct task_struct *p);
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2014-01-28 11:00:45 +08:00
|
|
|
/**
|
|
|
|
* task_nice - return the nice value of a given task.
|
|
|
|
* @p: the task in question.
|
|
|
|
*
|
|
|
|
* Return: The nice value [ -20 ... 0 ... 19 ].
|
|
|
|
*/
|
|
|
|
static inline int task_nice(const struct task_struct *p)
|
|
|
|
{
|
|
|
|
return PRIO_TO_NICE((p)->static_prio);
|
|
|
|
}
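/*
 * A worked example, assuming the conventional DEFAULT_PRIO of 120 behind
 * PRIO_TO_NICE(): static_prio 120 maps to nice 0, 100 to nice -20, and 139
 * to nice 19, matching the [ -20 ... 0 ... 19 ] range documented above.
 */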
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2006-07-03 15:25:41 +08:00
|
|
|
extern int can_nice(const struct task_struct *p, const int nice);
|
|
|
|
extern int task_curr(const struct task_struct *p);
|
2005-04-17 06:20:36 +08:00
|
|
|
extern int idle_cpu(int cpu);
|
2018-05-10 00:39:48 +08:00
|
|
|
extern int available_idle_cpu(int cpu);
|
2017-02-07 05:06:35 +08:00
|
|
|
extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
|
|
|
|
extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
|
2020-04-22 19:10:04 +08:00
|
|
|
extern void sched_set_fifo(struct task_struct *p);
|
|
|
|
extern void sched_set_fifo_low(struct task_struct *p);
|
|
|
|
extern void sched_set_normal(struct task_struct *p, int nice);
|
2017-02-07 05:06:35 +08:00
|
|
|
extern int sched_setattr(struct task_struct *, const struct sched_attr *);
|
2017-12-04 18:23:20 +08:00
|
|
|
extern int sched_setattr_nocheck(struct task_struct *, const struct sched_attr *);
|
2006-07-03 15:25:41 +08:00
|
|
|
extern struct task_struct *idle_task(int cpu);
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2011-11-11 04:41:56 +08:00
|
|
|
/**
|
|
|
|
* is_idle_task - is the specified task an idle task?
|
2012-01-22 03:03:13 +08:00
|
|
|
* @p: the task in question.
|
2013-07-13 02:45:47 +08:00
|
|
|
*
|
|
|
|
* Return: 1 if @p is an idle task. 0 otherwise.
|
2011-11-11 04:41:56 +08:00
|
|
|
*/
|
2020-08-21 01:20:46 +08:00
|
|
|
static __always_inline bool is_idle_task(const struct task_struct *p)
|
2011-11-11 04:41:56 +08:00
|
|
|
{
|
2016-11-29 15:03:05 +08:00
|
|
|
return !!(p->flags & PF_IDLE);
|
2011-11-11 04:41:56 +08:00
|
|
|
}
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2006-07-03 15:25:41 +08:00
|
|
|
extern struct task_struct *curr_task(int cpu);
|
2016-09-21 02:29:40 +08:00
|
|
|
extern void ia64_set_curr_task(int cpu, struct task_struct *p);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
void yield(void);
|
|
|
|
|
|
|
|
union thread_union {
|
2018-01-02 23:12:01 +08:00
|
|
|
#ifndef CONFIG_ARCH_TASK_STRUCT_ON_STACK
|
|
|
|
struct task_struct task;
|
|
|
|
#endif
|
2016-09-14 05:29:24 +08:00
|
|
|
#ifndef CONFIG_THREAD_INFO_IN_TASK
|
2005-04-17 06:20:36 +08:00
|
|
|
struct thread_info thread_info;
|
2016-09-14 05:29:24 +08:00
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned long stack[THREAD_SIZE/sizeof(long)];
|
|
|
|
};
|
|
|
|
|
2018-01-02 23:12:01 +08:00
|
|
|
#ifndef CONFIG_THREAD_INFO_IN_TASK
|
|
|
|
extern struct thread_info init_thread_info;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
extern unsigned long init_stack[THREAD_SIZE / sizeof(unsigned long)];
|
|
|
|
|
2017-02-04 05:59:33 +08:00
|
|
|
#ifdef CONFIG_THREAD_INFO_IN_TASK
|
2021-09-14 20:10:33 +08:00
|
|
|
# define task_thread_info(task) (&(task)->thread_info)
|
2017-02-04 05:59:33 +08:00
|
|
|
#elif !defined(__HAVE_THREAD_FUNCTIONS)
|
|
|
|
# define task_thread_info(task) ((struct thread_info *)(task)->stack)
|
|
|
|
#endif
|
|
|
|
|
2007-10-19 14:40:06 +08:00
|
|
|
/*
|
|
|
|
* find a task by one of its numerical ids
|
|
|
|
*
|
|
|
|
* find_task_by_pid_ns():
|
|
|
|
* finds a task by its pid in the specified namespace
|
2007-10-19 14:40:16 +08:00
|
|
|
* find_task_by_vpid():
|
|
|
|
* finds a task by its virtual pid
|
2007-10-19 14:40:06 +08:00
|
|
|
*
|
2008-07-25 16:48:36 +08:00
|
|
|
* see also find_vpid() etc in include/linux/pid.h
|
2007-10-19 14:40:06 +08:00
|
|
|
*/
|
|
|
|
|
2007-10-19 14:40:16 +08:00
|
|
|
extern struct task_struct *find_task_by_vpid(pid_t nr);
|
2017-02-07 05:06:35 +08:00
|
|
|
extern struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns);
|
2007-10-19 14:40:06 +08:00
|
|
|
|
2018-02-07 07:40:17 +08:00
|
|
|
/*
|
|
|
|
* find a task by its virtual pid and get the task struct
|
|
|
|
*/
|
|
|
|
extern struct task_struct *find_get_task_by_vpid(pid_t nr);
|
|
|
|
|
2008-02-14 07:03:15 +08:00
|
|
|
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
|
|
|
|
extern int wake_up_process(struct task_struct *tsk);
|
2011-05-12 00:18:05 +08:00
|
|
|
extern void wake_up_new_task(struct task_struct *tsk);
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2017-02-07 05:06:35 +08:00
|
|
|
extern void kick_process(struct task_struct *tsk);
|
2005-04-17 06:20:36 +08:00
|
|
|
#else
|
2017-02-07 05:06:35 +08:00
|
|
|
static inline void kick_process(struct task_struct *tsk) { }
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
|
|
|
|
2014-05-28 16:45:04 +08:00
|
|
|
extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2014-05-28 16:45:04 +08:00
|
|
|
static inline void set_task_comm(struct task_struct *tsk, const char *from)
|
|
|
|
{
|
|
|
|
__set_task_comm(tsk, from, false);
|
|
|
|
}
|
2017-02-07 05:06:35 +08:00
|
|
|
|
2017-12-15 07:32:41 +08:00
|
|
|
extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
|
|
|
|
#define get_task_comm(buf, tsk) ({ \
|
|
|
|
BUILD_BUG_ON(sizeof(buf) != TASK_COMM_LEN); \
|
|
|
|
__get_task_comm(buf, sizeof(buf), tsk); \
|
|
|
|
})
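/*
 * Because the BUILD_BUG_ON() above checks sizeof(buf), callers pass a
 * fixed-size array rather than a pointer. A hedged sketch:
 *
 *	char comm[TASK_COMM_LEN];
 *
 *	get_task_comm(comm, current);
 *	pr_debug("running in %s\n", comm);
 */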
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
2020-03-27 19:42:00 +08:00
|
|
|
static __always_inline void scheduler_ipi(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
|
|
|
|
* TIF_NEED_RESCHED remotely (for the first time) will also send
|
|
|
|
* this IPI.
|
|
|
|
*/
|
|
|
|
preempt_fold_need_resched();
|
|
|
|
}
|
2021-06-11 16:28:17 +08:00
|
|
|
extern unsigned long wait_task_inactive(struct task_struct *, unsigned int match_state);
|
2005-04-17 06:20:36 +08:00
|
|
|
#else
|
2011-04-05 23:23:39 +08:00
|
|
|
static inline void scheduler_ipi(void) { }
|
2021-06-11 16:28:17 +08:00
|
|
|
static inline unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
|
2008-07-26 10:45:58 +08:00
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
|
|
|
|
2017-02-07 05:06:35 +08:00
|
|
|
/*
|
|
|
|
* Set thread flags in other task's structures.
|
|
|
|
* See asm/thread_info.h for TIF_xxxx flags available:
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
static inline void set_tsk_thread_flag(struct task_struct *tsk, int flag)
|
|
|
|
{
|
2005-11-14 08:06:55 +08:00
|
|
|
set_ti_thread_flag(task_thread_info(tsk), flag);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void clear_tsk_thread_flag(struct task_struct *tsk, int flag)
|
|
|
|
{
|
2005-11-14 08:06:55 +08:00
|
|
|
clear_ti_thread_flag(task_thread_info(tsk), flag);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2018-04-12 00:54:20 +08:00
|
|
|
static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
|
|
|
|
bool value)
|
|
|
|
{
|
|
|
|
update_ti_thread_flag(task_thread_info(tsk), flag, value);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
|
|
|
|
{
|
2005-11-14 08:06:55 +08:00
|
|
|
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
|
|
|
|
{
|
2005-11-14 08:06:55 +08:00
|
|
|
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
|
|
|
|
{
|
2005-11-14 08:06:55 +08:00
|
|
|
return test_ti_thread_flag(task_thread_info(tsk), flag);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_tsk_need_resched(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void clear_tsk_need_resched(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
|
|
|
|
}
|
|
|
|
|
2008-04-23 19:13:29 +08:00
|
|
|
static inline int test_tsk_need_resched(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return unlikely(test_tsk_thread_flag(tsk, TIF_NEED_RESCHED));
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* cond_resched() and cond_resched_lock(): latency reduction via
|
|
|
|
* explicit rescheduling in places that are safe. The return
|
|
|
|
* value indicates whether a reschedule was done in fact.
|
|
|
|
* cond_resched_lock() will drop the spinlock before scheduling,
|
|
|
|
*/
|
2021-01-18 22:12:20 +08:00
|
|
|
#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
|
|
|
|
extern int __cond_resched(void);
|
|
|
|
|
sched/preempt: Add PREEMPT_DYNAMIC using static keys
Where an architecture selects HAVE_STATIC_CALL but not
HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
which will either branch to a callee or return to the caller.
On such architectures, a number of constraints can conspire to make
those trampolines more complicated and potentially less useful than we'd
like. For example:
* Hardware and software control flow integrity schemes can require the
addition of "landing pad" instructions (e.g. `BTI` for arm64), which
will also be present at the "real" callee.
* Limited branch ranges can require that trampolines generate or load an
address into a register and perform an indirect branch (or at least
have a slow path that does so). This loses some of the benefits of
having a direct branch.
* Interaction with SW CFI schemes can be complicated and fragile, e.g.
requiring that we can recognise idiomatic codegen and remove the
indirections, at least until clang provides more helpful
mechanisms for dealing with this.
For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
really only need to enable/disable specific preemption functions. We can
achieve the same effect without a number of the pain points above by
using static keys to fold early returns into the preemption functions
themselves rather than in an out-of-line trampoline, effectively
inlining the trampoline into the start of the function.
For arm64, this results in good code generation. For example, the
dynamic_cond_resched() wrapper looks as follows when enabled. When
disabled, the first `B` is replaced with a `NOP`, resulting in an early
return.
| <dynamic_cond_resched>:
| bti c
| b <dynamic_cond_resched+0x10> // or `nop`
| mov w0, #0x0
| ret
| mrs x0, sp_el0
| ldr x0, [x0, #8]
| cbnz x0, <dynamic_cond_resched+0x8>
| paciasp
| stp x29, x30, [sp, #-16]!
| mov x29, sp
| bl <preempt_schedule_common>
| mov w0, #0x1
| ldp x29, x30, [sp], #16
| autiasp
| ret
... compared to the regular form of the function:
| <__cond_resched>:
| bti c
| mrs x0, sp_el0
| ldr x1, [x0, #8]
| cbz x1, <__cond_resched+0x18>
| mov w0, #0x0
| ret
| paciasp
| stp x29, x30, [sp, #-16]!
| mov x29, sp
| bl <preempt_schedule_common>
| mov w0, #0x1
| ldp x29, x30, [sp], #16
| autiasp
| ret
Any architecture which implements static keys should be able to use this
to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
calls. Since this is likely to have greater overhead than (inlined)
static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
HAVE_PREEMPT_DYNAMIC_CALL is selected.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
2022-02-15 00:52:14 +08:00
|
|
|
#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
|
2021-01-18 22:12:20 +08:00
|
|
|
|
|
|
|
DECLARE_STATIC_CALL(cond_resched, __cond_resched);
|
|
|
|
|
|
|
|
static __always_inline int _cond_resched(void)
|
|
|
|
{
|
2021-01-25 23:26:50 +08:00
|
|
|
return static_call_mod(cond_resched)();
|
2021-01-18 22:12:20 +08:00
|
|
|
}
|
|
|
|
|
2022-02-15 00:52:14 +08:00
|
|
|
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
|
|
|
|
extern int dynamic_cond_resched(void);
|
|
|
|
|
|
|
|
static __always_inline int _cond_resched(void)
|
|
|
|
{
|
|
|
|
return dynamic_cond_resched();
|
|
|
|
}
|
|
|
|
|
2016-09-19 18:57:53 +08:00
|
|
|
#else
|
2021-01-18 22:12:20 +08:00
|
|
|
|
|
|
|
static inline int _cond_resched(void)
|
|
|
|
{
|
|
|
|
return __cond_resched();
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_PREEMPT_DYNAMIC */
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
2016-09-19 18:57:53 +08:00
|
|
|
static inline int _cond_resched(void) { return 0; }
|
2021-01-18 22:12:20 +08:00
|
|
|
|
|
|
|
#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
|
2009-07-16 21:44:29 +08:00
|
|
|
|
2009-07-16 21:44:29 +08:00
|
|
|
#define cond_resched() ({ \
|
2021-09-24 00:54:35 +08:00
|
|
|
__might_resched(__FILE__, __LINE__, 0); \
|
2009-07-16 21:44:29 +08:00
|
|
|
_cond_resched(); \
|
|
|
|
})
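/*
 * A hedged usage sketch for a long-running loop in process context, where
 * cond_resched() provides a voluntary preemption point on non-preemptible
 * kernels (process_one() and nr_items are hypothetical):
 *
 *	for (i = 0; i < nr_items; i++) {
 *		process_one(i);
 *		cond_resched();
 *	}
 */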
|
2009-07-16 21:44:29 +08:00
|
|
|
|
2009-07-16 21:44:29 +08:00
|
|
|
extern int __cond_resched_lock(spinlock_t *lock);
|
2021-02-03 02:57:14 +08:00
|
|
|
extern int __cond_resched_rwlock_read(rwlock_t *lock);
|
|
|
|
extern int __cond_resched_rwlock_write(rwlock_t *lock);
|
2009-07-16 21:44:29 +08:00
|
|
|
|
2021-09-24 00:54:43 +08:00
|
|
|
#define MIGHT_RESCHED_RCU_SHIFT 8
|
|
|
|
#define MIGHT_RESCHED_PREEMPT_MASK ((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)
|
|
|
|
|
2021-09-24 00:54:44 +08:00
|
|
|
#ifndef CONFIG_PREEMPT_RT
|
|
|
|
/*
|
|
|
|
* Non RT kernels have an elevated preempt count due to the held lock,
|
|
|
|
* but are not allowed to be inside an RCU read side critical section
|
|
|
|
*/
|
|
|
|
# define PREEMPT_LOCK_RESCHED_OFFSETS PREEMPT_LOCK_OFFSET
|
|
|
|
#else
|
|
|
|
/*
|
|
|
|
* spin/rw_lock() on RT implies rcu_read_lock(). The might_sleep() check in
|
|
|
|
* cond_resched*lock() has to take that into account because it checks for
|
|
|
|
* preempt_count() and rcu_preempt_depth().
|
|
|
|
*/
|
|
|
|
# define PREEMPT_LOCK_RESCHED_OFFSETS \
|
|
|
|
(PREEMPT_LOCK_OFFSET + (1U << MIGHT_RESCHED_RCU_SHIFT))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#define cond_resched_lock(lock) ({ \
|
|
|
|
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
|
|
|
|
__cond_resched_lock(lock); \
|
2009-07-16 21:44:29 +08:00
|
|
|
})
|
|
|
|
|
2021-09-24 00:54:44 +08:00
|
|
|
#define cond_resched_rwlock_read(lock) ({ \
|
|
|
|
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
|
|
|
|
__cond_resched_rwlock_read(lock); \
|
2021-02-03 02:57:14 +08:00
|
|
|
})
|
|
|
|
|
2021-09-24 00:54:44 +08:00
|
|
|
#define cond_resched_rwlock_write(lock) ({ \
|
|
|
|
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
|
|
|
|
__cond_resched_rwlock_write(lock); \
|
2021-02-03 02:57:14 +08:00
|
|
|
})
|
|
|
|
|
2013-05-22 13:50:31 +08:00
|
|
|
static inline void cond_resched_rcu(void)
|
|
|
|
{
|
|
|
|
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
|
|
|
|
rcu_read_unlock();
|
|
|
|
cond_resched();
|
|
|
|
rcu_read_lock();
|
|
|
|
#endif
|
|
|
|
}
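/*
 * A hedged sketch of scanning a long RCU-protected table while allowing
 * rescheduling (table[], nr_entries and inspect() are hypothetical). Any
 * pointer obtained before cond_resched_rcu() must not be used after it,
 * since the read-side critical section may be dropped briefly:
 *
 *	rcu_read_lock();
 *	for (i = 0; i < nr_entries; i++) {
 *		struct foo *f = rcu_dereference(table[i]);
 *
 *		if (f)
 *			inspect(f);
 *		cond_resched_rcu();
 *	}
 *	rcu_read_unlock();
 */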
|
|
|
|
|
2021-11-13 02:52:01 +08:00
|
|
|
#ifdef CONFIG_PREEMPT_DYNAMIC
|
|
|
|
|
|
|
|
extern bool preempt_model_none(void);
|
|
|
|
extern bool preempt_model_voluntary(void);
|
|
|
|
extern bool preempt_model_full(void);
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline bool preempt_model_none(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_PREEMPT_NONE);
|
|
|
|
}
|
|
|
|
static inline bool preempt_model_voluntary(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
|
|
|
|
}
|
|
|
|
static inline bool preempt_model_full(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_PREEMPT);
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline bool preempt_model_rt(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_PREEMPT_RT);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Does the preemption model allow non-cooperative preemption?
|
|
|
|
*
|
|
|
|
* For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
|
|
|
|
* CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
|
|
|
|
* kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
|
|
|
|
* PREEMPT_NONE model.
|
|
|
|
*/
|
|
|
|
static inline bool preempt_model_preemptible(void)
|
|
|
|
{
|
|
|
|
return preempt_model_full() || preempt_model_rt();
|
|
|
|
}
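/*
 * A hedged sketch of reporting the effective model with these accessors
 * (the strings are illustrative only):
 *
 *	pr_info("preemption model: %s\n",
 *		preempt_model_rt()        ? "rt" :
 *		preempt_model_full()      ? "full" :
 *		preempt_model_voluntary() ? "voluntary" : "none");
 */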
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Does a critical section need to be broken due to another
|
2019-07-27 05:19:37 +08:00
|
|
|
* task waiting? (technically this does not depend on CONFIG_PREEMPTION,
|
2008-01-30 20:31:20 +08:00
|
|
|
* but reflects a general need for low latency)
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2008-01-30 20:31:20 +08:00
|
|
|
static inline int spin_needbreak(spinlock_t *lock)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-07-27 05:19:37 +08:00
|
|
|
#ifdef CONFIG_PREEMPTION
|
2008-01-30 20:31:20 +08:00
|
|
|
return spin_is_contended(lock);
|
|
|
|
#else
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
2008-01-30 20:31:20 +08:00
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
}
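/*
 * A hedged sketch of the classic lock-breaking pattern this helper enables
 * (have_work() and do_chunk() are hypothetical):
 *
 *	spin_lock(&lock);
 *	while (have_work()) {
 *		do_chunk();
 *		if (need_resched() || spin_needbreak(&lock)) {
 *			spin_unlock(&lock);
 *			cond_resched();
 *			spin_lock(&lock);
 *		}
 *	}
 *	spin_unlock(&lock);
 */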
|
|
|
|
|
2021-02-03 02:57:13 +08:00
|
|
|
/*
|
|
|
|
* Check if a rwlock is contended.
|
|
|
|
* Returns non-zero if there is another task waiting on the rwlock.
|
|
|
|
* Returns zero if the lock is not contended or the system / underlying
|
|
|
|
* rwlock implementation does not support contention detection.
|
|
|
|
* Technically this does not depend on CONFIG_PREEMPTION, but reflects a general need
|
|
|
|
* for low latency.
|
|
|
|
*/
|
|
|
|
static inline int rwlock_needbreak(rwlock_t *lock)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_PREEMPTION
|
|
|
|
return rwlock_is_contended(lock);
|
|
|
|
#else
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2013-09-27 23:30:03 +08:00
|
|
|
static __always_inline bool need_resched(void)
|
|
|
|
{
|
|
|
|
return unlikely(tif_need_resched());
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Wrappers for p->thread_info->cpu access. No-op on UP.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
|
|
|
static inline unsigned int task_cpu(const struct task_struct *p)
|
|
|
|
{
|
2019-01-21 23:52:40 +08:00
|
|
|
return READ_ONCE(task_thread_info(p)->cpu);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2007-07-10 00:51:58 +08:00
|
|
|
extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline unsigned int task_cpu(const struct task_struct *p)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
|
2021-05-13 07:29:22 +08:00
|
|
|
extern bool sched_task_on_rq(struct task_struct *p);
|
2021-09-30 06:02:14 +08:00
|
|
|
extern unsigned long get_wchan(struct task_struct *p);
|
rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs
Currently, the RCU Tasks Trace grace-period kthread IPIs each online CPU
using smp_call_function_single() in order to track any tasks currently in
RCU Tasks Trace read-side critical sections during which the corresponding
task has neither blocked nor been preempted. These IPIs are annoying
and are also not strictly necessary because any task that blocks or is
preempted within its current RCU Tasks Trace read-side critical section
will be tracked on one of the per-CPU rcu_tasks_percpu structure's
->rtp_blkd_tasks list. So the only time that this is a problem is if
one of the CPUs runs through a long-duration RCU Tasks Trace read-side
critical section without a context switch.
Note that the task_call_func() function cannot help here because there is
no safe way to identify the target task. Of course, the task_call_func()
function will be very useful later, when processing the list of tasks,
but it needs to know the task.
This commit therefore creates a cpu_curr_snapshot() function that returns
a pointer the task_struct structure of some task that happened to be
running on the specified CPU more or less during the time that the
cpu_curr_snapshot() function was executing. If there was no context
switch during this time, this function will return a pointer to the
task_struct structure of the task that was running throughout. If there
was a context switch, then the outgoing task will be taken care of by
RCU's context-switch hook, and the incoming task was either already taken
care during some previous context switch, or it is not currently within an
RCU Tasks Trace read-side critical section. And in this latter case, the
grace period already started, so there is no need to wait on this task.
This new cpu_curr_snapshot() function is invoked on each CPU early in
the RCU Tasks Trace grace-period processing, and the resulting tasks
are queued for later quiescent-state inspection.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: KP Singh <kpsingh@kernel.org>
2022-06-03 08:30:01 +08:00
|
|
|
extern struct task_struct *cpu_curr_snapshot(int cpu);
|
2021-05-13 07:29:22 +08:00
|
|
|
|
2016-11-02 17:08:28 +08:00
|
|
|
/*
|
|
|
|
* In order to reduce various lock holder preemption latencies provide an
|
|
|
|
* interface to see if a vCPU is currently running or not.
|
|
|
|
*
|
|
|
|
* This allows us to terminate optimistic spin loops and block, analogous to
|
|
|
|
* the native optimistic spin heuristic of testing if the lock owner task is
|
|
|
|
* running or not.
|
|
|
|
*/
|
|
|
|
#ifndef vcpu_is_preempted
|
2019-09-17 22:34:54 +08:00
|
|
|
static inline bool vcpu_is_preempted(int cpu)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2016-11-02 17:08:28 +08:00
|
|
|
#endif
|
|
|
|
|
2008-11-25 00:05:14 +08:00
|
|
|
extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
|
|
|
|
extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
|
2006-06-27 17:54:42 +08:00
|
|
|
|
2008-02-05 14:28:59 +08:00
|
|
|
#ifndef TASK_SIZE_OF
|
|
|
|
#define TASK_SIZE_OF(tsk) TASK_SIZE
|
|
|
|
#endif
|
|
|
|
|
2020-12-08 12:16:56 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2021-12-03 15:59:34 +08:00
|
|
|
static inline bool owner_on_cpu(struct task_struct *owner)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Due to the lock holder preemption issue, we skip spinning if the
|
|
|
|
* task is not on a CPU or its CPU is preempted.
|
|
|
|
*/
|
2021-12-03 15:59:35 +08:00
|
|
|
return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
|
2021-12-03 15:59:34 +08:00
|
|
|
}
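/*
 * A hedged sketch of the optimistic-spin loop this helper supports; the
 * ->owner field is a hypothetical stand-in for what mutex/rwsem track:
 *
 *	while (READ_ONCE(lock->owner) == owner) {
 *		if (!owner_on_cpu(owner))
 *			break;
 *		cpu_relax();
 *	}
 */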
|
|
|
|
|
2020-12-08 12:16:56 +08:00
|
|
|
/* Returns effective CPU energy utilization, as seen by the scheduler */
|
2022-06-21 17:04:10 +08:00
|
|
|
unsigned long sched_cpu_util(int cpu);
|
2020-12-08 12:16:56 +08:00
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 20:43:54 +08:00
|
|
|
#ifdef CONFIG_RSEQ
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Map the event mask on the user-space ABI enum rseq_cs_flags
|
|
|
|
* for direct mask checks.
|
|
|
|
*/
|
|
|
|
enum rseq_event_mask_bits {
|
|
|
|
RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
|
|
|
|
RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
|
|
|
|
RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
|
|
|
|
};
|
|
|
|
|
|
|
|
enum rseq_event_mask {
|
|
|
|
RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
|
|
|
|
RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
|
|
|
|
RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void rseq_set_notify_resume(struct task_struct *t)
|
|
|
|
{
|
|
|
|
if (t->rseq)
|
|
|
|
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
|
|
|
|
}
|
|
|
|
|
2018-06-22 18:45:07 +08:00
|
|
|
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
|
2018-06-02 20:43:54 +08:00
|
|
|
|
2018-06-22 18:45:07 +08:00
|
|
|
static inline void rseq_handle_notify_resume(struct ksignal *ksig,
|
|
|
|
struct pt_regs *regs)
|
2018-06-02 20:43:54 +08:00
|
|
|
{
|
|
|
|
if (current->rseq)
|
2018-06-22 18:45:07 +08:00
|
|
|
__rseq_handle_notify_resume(ksig, regs);
|
}

static inline void rseq_signal_deliver(struct ksignal *ksig,
				       struct pt_regs *regs)
{
	preempt_disable();
	__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
	preempt_enable();
	rseq_handle_notify_resume(ksig, regs);
}

/* rseq_preempt() requires preemption to be disabled. */
static inline void rseq_preempt(struct task_struct *t)
{
	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
	rseq_set_notify_resume(t);
}

/* rseq_migrate() requires preemption to be disabled. */
static inline void rseq_migrate(struct task_struct *t)
{
	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
	rseq_set_notify_resume(t);
}

/*
 * If parent process has a registered restartable sequences area, the
 * child inherits. Unregister rseq for a clone with CLONE_VM set.
 */
static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
{
	if (clone_flags & CLONE_VM) {
		t->rseq = NULL;
		t->rseq_sig = 0;
		t->rseq_event_mask = 0;
	} else {
		t->rseq = current->rseq;
		t->rseq_sig = current->rseq_sig;
		t->rseq_event_mask = current->rseq_event_mask;
	}
}

static inline void rseq_execve(struct task_struct *t)
{
	t->rseq = NULL;
	t->rseq_sig = 0;
	t->rseq_event_mask = 0;
}

#else

static inline void rseq_set_notify_resume(struct task_struct *t)
{
}
static inline void rseq_handle_notify_resume(struct ksignal *ksig,
					     struct pt_regs *regs)
{
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
				       struct pt_regs *regs)
{
}
static inline void rseq_preempt(struct task_struct *t)
{
}
static inline void rseq_migrate(struct task_struct *t)
{
}
static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
{
}
static inline void rseq_execve(struct task_struct *t)
{
}

#endif

#ifdef CONFIG_DEBUG_RSEQ

void rseq_syscall(struct pt_regs *regs);

#else

static inline void rseq_syscall(struct pt_regs *regs)
{
}

#endif

#ifdef CONFIG_SCHED_CORE
extern void sched_core_free(struct task_struct *tsk);
extern void sched_core_fork(struct task_struct *p);
sched: prctl() core-scheduling interface
This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).
The value of core scheduling isn't that tasks don't share a core;
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance, and that
requires a degree of flexibility in the API.
From a security perspective (and there are others), the thread,
process and process group distinction is an existing hierarchical
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.
With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads, as typically desired in a security
scenario.
API:
prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
For return values, EINVAL and ENOMEM mean what they usually do. ESRCH
means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE
permission to access tgtpid/srcpid. ENODEV indicates the machine lacks
SMT. (A minimal user-space usage sketch follows this commit message.)
[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
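As an illustration of the interface described above, here is a minimal user-space sketch, not part of this patch: it creates a core-scheduling cookie covering the current process and all of its threads, then reads the cookie back for the current task. The PR_SCHED_CORE* values mirror include/uapi/linux/prctl.h and the pid_type values mirror the kernel's enum pid_type; the fallback defines are only for headers that predate this series.

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE			62
# define PR_SCHED_CORE_GET		0
# define PR_SCHED_CORE_CREATE		1
#endif

/* Mirror of the kernel's enum pid_type values used as the 4th prctl argument. */
#define CORE_SCHED_PIDTYPE_PID		0
#define CORE_SCHED_PIDTYPE_TGID		1

int main(void)
{
	unsigned long cookie = 0;

	/* tgtpid == 0 means the current task; TGID scope covers the whole process. */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, CORE_SCHED_PIDTYPE_TGID, 0)) {
		perror("PR_SCHED_CORE_CREATE");	/* e.g. ENODEV on a machine without SMT */
		return 1;
	}

	/* GET reads back the cookie of a single task (PIDTYPE_PID scope). */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, CORE_SCHED_PIDTYPE_PID, &cookie))
		perror("PR_SCHED_CORE_GET");
	else
		printf("core scheduling cookie: %#lx\n", cookie);

	return 0;
}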
extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
				unsigned long uaddr);
#else
static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
#endif

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

#endif