This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was
around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
making the semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where now anything left in tracehook.h is
some weird strange thing that is difficult to understand.
Eric W. Biederman (15):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
Jann Horn (1):
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
Yang Li (1):
ptrace: Remove duplicated include in ptrace.c
MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 5 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 81 ++++++++++++-
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/ptrace.c | 47 +++++---
kernel/seccomp.c | 1 -
kernel/signal.c | 62 +++++-----
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
85 files changed, 372 insertions(+), 495 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
/HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
=uEro
-----END PGP SIGNATURE-----
Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
limited to 2 instructions (and typically fewer) on branch targets not starting
with ENDBR. CET-IBT also limits speculation of the next sequential instruction
after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides, as
described above, speculation limits itself.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
beB/7u2KIUbKEkSN
=jZfK
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
"Add support for Intel CET-IBT, available since Tigerlake (11th gen),
which is a coarse grained, hardware based, forward edge
Control-Flow-Integrity mechanism where any indirect CALL/JMP must
target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
is limited to 2 instructions (and typically fewer) on branch targets
not starting with ENDBR. CET-IBT also limits speculation of the next
sequential instruction after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides,
as described above, speculation limits itself"
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
kvm/emulate: Fix SETcc emulation for ENDBR
x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
kbuild: Fixup the IBT kbuild changes
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
x86: Remove toolchain check for X32 ABI capability
x86/alternative: Use .ibt_endbr_seal to seal indirect calls
objtool: Find unused ENDBR instructions
objtool: Validate IBT assumptions
objtool: Add IBT/ENDBR decoding
objtool: Read the NOENDBR annotation
x86: Annotate idtentry_df()
x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
x86: Annotate call_on_stack()
objtool: Rework ASM_REACHABLE
x86: Mark __invalid_creds() __noreturn
exit: Mark do_group_exit() __noreturn
x86: Mark stop_this_cpu() __noreturn
objtool: Ignore extra-symbol code
objtool: Rename --duplicate to --lto
...
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are usually
different.
In such system, because of the memory accessing pattern changing etc,
some pages in the slow memory may become hot globally. So in this
patch, the NUMA balancing mechanism is enhanced to optimize the page
placement among the different memory types according to hot/cold
dynamically.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism will stop to migrate pages if the
free memory of the target node becomes below the high watermark. This
is a reasonable policy if there's only one memory type. But this makes
the original NUMA balancing mechanism almost do not work to optimize
page placement among different memory types. Details are as follows.
It's the common cases that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So, there are almost always
no enough free pages in the fast memory nodes, so that the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,
a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.
The choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is influenced, e.g. the direct reclaiming may be triggered.
The choice "b" works much better at this aspect. If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added. Which is larger than the
high watermark and can be controlled via watermark_scale_factor.
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to be used to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follow, so that the users can enable/disable these
functionality individually.
The sysctl is converted from a Boolean value to a bits field. The
definition of the flags is,
- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING
We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results shows that the pmbench score can
improve up to 95.9%.
Thanks Andrew Morton to help fix the document format error.
Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler build,
from the fast-headers tree - including placeholder *_api.h headers for
later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI5rg8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGrw/+M3QOk6fH7G48wjlNnBvcOife6ls+Ni4k
ixOAcF4JKoixO8HieU5vv0A7yf/83tAa6fpeXeMf1hkCGc0NSlmLtuIux+WOmoAL
LzCyDEYfiP8KnVh0A1Tui/lK0+AkGo21O6ADhQE2gh8o2LpslOHQMzvtyekSzeeb
mVxMYQN+QH0m518xdO2D8IQv9ctOYK0eGjmkqdNfntOlytypPZHeNel/tCzwklP/
dElJUjNiSKDlUgTBPtL3DfpoLOI/0mHF2p6NEXvNyULxSOqJTu8pv9Z2ADb2kKo1
0D56iXBDngMi9MHIJLgvzsA8gKzHLFSuPbpODDqkTZCa28vaMB9NYGhJ643NtEie
IXTJEvF1rmNkcLcZlZxo0yjL0fjvPkczjw4Vj27gbrUQeEBfb4mfuI4BRmij63Ep
qEkgQTJhduCqqrQP1rVyhwWZRk1JNcVug+F6N42qWW3fg1xhj0YSrLai2c9nPez6
3Zt98H8YGS1Z/JQomSw48iGXVqfTp/ETI7uU7jqHK8QcjzQ4lFK5H4GZpwuqGBZi
NJJ1l97XMEas+rPHiwMEN7Z1DVhzJLCp8omEj12QU+tGLofxxwAuuOVat3CQWLRk
f80Oya3TLEgd22hGIKDRmHa22vdWnNQyS0S15wJotawBzQf+n3auS9Q3/rh979+t
ES/qvlGxTIs=
=Z8uT
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler
build, from the fast-headers tree - including placeholder *_api.h
headers for later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per
node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
sched/headers: ARM needs asm/paravirt_api_clock.h too
sched/numa: Fix boot crash on arm64 systems
headers/prep: Fix header to build standalone: <linux/psi.h>
sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
cgroup: Fix suspicious rcu_dereference_check() usage warning
sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
sched/deadline,rt: Remove unused functions for !CONFIG_SMP
sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
sched/deadline: Remove unused def_dl_bandwidth
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
sched/cpuacct: Remove redundant RCU read lock
sched/cpuacct: Optimize away RCU read lock
sched/cpuacct: Fix charge percpu cpuusage
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
...
- Reduce the amount of work to release a task stack in context
switch. There is no real reason to do cgroup accounting and memory
freeing in this performance sensitive context. Aside of this the
invoked functions cannot be called from this preemption disabled
context on PREEMPT_RT enabled kernels. Solve this by moving the
accounting into do_exit() and delaying the freeing of the stack unless
the vmap stack can be cached.
- Provide a mechanism to delay raising signals from atomic context on
PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
the information in the task struct and raise it in the exit path.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4U6gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSpkEACwgaaQUbqVrpw5yb6LbwzUPnjEdFNN
uUQCv0XZD8LWbfhcQQVSPWGho7S/w2Mkpdhi0DkVb2K0dkB7EvITSNEC4KoS/yez
8iQBpv6Lm00quHdNLjkQySSZ4NYB8M1GasBI7zSBjROK/+sRqioTPQsM0oDemGmD
uMvw0dgDJRlB8X4LZv0xuJbYLdSzu2VOlWd5aJG9BUgHkd7PfUWMlHsa29FP0hkP
A5yziOnr9kMsmCAsgmiyDW/GmefrEealby5M/jgnxTruF/OLnDsP+PYMlws47fPx
g6xpHkT5H0zQJ/nMJtK2JAlxpnbIl4cLuUnpn7wX316yjBpP2s3Pw04AVdzPPoBa
ufAoOLFtnrKN6enIqLWaJHGAsBHEULw6d3/7HoAEQOVWChnQSuWOob8z0QDbvM14
kKtz+LTrO+P5a15fd4g5+9lFBXJUTnF74SYQNwxIm2cV9hxrf15NhAr8yg+RtUvF
/ilNNAFtXkASLqs9moEi7U+GyBYwemG+gduVZ3Dw8FBxK/vHmDrhlItcZdKom+UJ
k4VFDVhzd2GYRHMrcaLfkCYew6ou+LD/rjdPhIU9OQHgILIMLY5aLqxDuyPtHqDz
TEyF5qsL4wYLIUdsWlqyHISqQQ6LfnpIyko5kb2Zt56sYtrcZr8swDy+yimiEOdL
G4BzQu0nVbCLhw==
=uGTc
-----END PGP SIGNATURE-----
Merge tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core process handling RT latency updates from Thomas Gleixner:
- Reduce the amount of work to release a task stack in context switch.
There is no real reason to do cgroup accounting and memory freeing in
this performance sensitive context.
Aside of this the invoked functions cannot be called from this
preemption disabled context on PREEMPT_RT enabled kernels. Solve this
by moving the accounting into do_exit() and delaying the freeing of
the stack unless the vmap stack can be cached.
- Provide a mechanism to delay raising signals from atomic context on
PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
the information in the task struct and raise it in the exit path.
* tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
signal, x86: Delay calling signals in atomic on RT enabled kernels
fork: Use IS_ENABLED() in account_kernel_stack()
fork: Only cache the VMAP stack in finish_task_switch()
fork: Move task stack accounting to do_exit()
fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
fork: Don't assign the stack pointer in dup_task_struct()
fork, IA64: Provide alloc_thread_stack_node() for IA64
fork: Duplicate task_struct before stack allocation
fork: Redo ifdefs around task stack handling
- Simplify the PASID handling to allocate the PASID once, associate it to
the mm of a process and free it on mm_exit(). The previous attempt of
refcounted PASIDs and dynamic alloc()/free() turned out to be error
prone and too complex. The PASID space is 20bits, so the case of
resource exhaustion is a pure academic concern.
- Populate the PASID MSR on demand via #GP to avoid racy updates via IPIs.
- Reenable ENQCMD and let objtool check for the forbidden usage of ENQCMD
in the kernel.
- Update the documentation for Shared Virtual Addressing accordingly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4WpETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUfnD/0bY94rgEX4Uuy/mFQ1W8X8XlcyKrha
0/cRATb+4QV/pwJgGr2nClKhGlFMYPdJLvKMC1TCUPCVrLD1RNmluIZoFzeqXwhm
jDdCcFOuGZ2D4ujDPWwOOpKBT1ytovnQa7+lH6QJyKkEqdcC2ncOvGJQoiRxRQIG
8wTVs/OUvQJ5ZhSZQMKQN4uMWMyHEjhbroYS30/uNi/598jTPgzlEoa14XocQ9Os
nS6ALvjuc9MsJ34F61etMaJU1ZMI3Wx75u9QjEvX6hmJs87YdvgwE7lzJUKFDEuh
gewM0wp2fTa8/azzP0eMiHTin56PqFdmllzRqXmilbZMEPOeI29dZVArCdpKcAn0
r9p1kJUT3Xl2G3Oir/OdCaaQHcznD1Y5ZFOyh12wgEucZ/rdeSr7nq7n5HoOL5Bw
Q2o6YvTkE9DOL0nTN1lSXGiPspou7fzX0uUcRBrbJUS3sBv4zGIlaJXUaTVnSdAt
VZj4LeOK7v2BjyeiOY0iaaIQd3xjmLUF0UjozXS5M13SoVcToZRbyWqhDzPvNuKA
imQb/dnFpXhABgmuqAiJLeqM0VtGMFNc780OURkcsBSPng+iSEdV4DzuhK0jpU8x
Uk1RuGMd/vgmrlDFBrw+orQQiiKR1ixpI0LiHfcOBycfJhqTwcnrNZvAN5/do28Z
E23+QzlUbZF0cw==
=Dy8V
-----END PGP SIGNATURE-----
Merge tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PASID support from Thomas Gleixner:
"Reenable ENQCMD/PASID support:
- Simplify the PASID handling to allocate the PASID once, associate
it to the mm of a process and free it on mm_exit().
The previous attempt of refcounted PASIDs and dynamic
alloc()/free() turned out to be error prone and too complex. The
PASID space is 20bits, so the case of resource exhaustion is a pure
academic concern.
- Populate the PASID MSR on demand via #GP to avoid racy updates via
IPIs.
- Reenable ENQCMD and let objtool check for the forbidden usage of
ENQCMD in the kernel.
- Update the documentation for Shared Virtual Addressing accordingly"
* tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Update documentation for SVA (Shared Virtual Addressing)
tools/objtool: Check for use of the ENQCMD instruction in the kernel
x86/cpufeatures: Re-enable ENQCMD
x86/traps: Demand-populate PASID MSR via #GP
sched: Define and initialize a flag to identify valid PASID in the task
x86/fpu: Clear PASID when copying fpstate
iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
kernel/fork: Initialize mm's PASID
iommu/ioasid: Introduce a helper to check for valid PASIDs
mm: Change CONFIG option for mm->pasid field
iommu/sva: Rename CONFIG_IOMMU_SVA_LIB to CONFIG_IOMMU_SVA
The header tracehook.h is no place for code to live. The functions
set_notify_signal and clear_notify_signal are not about signals. They
are about interruptions that act like signals. The fundamental signal
primitives wind up calling set_notify_signal and clear_notify_signal.
Which means they need to be maintained with the signal code.
Since set_notify_signal and clear_notify_signal must be maintained
with the signal subsystem move them into sched/signal.h and claim them
as part of the signal subsystem.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This allows code sharing between fast-headers tree and the vanilla
scheduler tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
There is no need to perform the stack accounting of the outgoing task in
its final schedule() invocation which happens with preemption disabled.
The task is leaving, the resources will be freed and the accounting can
happen in do_exit() before the actual schedule invocation which
frees the stack memory.
Move the accounting of the stack memory from release_task_stack() to
exit_task_stack_account() which then can be invoked from do_exit().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-7-bigeasy@linutronix.de
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
/PuMhEg=
=eU1o
-----END PGP SIGNATURE-----
Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts
New conflicts in sched/core due to the following upstream fixes:
44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")
Conflicts:
include/linux/psi_types.h
kernel/sched/psi.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Where commit 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.
Commit 13765de814 ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.
Fixes: 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
PASIDs are process-wide. It was attempted to use refcounted PASIDs to
free them when the last thread drops the refcount. This turned out to
be complex and error prone. Given the fact that the PASID space is 20
bits, which allows up to 1M processes to have a PASID associated
concurrently, PASID resource exhaustion is not a realistic concern.
Therefore, it was decided to simplify the approach and stick with lazy
on demand PASID allocation, but drop the eager free approach and make an
allocated PASID's lifetime bound to the lifetime of the process.
Get rid of the refcounting mechanisms and replace/rename the interfaces
to reflect this new approach.
[ bp: Massage commit message. ]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com
A new mm doesn't have a PASID yet when it's created. Initialize
the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1).
INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be
allocated to a user process. Initializing the process's PASID to 0 may
cause confusion that's why the process uses the reserved kernel legacy
DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly
tells the process doesn't have a valid PASID yet.
Even though the only user of mm_pasid_init() is in fork.c, define it in
<linux/sched/mm.h> as the first of three mm/pasid life cycle functions
(init/set/drop) to keep these all together.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com
Commit 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.
On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v6
MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%)
MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%)
MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%)
MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%)
STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.
Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v5
Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%)
Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%*
Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%)
Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%)
CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%)
It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v6
Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%)
Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%)
Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%)
Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%)
Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%)
Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%*
Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%)
Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%)
Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%)
Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%)
Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%)
Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%)
Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%)
Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%)
Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%)
Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%)
Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%)
Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%)
Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%)
Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%)
Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%)
Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%)
Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%)
Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%)
Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%)
Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%)
Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%)
Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%)
Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%)
Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%)
Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%)
Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%)
Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%)
Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%)
Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.
vanilla sched-numaimb-v6
Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%)
Amean ep.D 31.86 ( 0.00%) 26.17 * 17.86%*
Stddev ep.D 0.07 ( 0.00%) 0.05 ( 24.41%)
CoeffVar ep.D 0.22 ( 0.00%) 0.20 ( 7.97%)
Max ep.D 31.93 ( 0.00%) 26.21 ( 17.91%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
move autogroup sysctls to autogroup.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220128095025.8745-1-nizhen@uniontech.com
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move hung_task sysctl interface to hung_task.c and use
register_sysctl() to register the sysctl interface.
[mcgrof@kernel.org: commit log refresh and fixed 2-3 0day reported compile issues]
Link: https://lkml.kernel.org/r/20211123202347.818157-4-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.
Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the the exit
code in do_group_exit during exec.
Removing the helper makes it more obvious what is going on as both
cases must be coded for explicitly.
While removing the helper fix the two cases where I have observed
using signal_group_exit resulted in the wrong result.
In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
retargetted during an exec.
In do_group_exit use 0 as the exit code during an exec as de_thread
does not set group_exit_code. As best as I can determine
group_exit_code has been is set to 0 most of the time during
de_thread. During a thread group stop group_exit_code is set to the
stop signal and when the thread group receives SIGCONT group_exit_code
is reset to 0.
Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.
Update the comment above the definition of group_exec_task to document
how it is currently used.
Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenver the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.
Which makes SIGNAL_GROUP_COREDUMP unecessary and redundant. So stop
setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
remove it's definition.
With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.
Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull ARM cpufreq updates for 5.17-rc1 from Viresh Kumar:
"- Qcom cpufreq driver updates improve irq support (Ard Biesheuvel, Stephen Boyd,
and Vladimir Zapolskiy).
- Fixes double devm_remap for mediatek driver (Hector Yuan).
- Introduces thermal pressure helpers (Lukasz Luba)."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: mediatek-hw: Fix double devm_remap in hotplug case
cpufreq: qcom-hw: Use optional irq API
cpufreq: qcom-hw: Set CPU affinity of dcvsh interrupts
cpufreq: qcom-hw: Fix probable nested interrupt handling
cpufreq: qcom-cpufreq-hw: Avoid stack buffer for IRQ name
arch_topology: Remove unused topology_set_thermal_pressure() and related
cpufreq: qcom-cpufreq-hw: Use new thermal pressure update function
cpufreq: qcom-cpufreq-hw: Update offline CPUs per-cpu thermal pressure
thermal: cpufreq_cooling: Use new thermal pressure update function
arch_topology: Introduce thermal pressure update function
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
I completed the first batch of signal changes for v5.17 against
v5.16-rc1 before the SA_IMMUTABLE fixes where completed. Which leaves
me with two lines of development that I want on my signal development
branch both rooted at v5.16-rc1. Especially as I am hoping
to reach the point of being able to remove SA_IMMUTABLE.
Linus merged my SA_IMUTABLE fixes as:
7af959b5d5 ("Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
To avoid rebasing the development changes that are currently complete I am
merging the work I sent upstream to Linus to make my life simpler.
The SA_IMMUTABLE changes as they are described in Linus's merge commit.
Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
cleanups.
This is essentially the change you suggested with all of i's dotted
and the t's crossed so that ptrace can intercept all of the cases it
has been able to intercept the past, and all of the cases that made it
to exit without giving ptrace a chance still don't give ptrace a
chance"
* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Replace force_fatal_sig with force_exit_sig when in doubt
signal: Don't always set SA_IMMUTABLE for forced signals
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.
task_cputime_adjusted() snapshots utime and stime and then adjust their
sum to match the scheduler maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.
To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.
Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
There is no need of this function (and related) since code has been
converted to use the new arch_update_thermal_pressure() API. The old
code can be removed.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
The thermal pressure is a mechanism which is used for providing
information about reduced CPU performance to the scheduler. Usually code
has to convert the value from frequency units into capacity units,
which are understandable by the scheduler. Create a common conversion code
which can be just used via a handy API.
Internally, the topology_update_thermal_pressure() operates on frequency
in MHz and max CPU frequency is taken from 'freq_factor' (per-cpu).
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Currently, the exit_shm() function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).
This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.
To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we
initialize this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all shp's in
task->sysvshm.shm_clist and gets IPC namespace not from current task
as it was before but from shp's object itself, then call
shm_destroy(shp, ns).
Note: We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we
using special helper get_ipc_ns_not_zero() which allows to get IPC ns
refcounter only if IPC ns not in the "state of destruction".
Q/A
Q: Why can we access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther than IPC namespace
lifetime, so, if we get shp object from the task->sysvshm.shm_clist
while holding task_lock(task) nobody can steal our namespace.
Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: No. It's just fixes non-covered case when process may leave IPC
namespace without getting task->sysvshm.shm_clist list cleaned up.
Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f7991 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the event that a tracer changes which signal needs to be delivered
and that signal is currently blocked then the signal needs to be
requeued for later delivery.
With the advent of CLONE_THREAD the kernel has 2 signal queues per
task. The per process queue and the per task queue. Update the code
so that if the signal is removed from the per process queue it is
requeued on the per process queue. This is necessary to make it
appear the signal was never dequeued.
The rr debugger reasonably believes that the state of the process from
the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
by simply letting a process run. If a SIGKILL interrupts a ptrace_stop
this is not true today.
So return signals to their original queue in ptrace_signal so that
signals that are not delivered appear like they were never dequeued.
Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling
code, and the ptrace code making coredumps much more complicated than
necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop. Then
cleans up the exec interactions with coredumps. Then cleans up the
coredump interactions with exit. Finally the coredump interactions
with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process
the called vfork if exec fails is sufficient justification to change
the userspace visible behavior.
In previous discussions some of these changes were organized
differently and individually appeared to make the code base worse. As
currently written I believe they all stand on their own as cleanups
and bug fixes.
Which means that even if the worst should happen and the last change
needs to be reverted for some unimaginable reason, the code base will
still be improved.
If the worst does not happen there are a more cleanups that can be
made. Signals that generate coredumps can easily become eligible for
short circuit delivery in complete_signal. The entire rendezvous for
generating a coredump can move into get_signal. The function
force_sig_info_to_task be written in a way that does not modify the
signal handling state of the target task (because coredumps are
eligible for short circuit delivery). Many of these future cleanups
can be done another way but nothing so cleanly as if coredumps become
per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
coredump: Limit coredumps to a single thread group
coredump: Don't perform any cleanups before dumping core
exit: Factor coredump_exit_mm out of exit_mm
exec: Check for a pending fatal signal instead of core_state
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
signal: Remove the bogus sigkill_pending in ptrace_stop
Lots of new features and fixes:
* Added TOC (table of content) support, which is a debugging feature which is
either initiated by pressing the TOC button or via command in the BMC. If
pressed the Linux built-in KDB/KGDB will be called (Sven Schnelle)
* Fix CONFIG_PREEMPT (Sven)
* Fix unwinder on 64-bit kernels (Sven)
* Various kgdb fixes (Sven)
* Added KFENCE support (me)
* Switch to ARCH_STACKWALK implementation (me)
* Fix ptrace check on syscall return (me)
* Fix kernel crash with fixmaps on PA1.x machines (me)
* Move thread_info into task struct, aka CONFIG_THREAD_INFO_IN_TASK (me)
* Updated defconfigs
* Smaller cleanups, including Makefile cleanups (Masahiro Yamada),
use kthread_run() macro (Cai Huoqing), use swap() macro (Yihao Han).
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCYYALowAKCRD3ErUQojoP
X3o1APwK7wJBdFKAV2hMEouFNZLz2ZTkSQrCMhPTxRupkwJ71QD+JeXvyPLZBLIu
hlvi9mw9DKUKgCV+/Z65s8zjSHYC4wg=
=A4Ci
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/parisc-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"Lots of new features and fixes:
- Added TOC (table of content) support, which is a debugging feature
which is either initiated by pressing the TOC button or via command
in the BMC. If pressed the Linux built-in KDB/KGDB will be called
(Sven Schnelle)
- Fix CONFIG_PREEMPT (Sven)
- Fix unwinder on 64-bit kernels (Sven)
- Various kgdb fixes (Sven)
- Added KFENCE support (me)
- Switch to ARCH_STACKWALK implementation (me)
- Fix ptrace check on syscall return (me)
- Fix kernel crash with fixmaps on PA1.x machines (me)
- Move thread_info into task struct, aka CONFIG_THREAD_INFO_IN_TASK
(me)
- Updated defconfigs
- Smaller cleanups, including Makefile cleanups (Masahiro Yamada),
use kthread_run() macro (Cai Huoqing), use swap() macro (Yihao
Han)"
* tag 'for-5.16/parisc-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (36 commits)
parisc: Fix set_fixmap() on PA1.x CPUs
parisc: Use swap() to swap values in setup_bootmem()
parisc: Update defconfigs
parisc: decompressor: clean up Makefile
parisc: decompressor: remove repeated depenency of misc.o
parisc: Remove unused constants from asm-offsets.c
parisc/ftrace: use static key to enable/disable function graph tracer
parisc/ftrace: set function trace function
parisc: Make use of the helper macro kthread_run()
parisc: mark xchg functions notrace
parisc: enhance warning regarding usage of O_NONBLOCK
parisc: Drop ifdef __KERNEL__ from non-uapi kernel headers
parisc: Use PRIV_USER and PRIV_KERNEL in ptrace.h
parisc: Use PRIV_USER in syscall.S
parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
parisc: Move thread_info into task struct
parisc: add support for TOC (transfer of control)
parisc/firmware: add functions to retrieve TOC data
parisc: add PIM TOC data structures
parisc: move virt_map macro to assembly.h
...
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
The function end_of_stack() returns a pointer to the last entry of a
stack. For architectures like parisc where the stack grows upwards
return the pointer to the highest address in the stack.
Without this change I faced a crash on parisc, because the stackleak
functionality wrote STACKLEAK_POISON to the lowest address and thus
overwrote the first 4 bytes of the task_struct which included the
TIF_FLAGS.
Signed-off-by: Helge Deller <deller@gmx.de>
Add a simple helper force_fatal_sig that causes a signal to be
delivered to a process as if the signal handler was set to SIG_DFL.
Reimplement force_sigsegv based upon this new helper. This fixes
force_sigsegv so that when it forces the default signal handler
to be used the code now forces the signal to be unblocked as well.
Reusing the tested logic in force_sig_info_to_task that was built for
force_sig_seccomp this makes the implementation trivial.
This is interesting both because it makes force_sigsegv simpler and
because there are a couple of buggy places in the kernel that call
do_exit(SIGILL) or do_exit(SIGSYS) because there is no straight
forward way today for those places to simply force the exit of a
process with the chosen signal. Creating force_fatal_sig allows
those places to be implemented with normal signal exits.
Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
There is a small race between copy_process() and sched_fork()
where child->sched_task_group point to an already freed pointer.
parent doing fork() | someone moving the parent
| to another cgroup
-------------------------------+-------------------------------
copy_process()
+ dup_task_struct()<1>
parent move to another cgroup,
and free the old cgroup. <2>
+ sched_fork()
+ __set_task_cpu()<3>
+ task_fork_fair()
+ sched_slice()<4>
In the worst case, this bug can lead to "use-after-free" and
cause panic as shown above:
(1) parent copy its sched_task_group to child at <1>;
(2) someone move the parent to another cgroup and free the old
cgroup at <2>;
(3) the sched_task_group and cfs_rq that belong to the old cgroup
will be accessed at <3> and <4>, which cause a panic:
[] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
[] Oops: 0000 [#1] SMP PTI
[] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1
[] RIP: 0010:sched_slice+0x84/0xc0
[] Call Trace:
[] task_fork_fair+0x81/0x120
[] sched_fork+0x132/0x240
[] copy_process.part.5+0x675/0x20e0
[] ? __handle_mm_fault+0x63f/0x690
[] _do_fork+0xcd/0x3b0
[] do_syscall_64+0x5d/0x1d0
[] entry_SYSCALL_64_after_hwframe+0x65/0xca
[] RIP: 0033:0x7f04418cd7e1
Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
membership and thus sched_task_group can't change. So update child's
sched_task_group at sched_post_fork() and move task_fork() and
__set_task_cpu() (where accees the sched_task_group) from sched_fork()
to sched_post_fork().
Fixes: 8323f26ce3 ("sched: Fix race in task_group")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com
Make sure to prod idle CPUs so they call klp_update_patch_state().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929151723.162004989@infradead.org
Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.
In the case of vfork this looks like a real world problem. Consider
the following well defined sequence.
if (vfork() == 0) {
execve(...);
_exit(EXIT_FAILURE);
}
If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).
Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.
As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.
Solve this by making a userspace visible change to only kill a single
process/thread group. This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening. With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.
To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.
In zap_threads remove the work to kill anything except for the current
thread group.
v2: Remove core_state from the VM_BUG_ON_MM print to fix
compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
[1] a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.
Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
it delagates the eventually required invocation of __mmdrop() to RCU.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.
[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net
Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...