Space is printed after each mode value including the last one:
$ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
"none [round-robin] per-cpu "
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
User visible changes:
- Added a way to easier filter with cpumasks:
# echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter
- Show actual size of ring buffer after modifying the ring buffer size via
buffer_size_kb. Currently it just returns what was written, but the actual
size rounds up to the sub buffer size. Show that real size instead.
Major changes:
- Added "eventfs". This is the code that handles the inodes and dentries of
tracefs/events directory. As there are thousands of events, and each event
has several inodes and dentries that currently exist even when tracing is
never used, they take up precious memory. Instead, eventfs will allocate
the inodes and dentries in a JIT way (similar to what procfs does). There
is now metadata that handles the events and subdirectories, and will create
the inodes and dentries when they are used.
Note, I also have patches that remove the subdirectory meta data, but will
wait till the next merge window before applying them. It's a little more
complex, and I want to make sure the dynamic code works properly before
adding more complexity, making it easier to revert if need be.
Minor changes:
- Optimization to user event list traversal.
- Remove intermediate permission of tracefs files (note the intermediate
permission removes all access to the files so it is not a security concern,
but just a clean up.)
- Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic.
- Other minor clean ups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEXtmkj8VMCiLR0IBM68Js21pW3nMFAmTwtAsUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ68Js21pW3nNOXRAAsslQT6alY4OeplC4x47+V6+6NiIA
oDtOmWAqf7TsH9bukzRFD36rUly42O20RJDx9z0Q3iRc3vGxEawId8z6P0HmBwRb
VSl5BryWvL5Wc5w94xS8EeCuC1MRfhVDyfbtVFmWigzfvd/f+hp71ViMPHUvrRJX
KhzzNSBc4ir5E1lzfwa7meYTXzDwrQlZbYfdf5aH94IWAkqDj85PUZDJ7UmLZhXG
CIglSpNFXZ0j19Wo/U6KZlHR1XfunBKungCzJ5Dbznc9YLWZTQXOIZF4YPKfPIJL
ulRG9chwXY0nQWhG3xM1UHZLsAMSWw5i13a4ZN4d8FCNOgv8ttcJnfDk7ZYUS0Oz
RmY1dGcSRKAZTUTjm8ZBtmyiUCc9kZAIk0fyEfIHtoDYXmhnvni3wuTnbRSdXaSi
q4YkxPaLfX8Fn3QloCqqddt8iONu7BnbpZOhUCl2AtBib52gnTTF7+rQ6/0D3rjo
SSuvEHhnjJhzk+3jM2odxjmTAztNT+yu6FbKXZUKPt1Kj9YHv1J9cEQw9/Etw+GV
8jQBe979D8hFJmDOJOT/O/TdPqE9mQoMNBt6Y8QnE4nbJWM+i/MBrThFpUSQhRCr
0Ya/HgR2QyRH7RmZW5o2H9mNtN+V9c7RxZW8erYzRbUs0YofK2OpGi9SrPzxWCke
w6j0VVZHaxdPguM=
=/s+e
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"User visible changes:
- Added a way to easier filter with cpumasks:
# echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter
- Show actual size of ring buffer after modifying the ring buffer
size via buffer_size_kb.
Currently it just returns what was written, but the actual size
rounds up to the sub buffer size. Show that real size instead.
Major changes:
- Added "eventfs". This is the code that handles the inodes and
dentries of tracefs/events directory. As there are thousands of
events, and each event has several inodes and dentries that
currently exist even when tracing is never used, they take up
precious memory. Instead, eventfs will allocate the inodes and
dentries in a JIT way (similar to what procfs does). There is now
metadata that handles the events and subdirectories, and will
create the inodes and dentries when they are used.
Note, I also have patches that remove the subdirectory meta data,
but will wait till the next merge window before applying them. It's
a little more complex, and I want to make sure the dynamic code
works properly before adding more complexity, making it easier to
revert if need be.
Minor changes:
- Optimization to user event list traversal
- Remove intermediate permission of tracefs files (note the
intermediate permission removes all access to the files so it is
not a security concern, but just a clean up)
- Add the complex fix to FORTIFY_SOURCE to the kernel stack event
logic
- Other minor cleanups"
* tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
tracefs: Remove kerneldoc from struct eventfs_file
tracefs: Avoid changing i_mode to a temp value
tracing/user_events: Optimize safe list traversals
ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon()
tracing: Remove unused function declarations
tracing/filters: Document cpumask filtering
tracing/filters: Further optimise scalar vs cpumask comparison
tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Dynamically allocate filter_pred.regex
test: ftrace: Fix kprobe test for eventfs
eventfs: Move tracing/events to eventfs
eventfs: Implement removal of meta data from eventfs
eventfs: Implement functions to create files and dirs when accessed
eventfs: Implement eventfs lookup, read, open functions
eventfs: Implement eventfs file add functions
...
* Unbound workqueues now support more flexible affinity scopes. The default
behavior is to soft-affine according to last level cache boundaries. A
work item queued from a given LLC is executed by a worker running on the
same LLC but the worker may be moved across cache boundaries as the
scheduler sees fit. On machines which multiple L3 caches, which are
becoming more popular along with chiplet designs, this improves cache
locality while not harming work conservation too much.
Unbound workqueues are now also a lot more flexible in terms of execution
affinity. Differeing levels of affinity scopes are supported and both the
default and per-workqueue affinity settings can be modified dynamically.
This should help working around amny of sub-optimal behaviors observed
recently with asymmetric ARM CPUs.
This involved signficant restructuring of workqueue code. Nothing was
reported yet but there's some risk of subtle regressions. Should keep an
eye out.
* Rescuer workers now has more identifiable comms.
* workqueue.unbound_cpus added so that CPUs which can be used by workqueue
can be constrained early during boot.
* Now that all the in-tree users have been flushed out, trigger warning if
system-wide workqueues are flushed.
* One pull commit from for-6.5-fixes to avoid cascading conflicts in the
affinity scope patchset.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZPERlQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVqQAPwIOy9tWY5jFAmMuIyH6wV50hbmfxCc2n5xhQNr
5HoyGgEA8lw1W7afDCIPiQVA7AYsu8dhwuNSOcRCJxhrrn4XsA0=
=g/Uu
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- Unbound workqueues now support more flexible affinity scopes.
The default behavior is to soft-affine according to last level cache
boundaries. A work item queued from a given LLC is executed by a
worker running on the same LLC but the worker may be moved across
cache boundaries as the scheduler sees fit. On machines which
multiple L3 caches, which are becoming more popular along with
chiplet designs, this improves cache locality while not harming work
conservation too much.
Unbound workqueues are now also a lot more flexible in terms of
execution affinity. Differeing levels of affinity scopes are
supported and both the default and per-workqueue affinity settings
can be modified dynamically. This should help working around amny of
sub-optimal behaviors observed recently with asymmetric ARM CPUs.
This involved signficant restructuring of workqueue code. Nothing was
reported yet but there's some risk of subtle regressions. Should keep
an eye out.
- Rescuer workers now has more identifiable comms.
- workqueue.unbound_cpus added so that CPUs which can be used by
workqueue can be constrained early during boot.
- Now that all the in-tree users have been flushed out, trigger warning
if system-wide workqueues are flushed.
* tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (31 commits)
workqueue: fix data race with the pwq->stats[] increment
workqueue: Rename rescuer kworker
workqueue: Make default affinity_scope dynamically updatable
workqueue: Add "Affinity Scopes and Performance" section to documentation
workqueue: Implement non-strict affinity scope for unbound workqueues
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue: Factor out need_more_worker() check and worker wake-up
workqueue: Factor out work to worker assignment and collision handling
workqueue: Add multiple affinity scopes and interface to select them
workqueue: Modularize wq_pod_type initialization
workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
workqueue: Generalize unbound CPU pods
workqueue: Factor out clearing of workqueue-only attrs fields
workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
workqueue: Initialize unbound CPU pods later in the boot
workqueue: Move wq_pod_init() below workqueue_init()
workqueue: Rename NUMA related names to use pod instead
workqueue: Rename workqueue_attrs->no_numa to ->ordered
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
...
* Per-cpu cpu usage stats are now tracked. This currently isn't printed out
in the cgroupfs interface and can only be accessed through e.g. BPF.
Should decide on a not-too-ugly way to show per-cpu stats in cgroupfs.
* cpuset received some cleanups and prepatory patches for the pending
cpus.exclusive patchset which will allow cpuset partitions to be created
below non-partition parents, which should ease the management of partition
cpusets.
* A lot of code and documentation cleanup patches.
* tools/testing/selftests/cgroup/test_cpuset.c is added. This causes trivial
conflicts in .gitignore and Makefile under the directory against
fe3b1bf19b ("selftests: cgroup: add test_zswap program"). They can be
resolved by keeping lines from both branches.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZPENTg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGcyBAP44cHwpSFxXe3cehxAzb1l/2BZXtzU5l48OqUQd
MwHyrwEAm7+MTVAR2xOF4f+oVM9KWmKj7oV7Clpixl1S7hHyjwE=
=FCc9
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Per-cpu cpu usage stats are now tracked
This currently isn't printed out in the cgroupfs interface and can
only be accessed through e.g. BPF. Should decide on a not-too-ugly
way to show per-cpu stats in cgroupfs
- cpuset received some cleanups and prepatory patches for the pending
cpus.exclusive patchset which will allow cpuset partitions to be
created below non-partition parents, which should ease the management
of partition cpusets
- A lot of code and documentation cleanup patches
- tools/testing/selftests/cgroup/test_cpuset.c added
* tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits)
cgroup: Avoid -Wstringop-overflow warnings
cgroup:namespace: Remove unused cgroup_namespaces_init()
cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
cgroup: clean up if condition in cgroup_pidlist_start()
cgroup: fix obsolete function name in cgroup_destroy_locked()
Documentation: cgroup-v2.rst: Correct number of stats entries
cgroup: fix obsolete function name above css_free_rwork_fn()
cgroup/cpuset: fix kernel-doc
cgroup: clean up printk()
cgroup: fix obsolete comment above cgroup_create()
docs: cgroup-v1: fix typo
docs: cgroup-v1: correct the term of Page Cache organization in inode
cgroup/misc: Store atomic64_t reads to u64
cgroup/misc: Change counters to be explicit 64bit types
cgroup/misc: update struct members descriptions
cgroup: remove cgrp->kn check in css_populate_dir()
cgroup: fix obsolete function name
cgroup: use cached local variable parent in for loop
cgroup: remove obsolete comment above struct cgroupstats
cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED
...
percpu
* A couple cleanups by Baoquan He and Bibo Mao. The only behavior change
is to start printing messages if we're under the warn limit for failed
atomic allocations.
percpu_counter
* Shakeel introduced percpu counters into mm_struct which caused percpu
allocations be on the hot path [1]. Originally I spent some time
trying to improve the percpu allocator, but instead preferred what
Mateusz Guzik proposed grouping at the allocation site,
percpu_counter_init_many(). This allows a single percpu allocation to
be shared by the counters. I like this approach because it creates a
shared lifetime by the allocations. Additionally, I believe many inits
have higher level synchronization requirements, like percpu_counter
does against HOTPLUG_CPU. Therefore we can group these optimizations
together.
[1] https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE3hZPHJdcVwe+yTTtiDc0yuoFPR0FAmTv2IUACgkQiDc0yuoF
PR0+gg//U430Y9jRSKQtbh3dEPaAeWGcTfSTnVHbQGfBj3A4ePJyWl/Tgzri31AC
rzr8SRs0yX8b82TbECWsV67i/GrntLJyz4yQ52S/RRqVwnQqSn/wicEdCY00lJBt
Tye8zApOnYBouaYqIOxm/M7ofvKzJ3gWOVeF/zBwM6hwvNaXXtY5r86fSDxoEbhY
HOFnCDmg5Spf0U50j1G7nV5KfAb7BNA3/HFyzfzH+w+OWi4IGbThsfrg1qvjyFot
KlEK/kF8Af2xj2A2se4XFsLc2D/Tj+29juYVQqIPBJzVPrZ2uerKSszK5Zcr+Use
kMiG7tRWKE+2vkOM1RQ5Y5NCVEBhlXlienz1gf/C7247SEGs6OIyqvyDAgPTRx6p
oR2/vx9hMtaSMf4aHWd+fYS5gNZ05iMvOIbRZnI1wZkQglQVkJvXhzuLaJ+dIGSP
ypv6XOepik7vDjZ3p3xJXd0TAn4NSkn3jWRetrymdtMFanF99qw1VqjmkLecSil0
Gr0UhRL1oiMde6niVJrOpdOGLwt/M4N99Y5rksw6NCnktRJ99coFGj7LglZGMsu+
YkOyjD8MVJXTkBtBNGeqHTKe6nyVkHFq9ad5EmWjPkefP5JziH8i18k7JlF1dLA5
c8peq3ES659D5f0mU2jilD9PsCsBfSn6Of4ruMZa2Zr1XDD8snI=
=vcA1
-----END PGP SIGNATURE-----
Merge tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
"One bigger change to percpu_counter's api allowing for init and
destroy of multiple counters via percpu_counter_init_many() and
percpu_counter_destroy_many(). This is used to help begin remediating
a performance regression with percpu rss stats.
Additionally, it seems larger core count machines are feeling the
burden of the single threaded allocation of percpu. Mateusz is
thinking about it and I will spend some time on it too.
percpu:
- A couple cleanups by Baoquan He and Bibo Mao. The only behavior
change is to start printing messages if we're under the warn limit
for failed atomic allocations.
percpu_counter:
- Shakeel introduced percpu counters into mm_struct which caused
percpu allocations be on the hot path [1]. Originally I spent some
time trying to improve the percpu allocator, but instead preferred
what Mateusz Guzik proposed grouping at the allocation site,
percpu_counter_init_many(). This allows a single percpu allocation
to be shared by the counters. I like this approach because it
creates a shared lifetime by the allocations. Additionally, I
believe many inits have higher level synchronization requirements,
like percpu_counter does against HOTPLUG_CPU. Therefore we can
group these optimizations together"
Link: https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/ [1]
* tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
kernel/fork: group allocation/free of per-cpu counters for mm struct
pcpcntr: add group allocation/free
mm/percpu.c: print error message too if atomic alloc failed
mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bit
mm/percpu.c: remove redundant check
mm/percpu: Remove some local variables in pcpu_populate_pte
Here is the big set of tty and serial driver changes for 6.6-rc1.
Lots of cleanups in here this cycle, and some driver updates. Short
summary is:
- Jiri's continued work to make the tty code and apis be a bit more
sane with regards to modern kernel coding style and types
- cpm_uart driver updates
- n_gsm updates and fixes
- meson driver updates
- sc16is7xx driver updates
- 8250 driver updates for different hardware types
- qcom-geni driver fixes
- tegra serial driver change
- stm32 driver updates
- synclink_gt driver cleanups
- tty structure size reduction
All of these have been in linux-next this week with no reported issues.
The last bit of cleanups from Jiri and the tty structure size reduction
came in last week, a bit late but as they were just style changes and
size reductions, I figured they should get into this merge cycle so that
others can work on top of them with no merge conflicts.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH+jA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykKyACgldt6QeenTN+6dXIHS/eQHtTKZwMAn3arSeXI
QrUUnLFjOWyoX87tbMBQ
=LVw0
-----END PGP SIGNATURE-----
Merge tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big set of tty and serial driver changes for 6.6-rc1.
Lots of cleanups in here this cycle, and some driver updates. Short
summary is:
- Jiri's continued work to make the tty code and apis be a bit more
sane with regards to modern kernel coding style and types
- cpm_uart driver updates
- n_gsm updates and fixes
- meson driver updates
- sc16is7xx driver updates
- 8250 driver updates for different hardware types
- qcom-geni driver fixes
- tegra serial driver change
- stm32 driver updates
- synclink_gt driver cleanups
- tty structure size reduction
All of these have been in linux-next this week with no reported
issues. The last bit of cleanups from Jiri and the tty structure size
reduction came in last week, a bit late but as they were just style
changes and size reductions, I figured they should get into this merge
cycle so that others can work on top of them with no merge conflicts"
* tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits)
tty: shrink the size of struct tty_struct by 40 bytes
tty: n_tty: deduplicate copy code in n_tty_receive_buf_real_raw()
tty: n_tty: extract ECHO_OP processing to a separate function
tty: n_tty: unify counts to size_t
tty: n_tty: use u8 for chars and flags
tty: n_tty: simplify chars_in_buffer()
tty: n_tty: remove unsigned char casts from character constants
tty: n_tty: move newline handling to a separate function
tty: n_tty: move canon handling to a separate function
tty: n_tty: use MASK() for masking out size bits
tty: n_tty: make n_tty_data::num_overrun unsigned
tty: n_tty: use time_is_before_jiffies() in n_tty_receive_overrun()
tty: n_tty: use 'num' for writes' counts
tty: n_tty: use output character directly
tty: n_tty: make flow of n_tty_receive_buf_common() a bool
Revert "tty: serial: meson: Add a earlycon for the T7 SoC"
Documentation: devices.txt: Fix minors for ttyCPM*
Documentation: devices.txt: Remove ttySIOC*
Documentation: devices.txt: Remove ttyIOC*
serial: 8250_bcm7271: improve bcm7271 8250 port
...
- Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
configured SMT state when hotplugging CPUs into the system.
- Combine final TLB flush and lazy TLB mm shootdown IPIs when using the Radix
MMU to avoid a broadcast TLBIE flush on exit.
- Drop the exclusion between ptrace/perf watchpoints, and drop the now unused
associated arch hooks.
- Add support for the "nohlt" command line option to disable CPU idle.
- Add support for -fpatchable-function-entry for ftrace, with GCC >= 13.1.
- Rework memory block size determination, and support 256MB size on systems
with GPUs that have hotpluggable memory.
- Various other small features and fixes.
Thanks to: Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev,
Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam Menghani, Geoff Levand,
Hari Bathini, Immad Mir, Jialin Zhang, Joel Stanley, Jordan Niethe, Justin
Stitt, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He,
Linus Walleij, Mahesh Salgaonkar, Masahiro Yamada, Michal Suchanek, Nageswara
R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick
Desaulniers, Omar Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell
Currey, Sourabh Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav
Jain, Xiongfeng Wang, Yuan Tan, Zhang Rui, Zheng Zengkai.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmTwgbwTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFmpD/432vipeoqvkAYsyK0xi/Y3GcY0wcyd
WJApLXXadEbtKQrgXQ6sowWqalg5thYnQCRarg/tXKK/po3KfgwkPjGDpOL+cIdr
12QVN2XJm9VmJ1wYJxzk+yXx4F43AdmMdr94qWAGufbTHezwb4UpzVR1NxtFrOE/
X5TNsC2+2mdZY/ZaNHS5vsTIFv3EhQfqgjZPlIAdLn6CGc8xWT514Q/uHA8+ytM/
HL7Hqs33DoPSvgTa5TT/2E0d0k5nO3P5KObzAjpYlireTPaBi51mpKGewcrtm0o2
v3cBlbfx3C7pe9ZhKBK9BH8cjynfiqsVZ9/lCw/7eBNdm9tHuzG0jeS7Db9tCZXS
fM7G2R7SoIusPTqxlBmkU5DpYslwrHiVgCyy3ijxkoA/fakVwh/GgTcMsRt73IY6
n6DsUvWwuYHCIeIiHmHQJqCqCRtV+aMzU3AbbBHOjtdIanhlW16M686dEsgCirh7
akRVRD5VqKaqXs34PpkRL89Xv3wZRjl6XZ3hZFfCjSYXfpXDXhgSToIskpHYhKL8
gpY7WtG9YQP05Xz5HRCx6EluaZVeKe0lZi6fezX7Mi9AygJQO8FfXqP1mHBlEq40
ThWtvL9D89RV6lADqqFN20XepgvKNOyAXcE4szvsnIZYUSPmZQZSPxx+DHtROaLP
jX3ifxtxJp92pQ==
=5g7K
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
configured SMT state when hotplugging CPUs into the system
- Combine final TLB flush and lazy TLB mm shootdown IPIs when using the
Radix MMU to avoid a broadcast TLBIE flush on exit
- Drop the exclusion between ptrace/perf watchpoints, and drop the now
unused associated arch hooks
- Add support for the "nohlt" command line option to disable CPU idle
- Add support for -fpatchable-function-entry for ftrace, with GCC >=
13.1
- Rework memory block size determination, and support 256MB size on
systems with GPUs that have hotpluggable memory
- Various other small features and fixes
Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam
Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel
Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof
Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar,
Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor,
Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar
Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh
Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain,
Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai.
* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits)
macintosh/ams: linux/platform_device.h is needed
powerpc/xmon: Reapply "Relax frame size for clang"
powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached
powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled
powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
powerpc/mpc5xxx: Add missing fwnode_handle_put()
powerpc/config: Disable SLAB_DEBUG_ON in skiroot
powerpc/pseries: Remove unused hcall tracing instruction
powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n
powerpc: dts: add missing space before {
powerpc/eeh: Use pci_dev_id() to simplify the code
powerpc/64s: Move CPU -mtune options into Kconfig
powerpc/powermac: Fix unused function warning
powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT
powerpc: Don't include lppaca.h in paca.h
powerpc/pseries: Move hcall_vphn() prototype into vphn.h
powerpc/pseries: Move VPHN constants into vphn.h
cxl: Drop unused detach_spa()
powerpc: Drop zalloc_maybe_bootmem()
powerpc/powernv: Use struct opal_prd_msg in more places
...
Convert IBT selftest to asm to fix objtool warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ
krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC
Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/
gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO
Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR
ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx
rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9
MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/
Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s
VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh
8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp
AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA=
=3UUm
-----END PGP SIGNATURE-----
Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part of Intel's
Control-flow Enforcement Technology (CET).
CET consists of two related security features: shadow stacks and
indirect branch tracking. This series implements just the shadow stack
part of this feature, and just for userspace.
The main use case for shadow stack is providing protection against
return oriented programming attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has
protections against modification. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and
to the special permission shadow stack. Upon RET, the processor pops
the shadow stack copy and compares it to the normal stack copy.
For more information, refer to the links below for the earlier
versions of this patch set"
Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/
* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/shstk: Change order of __user in type
x86/ibt: Convert IBT selftest to asm
x86/shstk: Don't retry vm_munmap() on -EINTR
x86/kbuild: Fix Documentation/ reference
x86/shstk: Move arch detail comment out of core mm
x86/shstk: Add ARCH_SHSTK_STATUS
x86/shstk: Add ARCH_SHSTK_UNLOCK
x86: Add PTRACE interface for shadow stack
selftests/x86: Add shadow stack test
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/shstk: Wire in shadow stack interface
x86: Expose thread features in /proc/$PID/status
x86/shstk: Support WRSS for userspace
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Check that signal frame is shadow stack mem
x86/shstk: Check that SSP is aligned on sigreturn
x86/shstk: Handle signals for shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle thread shadow stack
x86/shstk: Add user-mode shadow stack support
...
Move the #endif a line so that free_page label is only seen by the
compile pass when actually used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Reviewed-by: Robin Murphy <roin.murphy@arm.com>
- Work from Carlos Bilbao to integrate rustdoc output into the generated
HTML documentation. This took some work to figure out how to do it
without slowing the docs build and without creating people who don't have
Rust installed, but Carlos got there.
- Move the loongarch and mips architecture documentation under
Documentation/arch/.
- Some more maintainer documentation from Jakub
...plus the usual assortment of updates, translations, and fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmTvqNkPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YgIgH/3drfLtlFtzLqDOzrzDXS8yGnE3pPdxw796b
/ZFzAK16wYKaKevYoIz8bVGGKaE1sEUW0mhlq4KGdfZuxLG8YnWS8URyCW4FDU2E
6qNL+8oJ8LZfID46f9Q8ZgfEz7yF/mhCqPk7MEswYtwbscs2ZTGCTGYB/5BHlBuT
LR+M89uLmHgr8S1o24v30OgiX+VvQFyu0xoxIhbiqUZvBd/XdfX2pgYd9BGzMj5q
C2ZP+V14g36c5pV0EO9TwhCXOF/WVrp7DbjbfWAsqBSLxvpXPydH2q1DUzGeQtP1
exujrBD1O8q3pPdaNA5R+h6cWlHmUZug9mE4BRLp9ErGrozwJsQ=
=C3Uv
-----END PGP SIGNATURE-----
Merge tag 'docs-6.6' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Documentation work keeps chugging along; this includes:
- Work from Carlos Bilbao to integrate rustdoc output into the
generated HTML documentation. This took some work to figure out how
to do it without slowing the docs build and without creating people
who don't have Rust installed, but Carlos got there
- Move the loongarch and mips architecture documentation under
Documentation/arch/
- Some more maintainer documentation from Jakub
... plus the usual assortment of updates, translations, and fixes"
* tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits)
Docu: genericirq.rst: fix irq-example
input: docs: pxrc: remove reference to phoenix-sim
Documentation: serial-console: Fix literal block marker
docs/mm: remove references to hmm_mirror ops and clean typos
docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add()
Documentation: Fix typos
Documentation/ABI: Fix typos
scripts: kernel-doc: fix macro handling in enums
scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN]
Documentation: riscv: Update boot image header since EFI stub is supported
Documentation: riscv: Add early boot document
Documentation: arm: Add bootargs to the table of added DT parameters
docs: kernel-parameters: Refer to the correct bitmap function
doc: update params of memhp_default_state=
docs: Add book to process/kernel-docs.rst
docs: sparse: fix invalid link addresses
docs: vfs: clean up after the iterate() removal
docs: Add a section on surveys to the researcher guidelines
docs: move mips under arch
docs: move loongarch under arch
...
Resetting rules' stateful data happens outside of the transaction logic,
so 'get' and 'dump' handlers have to emit audit log entries themselves.
Fixes: 8daa8fde3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since set element reset is not integrated into nf_tables' transaction
logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET
handling.
For the sake of simplicity, catchall element reset will always generate
a dedicated log entry. This relieves nf_tables_dump_set() from having to
adjust the logged element count depending on whether a catchall element
was found or not.
Fixes: 079cd63321 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCZO0WoxQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5alsAP0UZQIKI2zEjFdtucgClcSouflIOC5i
Hvtgv3qVFXPZQwEA2H/SGjigtH5NruVXECDZdrIfaGGvBhyeY72lbswXfQ0=
=Gu8i
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity subsystem updates from Mimi Zohar:
- With commit 099f26f22f ("integrity: machine keyring CA
configuration") certificates may be loaded onto the IMA keyring,
directly or indirectly signed by keys on either the "builtin" or the
"machine" keyrings.
With the ability for the system/machine owner to sign the IMA policy
itself without needing to recompile the kernel, update the IMA
architecture specific policy rules to require the IMA policy itself
be signed.
[ As commit 099f26f22f was upstreamed in linux-6.4, updating the
IMA architecture specific policy now to require signed IMA policies
may break userspace expectations. ]
- IMA only checked the file data hash was not on the system blacklist
keyring for files with an appended signature (e.g. kernel modules,
Power kernel image).
Check all file data hashes regardless of how it was signed
- Code cleanup, and a kernel-doc update
* tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments
ima: require signed IMA policy when UEFI secure boot is enabled
integrity: Always reference the blacklist keyring with appraisal
ima: Remove deprecated IMA_TRUSTED_KEYRING Kconfig
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmTuKLcUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXM/Eg//cwaOu/ASS08Cz/tfXeKpzg9UpzbW
uHqGtgdE9ZEvS71z+3dorOJVPEwPr+/yviq3FXYjYHFqvVhLZCvYM9rw+eNo/k4T
I95UTchGUsMWwkw61YBDLythfXm2UL5nabjckO81i9UPtxUYOwF6xQMQXYyMcLL8
6fm1vnCvK5FBEXi2HSUWy3Eb3wdviGdHrL6h19Aeew+q8u33asWSxn9vmBSSFEzZ
492//Pgy0t3FA6paWXQRvoR+GvLgBXNOvHB68cAx9vS8Lq6mAwJJSCRrQtKGh2Gd
YInr49f+TXOosD5Tm6ueWO4sr8RzQZ7nPyM+BLue4Yn2ZzdYgjwfHdkHWS1KeH5X
qVqa9s6/QONvkSCzqHs/ne2qio1Q0/0uGgwOkx6N7oVWQWjE7iTYlADwM0CDJnd2
UD7AHTOgpc88x1T1eW599MZttSCznBTSFXv4waaS5/5NT9n8Db1TpTtCTedOc1x2
n+c+F5BHLy69vhSGCanvum/8i2gNoKVyYaHyaMsQxr5LRcLnvN6oOjWIv7jMKxe7
GavUAxU7M5rxPUH44vrrrI+XztKJOdpCz4S0xp+7pSSSGAK5KkmVVLXjzrlGO1WS
55ixxQWYTGK0KlWHp4Ofi6brE9a4ATKcd1XscPN+AtBYX2ufNHLskCZulu/lyrMx
lAy9RRDe1hHWTvg=
=dnm4
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull LSM updates from Paul Moore:
- Add proper multi-LSM support for xattrs in the
security_inode_init_security() hook
Historically the LSM layer has only allowed a single LSM to add an
xattr to an inode, with IMA/EVM measuring that and adding its own as
well. As we work towards promoting IMA/EVM to a "proper LSM" instead
of the special case that it is now, we need to better support the
case of multiple LSMs each adding xattrs to an inode and after
several attempts we now appear to have something that is working
well. It is worth noting that in the process of making this change we
uncovered a problem with Smack's SMACK64TRANSMUTE xattr which is also
fixed in this pull request.
- Additional LSM hook constification
Two patches to constify parameters to security_capget() and
security_binder_transfer_file(). While I generally don't make a
special note of who submitted these patches, these were the work of
an Outreachy intern, Khadija Kamran, and that makes me happy;
hopefully it does the same for all of you reading this.
- LSM hook comment header fixes
One patch to add a missing hook comment header, one to fix a minor
typo.
- Remove an old, unused credential function declaration
It wasn't clear to me who should pick this up, but it was trivial,
obviously correct, and arguably the LSM layer has a vested interest
in credentials so I merged it. Sadly I'm now noticing that despite my
subject line cleanup I didn't cleanup the "unsued" misspelling, sigh
* tag 'lsm-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: constify the 'file' parameter in security_binder_transfer_file()
lsm: constify the 'target' parameter in security_capget()
lsm: add comment block for security_sk_classify_flow LSM hook
security: Fix ret values doc for security_inode_init_security()
cred: remove unsued extern declaration change_create_files_as()
evm: Support multiple LSMs providing an xattr
evm: Align evm_inode_init_security() definition with LSM infrastructure
smack: Set the SMACK64TRANSMUTE xattr in smack_inode_init_security()
security: Allow all LSMs to provide xattrs for inode_init_security hook
lsm: fix typo in security_file_lock() comment header
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmTuKIQUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMSahAA4o+mfGxcadExo8wsEFfizsQd0JS1
6KpV8Gl9/uwPTCUmvjquFnTb5tbNFZ1X7jnj2g0+/ZHYPp9yJQqTKu7NX1Q9w+dE
11tiipc4CyrcJpWrjBinNH27txjulLSCN1imMnRYLZOpk1AbXTwjuLjFBy2iTDtm
8TAPj4vcKbi5MlcUodp/DGO6ysL75gTsLn5UUsHJhWbofz4ECay0heQoPeZ/MaW3
gBPMRgt/REg8ikdR/ntFMOD6ywBZZ0Vsf/S+hNWGwHUgGxQ5H7rJBEFI65HL4Ur1
c36UFRsypT1sFaIDbS/PrvpT3M48XwmqdmWNx5Z1dtJCCwNhuhsmEkXB+GEud2qM
SOQQfMgfjKvnaLMPUmDePuAiSflSJj2AHo1HXlYxKFtybI1plJGiRoDX5jlsklCp
JbwUJ2y7YlxNPIaZSBHYIUuniUDqET83cR2D3YJiU+2I9myg8Z5Amto8d4MFgf21
f4qfm0SDBMvXYHUuhUry0/kuk2A0R89H4HUNcrGky+cSsaelpm06uaxj43B/M9Dp
v1nSwDQpDtYKSt+16GUDfqq5BywjwMe4J7wlE9+YdTDrvuc2qUxZMky5GzZ55Wnl
mbe6BVEBc19FhDeC3muhgV0jWCUGKuq6q+W+CRmxafyOMzX9NIDFaZf1KxkaesxD
S9I7AYmT7fCghFQ=
=tZaJ
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Six audit patches, the highlights are:
- Add an explicit cond_resched() call when generating PATH records
Certain tracefs/debugfs operations can generate a *lot* of audit
PATH entries and if one has an aggressive system configuration (not
the default) this can cause a soft lockup in the audit code as it
works to process all of these new entries.
This is in sharp contrast to the common case where only one or two
PATH entries are logged. In order to fix this corner case without
excessively impacting the common case we're adding a single
cond_rescued() call between two of the most intensive loops in the
__audit_inode_child() function.
- Various minor cleanups
We removed a conditional header file as the included header already
had the necessary logic in place, fixed a dummy function's return
value, and the usual collection of checkpatch.pl noise (whitespace,
brace, and trailing statement tweaks)"
* tag 'audit-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: move trailing statements to next line
audit: cleanup function braces and assignment-in-if-condition
audit: add space before parenthesis and around '=', "==", and '<'
audit: fix possible soft lockup in __audit_inode_child()
audit: correct audit_filter_inodes() definition
audit: include security.h unconditionally
It makes no sense to expose CONFIG_DMA_NUMA_CMA if CONFIG_NUMA is not
enabled, and random config options shouldn't be default unless there
is a good reason. Replace the default NUMA with a depends on to fix both
issues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <roin.murphy@arm.com>
Xiongfeng reported and debugged a self deadlock of the task which initiates
and controls a CPU hot-unplug operation vs. the CFS bandwidth timer.
CPU1 CPU2
T1 sets cfs_quota
starts hrtimer cfs_bandwidth 'period_timer'
T1 is migrated to CPU2
T1 initiates offlining of CPU1
Hotplug operation starts
...
'period_timer' expires and is re-enqueued on CPU1
...
take_cpu_down()
CPU1 shuts down and does not handle timers
anymore. They have to be migrated in the
post dead hotplug steps by the control task.
T1 runs the post dead offline operation
T1 is scheduled out
T1 waits for 'period_timer' to expire
T1 waits there forever if it is scheduled out before it can execute the hrtimer
offline callback hrtimers_dead_cpu().
Cure this by delegating the hotplug control operation to a worker thread on
an online CPU. This takes the initiating user space task, which might be
affected by the bandwidth timer, completely out of the picture.
Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com
Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
In commit 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the
new function report_idle_softirq() was created by breaking code out of the
existing can_stop_idle_tick() for kernels v5.18 and newer.
In doing so, the code essentially went from a one conditional:
if (a && b && c)
warn();
to a three conditional:
if (!a)
return;
if (!b)
return;
if (!c)
return;
warn();
But that conversion got the condition for the RT specific
local_bh_blocked() wrong. The original condition was:
!local_bh_blocked()
but the conversion failed to negate it so it ended up as:
if (!local_bh_blocked())
return false;
This issue lay dormant until another fixup for the same commit was added
in commit a7e282c777 ("tick/rcu: Fix bogus ratelimit condition").
This commit realized the ratelimit was essentially set to zero instead
of ten, and hence *no* softirq pending messages would ever be issued.
Once this commit was backported via linux-stable, both the v6.1 and v6.4
preempt-rt kernels started printing out 10 instances of this at boot:
NOHZ tick-stop error: local softirq work is pending, handler #80!!!
Remove the negation and return when local_bh_blocked() evaluates to true to
bring the correct behaviour back.
Fixes: 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Reviewed-by: Wen Yang <wenyang.linux@foxmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230818200757.1808398-1-paul.gortmaker@windriver.com
- allow dynamic sizing of the swiotlb buffer, to cater for secure
virtualization workloads that require all I/O to be bounce buffered
(Petr Tesarik)
- move a declaration to a header (Arnd Bergmann)
- check for memory region overlap in dma-contiguous (Binglei Wang)
- remove the somewhat dangerous runtime swiotlb-xen enablement and
unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)
- per-node CMA improvements (Yajun Deng)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmTuDHkLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOqvhAApMk2/ceTgVH17sXaKE822+xKvgv377O6TlggMeGG
W4zA0KD69DNz0AfaaCc5U5f7n8Ld/YY1RsvkHW4b3jgw+KRTeQr0jjitBgP5kP2M
A1+qxdyJpCTwiPt9s2+JFVPeyZ0s52V6OJODKRG3s0ore55R+U09VySKtASON+q3
GMKfWqQteKC+thg7NkrQ7JUixuo84oICws+rZn4K9ifsX2O0HYW6aMW0feRfZjJH
r0TgqZc4RdPTSaF22oapR9Ls39+7hp/pBvoLm5sBNA3cl5C3X4VWo9ERMU1jW9h+
VYQv39NycUspgskWJmpbU06/+ooYqQlwHSR/vdNusmFIvxo4tf6/UX72YO5F8Dar
ap0wYGauiEwTjSnhVxPTXk3obWyWEsgFAeRnPdTlH2CNmv38QZU2HLb8eU1pcXxX
j+WI2Ewy9z22uBVYiPOKpdW1jkSfmlmfPp/8SbAdua7I3YQ90rQN6AvU06zAi/cL
NQTgO81E4jPkygqAVgS/LeYziWAQ73yM7m9ExThtTgqFtHortwhJ4Fd8XKtvtvEb
viXAZ/WZtQBv/CIKAW98NhgIDP/SPOT8ym6V35WK+kkNFMS6LMSQUfl9GgbHGyFa
n9icMm7BmbDtT1+AKNafG9En4DtAf9M9QNidAVOyfrsIk6S0gZoZwvIStkA7on8a
cNY=
=kVVr
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-maping updates from Christoph Hellwig:
- allow dynamic sizing of the swiotlb buffer, to cater for secure
virtualization workloads that require all I/O to be bounce buffered
(Petr Tesarik)
- move a declaration to a header (Arnd Bergmann)
- check for memory region overlap in dma-contiguous (Binglei Wang)
- remove the somewhat dangerous runtime swiotlb-xen enablement and
unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)
- per-node CMA improvements (Yajun Deng)
* tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: optimize get_max_slots()
swiotlb: move slot allocation explanation comment where it belongs
swiotlb: search the software IO TLB only if the device makes use of it
swiotlb: allocate a new memory pool when existing pools are full
swiotlb: determine potential physical address limit
swiotlb: if swiotlb is full, fall back to a transient memory pool
swiotlb: add a flag whether SWIOTLB is allowed to grow
swiotlb: separate memory pool data from other allocator data
swiotlb: add documentation and rename swiotlb_do_find_slots()
swiotlb: make io_tlb_default_mem local to swiotlb.c
swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated
dma-contiguous: check for memory region overlap
dma-contiguous: support numa CMA for specified node
dma-contiguous: support per-numa CMA for all architectures
dma-mapping: move arch_dma_set_mask() declaration to header
swiotlb: unexport is_swiotlb_active
x86: always initialize xen-swiotlb when xen-pcifront is enabling
xen/pci: add flag for PCI passthrough being possible
Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and
placings sysctls to their own sybsystem or file to help avoid merge conflicts.
Matthew Wilcox pointed out though that if we're going to do that we might as
well also *save* space while at it and try to remove the extra last sysctl
entry added at the end of each array, a sentintel, instead of bloating the
kernel by adding a new sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves of
kernel/sysctl.c arrays and measuring the impact on size by each new move.
The complex part of the effort to help reduce the size of each sysctl is being
done by the patient work of el señor Don Joel Granados. A lot of this is truly
painful code refactoring and testing and then trying to measure the savings of
each move and removing the sentinels. Although Joel already has code which does
most of this work, experience with sysctl moves in the past shows is we need to
be careful due to the slew of odd build failures that are possible due to the
amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major housekeeping
needed to remove the sentinels, which is part of this merge request. The rest
of the work to actually remove the sentinels will be done later in future
kernel releases.
At first I was only going to send his first 7 patches of his patch series,
posted 1 month ago, but in retrospect due to the testing the changes have
received in linux-next and the minor changes they make this goes with the
entire set of patches Joel had planned: just sysctl house keeping. There are
networking changes but these are part of the house keeping too.
The preliminary math is showing this will all help reduce the overall build
time size of the kernel and run time memory consumed by the kernel by about
~64 bytes per array where we are able to remove each sentinel in the future.
That also means there is no more bloating the kernel with the extra ~64 bytes
per array moved as no new sentinels are created.
Most of this has been in linux-next for about a month, the last 7 patches took
a minor refresh 2 week ago based on feedback.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO
Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC
R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+
U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x
Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw
piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI
uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb
N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF
xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09
h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf
wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin
AUg2VWHKPmW9
=sq2p
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"Long ago we set out to remove the kitchen sink on kernel/sysctl.c
arrays and placings sysctls to their own sybsystem or file to help
avoid merge conflicts. Matthew Wilcox pointed out though that if we're
going to do that we might as well also *save* space while at it and
try to remove the extra last sysctl entry added at the end of each
array, a sentintel, instead of bloating the kernel by adding a new
sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves
of kernel/sysctl.c arrays and measuring the impact on size by each new
move.
The complex part of the effort to help reduce the size of each sysctl
is being done by the patient work of el señor Don Joel Granados. A lot
of this is truly painful code refactoring and testing and then trying
to measure the savings of each move and removing the sentinels.
Although Joel already has code which does most of this work,
experience with sysctl moves in the past shows is we need to be
careful due to the slew of odd build failures that are possible due to
the amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major
housekeeping needed to remove the sentinels, which is part of this
merge request. The rest of the work to actually remove the sentinels
will be done later in future kernel releases.
The preliminary math is showing this will all help reduce the overall
build time size of the kernel and run time memory consumed by the
kernel by about ~64 bytes per array where we are able to remove each
sentinel in the future. That also means there is no more bloating the
kernel with the extra ~64 bytes per array moved as no new sentinels
are created"
* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: Use ctl_table_size as stopping criteria for list macro
sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
vrf: Update to register_net_sysctl_sz
networking: Update to register_net_sysctl_sz
netfilter: Update to register_net_sysctl_sz
ax.25: Update to register_net_sysctl_sz
sysctl: Add size to register_net_sysctl function
sysctl: Add size arg to __register_sysctl_init
sysctl: Add size to register_sysctl
sysctl: Add a size arg to __register_sysctl_table
sysctl: Add size argument to init_header
sysctl: Add ctl_table_size to ctl_table_header
sysctl: Use ctl_table_header in list_for_each_table_entry
sysctl: Prefer ctl_table_header in proc_sysctl
Summary of the changes worth highlighting from most interesting to boring below:
* Christoph Hellwig's symbol_get() fix to Nvidia's efforts to circumvent the
protection he put in place in year 2020 to prevent proprietary modules from
using GPL only symbols, and also ensuring proprietary modules which export
symbols grandfather their taint. That was done through year 2020 commit
262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE"). Christoph's new
fix is done by clarifing __symbol_get() was only ever intended to prevent
module reference loops by Linux kernel modules and so making it only find
symbols exported via EXPORT_SYMBOL_GPL(). The circumvention tactic used
by Nvidia was to use symbol_get() to purposely swift through proprietary
module symbols and completley bypass our traditional EXPORT_SYMBOL*()
annotations and community agreed upon restrictions.
A small set of preamble patches fix up a few symbols which just needed
adjusting for this on two modules, the rtc ds1685 and the networking enetc
module. Two other modules just needed some build fixing and removal of use
of __symbol_get() as they can't ever be modular, as was done by Arnd on
the ARM pxa module and Christoph did on the mmc au1xmmc driver.
This is a good reminder to us that symbol_get() is just a hack to address
things which should be fixed through Kconfig at build time as was done in
the later patches, and so ultimately it should just go.
* Extremely late minor fix for old module layout 055f23b74b ("module: check
for exit sections in layout_sections() instead of module_init_section()") by
James Morse for arm64. Note that this layout thing is old, it is *not*
Song Liu's commit ac3b432839 ("module: replace module_layout with
module_memory"). The issue however is very odd to run into and so there was
no hurry to get this in fast.
* Although the fix did not go through the modules tree I'd like to highlight
the fix by Peter Zijlstra in commit 5409730962 ("x86/static_call: Fix
__static_call_fixup()") now merged in your tree which came out of what
was originally suspected to be a fallout of the the newer module layout
changes by Song Liu commit ac3b432839 ("module: replace module_layout
with module_memory") instead of module_init_section()"). Thanks to the report
by Christian Bricart and the debugging by Song Liu & Peter that turned to
be noted as a kernel regression in place since v5.19 through commit
ee88d363d1 ("x86,static_call: Use alternative RET encoding").
I highlight this to reflect and clarify that we haven't seen more fallout
from ac3b432839 ("module: replace module_layout with module_memory").
* RISC-V toolchain got mapping symbol support which prefix symbols with "$"
to help with alignment considerations for disassembly. This is used to
differentiate between incompatible instruction encodings when disassembling.
RISC-V just matches what ARM/AARCH64 did for alignment considerations and
Palmer Dabbelt extended is_mapping_symbol() to accept these symbols for
RISC-V. We already had support for this for all architectures but it also
checked for the second character, the RISC-V check Dabbelt added was just
for the "$". After a bit of testing and fallout on linux-next and based on
feedback from Masahiro Yamada it was decided to simplify the check and treat
the first char "$" as unique for all architectures, and so we no make
is_mapping_symbol() for all archs if the symbol starts with "$".
The most relevant commit for this for RISC-V on binutils was:
https://sourceware.org/pipermail/binutils/2021-July/117350.html
* A late fix by Andrea Righi (today) to make module zstd decompression use
vmalloc() instead of kmalloc() to account for large compressed modules. I
suspect we'll see similar things for other decompression algorithms soon.
* samples/hw_breakpoint minor fixes by Rong Tao, Arnd Bergmann and Chen Jiahao
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuShISHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoin7rEQAIt9cGmkHyA6Po/Ex8DejWvSTTOQzIXk
NvtGurODghWnCejZ7Yofo1T48mvgHOenDQB9qNSkVtKDyhmWCbss6wQU/5M8Mc3A
G+9svkQ8H1BRzTwX3WJKF9KNMhI0HA0CXz3ED/I4iX/Q4Ffv3bgbAiitY6r48lJV
PSKPzwH9QMIti6k3j+bFf2SwWCV3X2jz+btdxwY34dVFyggdYgaBNKEdrumCx4nL
g0tQQxI8QgltOnwlfOPLEhdSU1yWyIWZtqtki6xksLziwTreRaw1HotgXQDpnt/S
iJY9xiKN1ChtVSprQlbTb9yhFbCEGvOYGEaKl/ZsGENQjKzRWsQ+dtT8Ww6n2Y1H
aJXwniv6SqCW7dCwdKo4sE7JFYDP56yFYKBLOPSPbMm6DJwTMbzLUf7TGNh6NKyl
3pqjGagJ+LTj3l9w5ur4zTrDGAmLzMpNR03+6niTM7C3TPOI1+wh5zGbvtoA/WdA
ytQeOTiUsi0uyVgk50f67IC6virrxwupeyZQlYFGNuEGBClgXzzzgw/MKwg0VMvc
aWhFPUOLx8/8juJ3A5qiOT+znQJ2DTqWlT+QkQ8R5qFVXEW1g9IOnhaHqDX+KB0A
OPlZ9xwss2U0Zd1XhourtqhUhvcODWNzTj3oPzjdrGiBjdENz8hPKP+7HV1CG6xy
RdxpSwu72kFu
=IQy2
-----END PGP SIGNATURE-----
Merge tag 'modules-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules updates from Luis Chamberlain:
"Summary of the changes worth highlighting from most interesting to
boring below:
- Christoph Hellwig's symbol_get() fix to Nvidia's efforts to
circumvent the protection he put in place in year 2020 to prevent
proprietary modules from using GPL only symbols, and also ensuring
proprietary modules which export symbols grandfather their taint.
That was done through year 2020 commit 262e6ae708 ("modules:
inherit TAINT_PROPRIETARY_MODULE"). Christoph's new fix is done by
clarifing __symbol_get() was only ever intended to prevent module
reference loops by Linux kernel modules and so making it only find
symbols exported via EXPORT_SYMBOL_GPL(). The circumvention tactic
used by Nvidia was to use symbol_get() to purposely swift through
proprietary module symbols and completely bypass our traditional
EXPORT_SYMBOL*() annotations and community agreed upon
restrictions.
A small set of preamble patches fix up a few symbols which just
needed adjusting for this on two modules, the rtc ds1685 and the
networking enetc module. Two other modules just needed some build
fixing and removal of use of __symbol_get() as they can't ever be
modular, as was done by Arnd on the ARM pxa module and Christoph
did on the mmc au1xmmc driver.
This is a good reminder to us that symbol_get() is just a hack to
address things which should be fixed through Kconfig at build time
as was done in the later patches, and so ultimately it should just
go.
- Extremely late minor fix for old module layout 055f23b74b
("module: check for exit sections in layout_sections() instead of
module_init_section()") by James Morse for arm64. Note that this
layout thing is old, it is *not* Song Liu's commit ac3b432839
("module: replace module_layout with module_memory"). The issue
however is very odd to run into and so there was no hurry to get
this in fast.
- Although the fix did not go through the modules tree I'd like to
highlight the fix by Peter Zijlstra in commit 5409730962
("x86/static_call: Fix __static_call_fixup()") now merged in your
tree which came out of what was originally suspected to be a
fallout of the the newer module layout changes by Song Liu commit
ac3b432839 ("module: replace module_layout with module_memory")
instead of module_init_section()"). Thanks to the report by
Christian Bricart and the debugging by Song Liu & Peter that turned
to be noted as a kernel regression in place since v5.19 through
commit ee88d363d1 ("x86,static_call: Use alternative RET
encoding").
I highlight this to reflect and clarify that we haven't seen more
fallout from ac3b432839 ("module: replace module_layout with
module_memory").
- RISC-V toolchain got mapping symbol support which prefix symbols
with "$" to help with alignment considerations for disassembly.
This is used to differentiate between incompatible instruction
encodings when disassembling. RISC-V just matches what ARM/AARCH64
did for alignment considerations and Palmer Dabbelt extended
is_mapping_symbol() to accept these symbols for RISC-V. We already
had support for this for all architectures but it also checked for
the second character, the RISC-V check Dabbelt added was just for
the "$". After a bit of testing and fallout on linux-next and based
on feedback from Masahiro Yamada it was decided to simplify the
check and treat the first char "$" as unique for all architectures,
and so we no make is_mapping_symbol() for all archs if the symbol
starts with "$".
The most relevant commit for this for RISC-V on binutils was:
https://sourceware.org/pipermail/binutils/2021-July/117350.html
- A late fix by Andrea Righi (today) to make module zstd
decompression use vmalloc() instead of kmalloc() to account for
large compressed modules. I suspect we'll see similar things for
other decompression algorithms soon.
- samples/hw_breakpoint minor fixes by Rong Tao, Arnd Bergmann and
Chen Jiahao"
* tag 'modules-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module/decompress: use vmalloc() for zstd decompression workspace
kallsyms: Add more debug output for selftest
ARM: module: Use module_init_layout_section() to spot init sections
arm64: module: Use module_init_layout_section() to spot init sections
module: Expose module_init_layout_section()
modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
rtc: ds1685: use EXPORT_SYMBOL_GPL for ds1685_rtc_poweroff
net: enetc: use EXPORT_SYMBOL_GPL for enetc_phc_index
mmc: au1xmmc: force non-modular build and remove symbol_get usage
ARM: pxa: remove use of symbol_get()
samples/hw_breakpoint: mark sample_hbp as static
samples/hw_breakpoint: fix building without module unloading
samples/hw_breakpoint: Fix kernel BUG 'invalid opcode: 0000'
modpost, kallsyms: Treat add '$'-prefixed symbols as mapping symbols
kernel: params: Remove unnecessary ‘0’ values from err
module: Ignore RISC-V mapping symbols too
("refactor Kconfig to consolidate KEXEC and CRASH options").
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h").
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands").
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions").
- Switch the handling of kdump from a udev scheme to in-kernel handling,
by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
un/plug").
- Many singleton patches to various parts of the tree
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
=WJQa
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options")
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h")
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands")
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions")
- Switch the handling of kdump from a udev scheme to in-kernel
handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
hot un/plug")
- Many singleton patches to various parts of the tree
* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
document while_each_thread(), change first_tid() to use for_each_thread()
drivers/char/mem.c: shrink character device's devlist[] array
x86/crash: optimize CPU changes
crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
crash: hotplug support for kexec_load()
x86/crash: add x86 crash hotplug support
crash: memory and CPU hotplug sysfs attributes
kexec: exclude elfcorehdr from the segment digest
crash: add generic infrastructure for crash hotplug support
crash: move a few code bits to setup support of crash hotplug
kstrtox: consistently use _tolower()
kill do_each_thread()
nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
scripts/bloat-o-meter: count weak symbol sizes
treewide: drop CONFIG_EMBEDDED
lockdep: fix static memory detection even more
lib/vsprintf: declare no_hash_pointers in sprintf.h
lib/vsprintf: split out sprintf() and friends
kernel/fork: stop playing lockless games for exe_file replacement
adfs: delete unused "union adfs_dirtail" definition
...
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
Core
----
- Increase size limits for to-be-sent skb frag allocations. This
allows tun, tap devices and packet sockets to better cope with large
writes operations.
- Store netdevs in an xarray, to simplify iterating over netdevs.
- Refactor nexthop selection for multipath routes.
- Improve sched class lifetime handling.
- Add backup nexthop ID support for bridge.
- Implement drop reasons support in openvswitch.
- Several data races annotations and fixes.
- Constify the sk parameter of routing functions.
- Prepend kernel version to netconsole message.
Protocols
---------
- Implement support for TCP probing the peer being under memory
pressure.
- Remove hard coded limitation on IPv6 specific info placement
inside the socket struct.
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
per socket scaling factor.
- Scaling-up the IPv6 expired route GC via a separated list of
expiring routes.
- In-kernel support for the TLS alert protocol.
- Better support for UDP reuseport with connected sockets.
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
header size.
- Get rid of additional ancillary per MPTCP connection struct socket.
- Implement support for BPF-based MPTCP packet schedulers.
- Format MPTCP subtests selftests results in TAP.
- Several new SMC 2.1 features including unique experimental options,
max connections per lgr negotiation, max links per lgr negotiation.
BPF
---
- Multi-buffer support in AF_XDP.
- Add multi uprobe BPF links for attaching multiple uprobes
and usdt probes, which is significantly faster and saves extra fds.
- Implement an fd-based tc BPF attach API (TCX) and BPF link support on
top of it.
- Add SO_REUSEPORT support for TC bpf_sk_assign.
- Support new instructions from cpu v4 to simplify the generated code and
feature completeness, for x86, arm64, riscv64.
- Support defragmenting IPv(4|6) packets in BPF.
- Teach verifier actual bounds of bpf_get_smp_processor_id()
and fix perf+libbpf issue related to custom section handling.
- Introduce bpf map element count and enable it for all program types.
- Add a BPF hook in sys_socket() to change the protocol ID
from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
- Add uprobe support for the bpf_get_func_ip helper.
- Check skb ownership against full socket.
- Support for up to 12 arguments in BPF trampoline.
- Extend link_info for kprobe_multi and perf_event links.
Netfilter
---------
- Speed-up process exit by aborting ruleset validation if a
fatal signal is pending.
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
Driver API
----------
- Page pool optimizations, to improve data locality and cache usage.
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
for raw ioctl() handling in drivers.
- Simplify genetlink dump operations (doit/dumpit) providing them
the common information already populated in struct genl_info.
- Extend and use the yaml devlink specs to [re]generate the split ops.
- Introduce devlink selective dumps, to allow SF filtering SF based on
handle and other attributes.
- Add yaml netlink spec for netlink-raw families, allow route, link and
address related queries via the ynl tool.
- Remove phylink legacy mode support.
- Support offload LED blinking to phy.
- Add devlink port function attributes for IPsec.
New hardware / drivers
----------------------
- Ethernet:
- Broadcom ASP 2.0 (72165) ethernet controller
- MediaTek MT7988 SoC
- Texas Instruments AM654 SoC
- Texas Instruments IEP driver
- Atheros qca8081 phy
- Marvell 88Q2110 phy
- NXP TJA1120 phy
- WiFi:
- MediaTek mt7981 support
- Can:
- Kvaser SmartFusion2 PCI Express devices
- Allwinner T113 controllers
- Texas Instruments tcan4552/4553 chips
- Bluetooth:
- Intel Gale Peak
- Qualcomm WCN3988 and WCN7850
- NXP AW693 and IW624
- Mediatek MT2925
Drivers
-------
- Ethernet NICs:
- nVidia/Mellanox:
- mlx5:
- support UDP encapsulation in packet offload mode
- IPsec packet offload support in eswitch mode
- improve aRFS observability by adding new set of counters
- extends MACsec offload support to cover RoCE traffic
- dynamic completion EQs
- mlx4:
- convert to use auxiliary bus instead of custom interface logic
- Intel
- ice:
- implement switchdev bridge offload, even for LAG interfaces
- implement SRIOV support for LAG interfaces
- igc:
- add support for multiple in-flight TX timestamps
- Broadcom:
- bnxt:
- use the unified RX page pool buffers for XDP and non-XDP
- use the NAPI skb allocation cache
- OcteonTX2:
- support Round Robin scheduling HTB offload
- TC flower offload support for SPI field
- Freescale:
- add XDP_TX feature support
- AMD:
- ionic: add support for PCI FLR event
- sfc:
- basic conntrack offload
- introduce eth, ipv4 and ipv6 pedit offloads
- ST Microelectronics:
- stmmac: maximze PTP timestamping resolution
- Virtual NICs:
- Microsoft vNIC:
- batch ringing RX queue doorbell on receiving packets
- add page pool for RX buffers
- Virtio vNIC:
- add per queue interrupt coalescing support
- Google vNIC:
- add queue-page-list mode support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add port range matching tc-flower offload
- permit enslavement to netdevices with uppers
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- convert to phylink_pcs
- Renesas:
- r8A779fx: add speed change support
- rzn1: enables vlan support
- Ethernet PHYs:
- convert mv88e6xxx to phylink_pcs
- WiFi:
- Qualcomm Wi-Fi 7 (ath12k):
- extremely High Throughput (EHT) PHY support
- RealTek (rtl8xxxu):
- enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
RTL8192EU and RTL8723BU
- RealTek (rtw89):
- Introduce Time Averaged SAR (TAS) support
- Connector:
- support for event filtering
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
/QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
/daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
DCW8VGFOZjZM
=9CsR
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Increase size limits for to-be-sent skb frag allocations. This
allows tun, tap devices and packet sockets to better cope with
large writes operations
- Store netdevs in an xarray, to simplify iterating over netdevs
- Refactor nexthop selection for multipath routes
- Improve sched class lifetime handling
- Add backup nexthop ID support for bridge
- Implement drop reasons support in openvswitch
- Several data races annotations and fixes
- Constify the sk parameter of routing functions
- Prepend kernel version to netconsole message
Protocols:
- Implement support for TCP probing the peer being under memory
pressure
- Remove hard coded limitation on IPv6 specific info placement inside
the socket struct
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
socket scaling factor
- Scaling-up the IPv6 expired route GC via a separated list of
expiring routes
- In-kernel support for the TLS alert protocol
- Better support for UDP reuseport with connected sockets
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
header size
- Get rid of additional ancillary per MPTCP connection struct socket
- Implement support for BPF-based MPTCP packet schedulers
- Format MPTCP subtests selftests results in TAP
- Several new SMC 2.1 features including unique experimental options,
max connections per lgr negotiation, max links per lgr negotiation
BPF:
- Multi-buffer support in AF_XDP
- Add multi uprobe BPF links for attaching multiple uprobes and usdt
probes, which is significantly faster and saves extra fds
- Implement an fd-based tc BPF attach API (TCX) and BPF link support
on top of it
- Add SO_REUSEPORT support for TC bpf_sk_assign
- Support new instructions from cpu v4 to simplify the generated code
and feature completeness, for x86, arm64, riscv64
- Support defragmenting IPv(4|6) packets in BPF
- Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
perf+libbpf issue related to custom section handling
- Introduce bpf map element count and enable it for all program types
- Add a BPF hook in sys_socket() to change the protocol ID from
IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress
- Add uprobe support for the bpf_get_func_ip helper
- Check skb ownership against full socket
- Support for up to 12 arguments in BPF trampoline
- Extend link_info for kprobe_multi and perf_event links
Netfilter:
- Speed-up process exit by aborting ruleset validation if a fatal
signal is pending
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types
Driver API:
- Page pool optimizations, to improve data locality and cache usage
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
need for raw ioctl() handling in drivers
- Simplify genetlink dump operations (doit/dumpit) providing them the
common information already populated in struct genl_info
- Extend and use the yaml devlink specs to [re]generate the split ops
- Introduce devlink selective dumps, to allow SF filtering SF based
on handle and other attributes
- Add yaml netlink spec for netlink-raw families, allow route, link
and address related queries via the ynl tool
- Remove phylink legacy mode support
- Support offload LED blinking to phy
- Add devlink port function attributes for IPsec
New hardware / drivers:
- Ethernet:
- Broadcom ASP 2.0 (72165) ethernet controller
- MediaTek MT7988 SoC
- Texas Instruments AM654 SoC
- Texas Instruments IEP driver
- Atheros qca8081 phy
- Marvell 88Q2110 phy
- NXP TJA1120 phy
- WiFi:
- MediaTek mt7981 support
- Can:
- Kvaser SmartFusion2 PCI Express devices
- Allwinner T113 controllers
- Texas Instruments tcan4552/4553 chips
- Bluetooth:
- Intel Gale Peak
- Qualcomm WCN3988 and WCN7850
- NXP AW693 and IW624
- Mediatek MT2925
Drivers:
- Ethernet NICs:
- nVidia/Mellanox:
- mlx5:
- support UDP encapsulation in packet offload mode
- IPsec packet offload support in eswitch mode
- improve aRFS observability by adding new set of counters
- extends MACsec offload support to cover RoCE traffic
- dynamic completion EQs
- mlx4:
- convert to use auxiliary bus instead of custom interface
logic
- Intel
- ice:
- implement switchdev bridge offload, even for LAG
interfaces
- implement SRIOV support for LAG interfaces
- igc:
- add support for multiple in-flight TX timestamps
- Broadcom:
- bnxt:
- use the unified RX page pool buffers for XDP and non-XDP
- use the NAPI skb allocation cache
- OcteonTX2:
- support Round Robin scheduling HTB offload
- TC flower offload support for SPI field
- Freescale:
- add XDP_TX feature support
- AMD:
- ionic: add support for PCI FLR event
- sfc:
- basic conntrack offload
- introduce eth, ipv4 and ipv6 pedit offloads
- ST Microelectronics:
- stmmac: maximze PTP timestamping resolution
- Virtual NICs:
- Microsoft vNIC:
- batch ringing RX queue doorbell on receiving packets
- add page pool for RX buffers
- Virtio vNIC:
- add per queue interrupt coalescing support
- Google vNIC:
- add queue-page-list mode support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add port range matching tc-flower offload
- permit enslavement to netdevices with uppers
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- convert to phylink_pcs
- Renesas:
- r8A779fx: add speed change support
- rzn1: enables vlan support
- Ethernet PHYs:
- convert mv88e6xxx to phylink_pcs
- WiFi:
- Qualcomm Wi-Fi 7 (ath12k):
- extremely High Throughput (EHT) PHY support
- RealTek (rtl8xxxu):
- enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
RTL8192EU and RTL8723BU
- RealTek (rtw89):
- Introduce Time Averaged SAR (TAS) support
- Connector:
- support for event filtering"
* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
net: stmmac: clarify difference between "interface" and "phy_interface"
r8152: add vendor/device ID pair for D-Link DUB-E250
devlink: move devlink_notify_register/unregister() to dev.c
devlink: move small_ops definition into netlink.c
devlink: move tracepoint definitions into core.c
devlink: push linecard related code into separate file
devlink: push rate related code into separate file
devlink: push trap related code into separate file
devlink: use tracepoint_enabled() helper
devlink: push region related code into separate file
devlink: push param related code into separate file
devlink: push resource related code into separate file
devlink: push dpipe related code into separate file
devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
devlink: push shared buffer related code into separate file
devlink: push port related code into separate file
devlink: push object register/unregister notifications into separate helpers
inet: fix IP_TRANSPARENT error handling
...
This kunit update for Linux 6.6.rc1 consists of:
-- Adds support for running Rust documentation tests as KUnit tests
-- Makes init, str, sync, types doctests compilable/testable
-- Adds support for attributes API which include speed, modules
attributes, ability to filter and report attributes.
-- Adds support for marking tests slow using attributes API.
-- Adds attributes API documentation
-- Fixes to wild-memory-access bug in kunit_filter_suites() and
a possible memory leak in kunit_filter_suites()
-- Adds support for counting number of test suites in a module, list
action to kunit test modules, and test filtering on module tests.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmTsxL8ACgkQCwJExA0N
Qxwt6BAA5FgF7nUeGRZCnot4MQCNGRThxsns2k3CKjM1Iokp8tstTDoNHXzk2veS
WlRYOHFQqQOVTVRP+laXyjjMMHnlnhFxqbv93UKsen4JIUJDLFLq9x+0i+0bZh97
N1rE5cKUnqjAOL6MIJuomW9IzEIrbMcqdljm6SOCZp90NLvq1+I4pDGLgx2bxcow
Y/7dkx+dnlEsoACZ19CL1L2TaR21GpKdpOudpHNCShsbE0aOAlyHAVcmH64FTqCy
Z1LtrA0odS71q0yxDVCk5X3cIkeVfGBMz6aMZBRzS9k5jU4H1EN1eG1rGdGErIe5
YduwX3KMikYJB2stT64T1vgldIpT/emxqkBigmxQ37g3Flgopz4bI1snMBry+nKb
ViD/WQNjsf2iL8MooCgYBzH7yjmX6lXXQTZXROogBj4lP2/0gHiQVZyXZEAjtoO3
uNzUbfHQGnvtTphBHV4nNGaO+7kU9Y/oX8TYFcSYJQzcH5UVx16uBwevZjT1bii/
q89bRAQLnJpzkR93SGpnmsRgoDcYJSYsEA1o/f9Eqq8j3guOS2idpJvkheXq8+A2
MqTSOCJHENKZ3v0UGKlvZUPStaMaqN58z/VjlWug5EaB83LLfPcXJrGjz/EHk967
hYDHcwPoamTegr1zlg3ckOLiWEhga2tv6aHPkshkcFphpnhRU/c=
=Nsb8
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit updates from Shuah Khan:
- add support for running Rust documentation tests as KUnit tests
- make init, str, sync, types doctests compilable/testable
- add support for attributes API which include speed, modules
attributes, ability to filter and report attributes
- add support for marking tests slow using attributes API
- add attributes API documentation
- fix a wild-memory-access bug in kunit_filter_suites() and a possible
memory leak in kunit_filter_suites()
- add support for counting number of test suites in a module, list
action to kunit test modules, and test filtering on module tests
* tag 'linux-kselftest-kunit-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (25 commits)
kunit: fix struct kunit_attr header
kunit: replace KUNIT_TRIGGER_STATIC_STUB maro with KUNIT_STATIC_STUB_REDIRECT
kunit: Allow kunit test modules to use test filtering
kunit: Make 'list' action available to kunit test modules
kunit: Report the count of test suites in a module
kunit: fix uninitialized variables bug in attributes filtering
kunit: fix possible memory leak in kunit_filter_suites()
kunit: fix wild-memory-access bug in kunit_filter_suites()
kunit: Add documentation of KUnit test attributes
kunit: add tests for filtering attributes
kunit: time: Mark test as slow using test attributes
kunit: memcpy: Mark tests as slow using test attributes
kunit: tool: Add command line interface to filter and report attributes
kunit: Add ability to filter attributes
kunit: Add module attribute
kunit: Add speed attribute
kunit: Add test attributes API structure
MAINTAINERS: add Rust KUnit files to the KUnit entry
rust: support running Rust documentation tests as KUnit ones
rust: types: make doctests compilable/testable
...
- Rework the menu and teo cpuidle governors to avoid calling
tick_nohz_get_sleep_length(), which is likely to become quite
expensive going forward, too often and improve making decisions
regarding whether or not to stop the scheduler tick in the teo
governor (Rafael Wysocki).
- Improve the performance of cpufreq_stats_create_table() in some
cases (Liao Chang).
- Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal).
- Use clamp() helper macro to improve the code readability in
cpufreq_verify_within_limits() (Liao Chang).
- Set stale CPU frequency to minimum in intel_pstate (Doug Smythies).
- Migrate cpufreq drivers for various platforms to use void remove
callback (Yangtao Li).
- Add online/offline/exit hooks for Tegra driver (Sumit Gupta).
- Explicitly include correct DT includes in cpufreq (Rob Herring).
- Frequency domain updates for qcom-hw driver (Neil Armstrong).
- Modify AMD pstate driver return the highest_perf value (Meng Li).
- Generic cleanups for cppc, mediatek and powernow driver (Liao Chang,
Konrad Dybcio).
- Add more platforms to cpufreq-arm driver's blocklist (AngeloGioacchino
Del Regno and Konrad Dybcio).
- brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva).
- Add device PM helpers to allow a device to remain powered-on during
system-wide transitions (Ulf Hansson).
- Rework hibernation memory snapshotting to avoid storing pages filled
with zeros in hibernation image files (Brian Geffon).
- Add check to make sure that CPU latency QoS constraints do not use
negative values (Clive Lin).
- Optimize rp->domains memory allocation in the Intel RAPL power
capping driver (xiongxin).
- Remove recursion while parsing zones in the arm_scmi power capping
driver (Cristian Marussi).
- Fix memory leak in devfreq_dev_release() (Boris Brezillon).
- Rewrite devfreq_monitor_start() kerneldoc comment (Manivannan
Sadhasivam).
- Explicitly include correct DT includes in devfreq (Rob Herring).
- Remove unsued pm_runtime_update_max_time_suspended() extern
declaration (YueHaibing).
- Add turbo-boost support to cpupower (Wyes Karny).
- Add support for amd_pstate mode change to cpupower (Wyes Karny).
- Fix 'cpupower idle_set' command to accept only numeric values of
arguments (Likhitha Korrapati).
- Clean up OPP code and add new frequency related APIs to it (Viresh
Kumar, Manivannan Sadhasivam).
- Convert ti cpufreq/opp bindings to json schema (Nishanth Menon).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTslI4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxLMYP/3v0DxA3HZSZ/Xg63P9ylnln084cDt+/
qpJZ0CJUd6+MkoeuCYq/5udNwPSREsfx+pIEJy+h/iCiQlQz3NzriR7/dgPV0Ud0
t7k95lyZo+u51MNxk4SEqRMVTyYaNgDPvGbLyWFpLnne3CsxYzfH5xr77yHf342W
jHii1vJLXiXPnQWDlahf8tUpdQ0MQFmEwx0WkJp81NaAFyXDi0fPrB4YZaZrr6AQ
3TNaxTxZSirVSn19m5RPPAQhEfK8Dk4jF8wVPWsuL9F6v+9wERD9zcaxUPf3CD36
aj+SqKLCkOfkJHk45PCIYbS2wQ04fT/yWE9Rzm4iSr+fWA/q7vA0jXsaAgcv1Bm7
k6QyAy2ffLZTUFObX5bevIPvxZTzunLh0iglHx0WZKS/nn/9Jwpt6UMrpOsjiw/J
GLKEww+ZiKXj980GfvV2QUZG/XmsrvML/1L+qiDxNB2IPTxxuOxrWQ+cM7oxUTPM
pdIPIdwkm5ICVRVcAfNw/fr30s2yp1K304VWgzbKdK9b1aVhUSkxZGI8KHFODOHO
4Crii2rk0r972kxuJmenKwEfmwr/rbAAstFVSM736jH9RUANaWsIeNvkurXMOd2f
mil9DViTAu0iY4cy5tgLiLHDH4tOQOOCntRVFJ1tSytMyCFlMvVM0dwrc0yh254Q
zcrNj8ERJSsC
=6BIh
-----END PGP SIGNATURE-----
Merge tag 'pm-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These rework cpuidle governors to call tick_nohz_get_sleep_length()
less often and fix one of them, rework hibernation to avoid storing
pages filled with zeros in hibernation images, switch over some
cpufreq drivers to use void remove callbacks, fix and clean up
multiple cpufreq drivers, fix the devfreq core, update the cpupower
utility and make other assorted improvements.
Specifics:
- Rework the menu and teo cpuidle governors to avoid calling
tick_nohz_get_sleep_length(), which is likely to become quite
expensive going forward, too often and improve making decisions
regarding whether or not to stop the scheduler tick in the teo
governor (Rafael Wysocki)
- Improve the performance of cpufreq_stats_create_table() in some
cases (Liao Chang)
- Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal)
- Use clamp() helper macro to improve the code readability in
cpufreq_verify_within_limits() (Liao Chang)
- Set stale CPU frequency to minimum in intel_pstate (Doug Smythies)
- Migrate cpufreq drivers for various platforms to use void remove
callback (Yangtao Li)
- Add online/offline/exit hooks for Tegra driver (Sumit Gupta)
- Explicitly include correct DT includes in cpufreq (Rob Herring)
- Frequency domain updates for qcom-hw driver (Neil Armstrong)
- Modify AMD pstate driver return the highest_perf value (Meng Li)
- Generic cleanups for cppc, mediatek and powernow driver (Liao
Chang, Konrad Dybcio)
- Add more platforms to cpufreq-arm driver's blocklist
(AngeloGioacchino Del Regno and Konrad Dybcio)
- brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva)
- Add device PM helpers to allow a device to remain powered-on during
system-wide transitions (Ulf Hansson)
- Rework hibernation memory snapshotting to avoid storing pages
filled with zeros in hibernation image files (Brian Geffon)
- Add check to make sure that CPU latency QoS constraints do not use
negative values (Clive Lin)
- Optimize rp->domains memory allocation in the Intel RAPL power
capping driver (xiongxin)
- Remove recursion while parsing zones in the arm_scmi power capping
driver (Cristian Marussi)
- Fix memory leak in devfreq_dev_release() (Boris Brezillon)
- Rewrite devfreq_monitor_start() kerneldoc comment (Manivannan
Sadhasivam)
- Explicitly include correct DT includes in devfreq (Rob Herring)
- Remove unsued pm_runtime_update_max_time_suspended() extern
declaration (YueHaibing)
- Add turbo-boost support to cpupower (Wyes Karny)
- Add support for amd_pstate mode change to cpupower (Wyes Karny)
- Fix 'cpupower idle_set' command to accept only numeric values of
arguments (Likhitha Korrapati)
- Clean up OPP code and add new frequency related APIs to it (Viresh
Kumar, Manivannan Sadhasivam)
- Convert ti cpufreq/opp bindings to json schema (Nishanth Menon)"
* tag 'pm-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
cpufreq: tegra194: remove opp table in exit hook
cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()
cpufreq: tegra194: add online/offline hooks
cpuidle: teo: Avoid unnecessary variable assignments
cpufreq: qcom-cpufreq-hw: add support for 4 freq domains
dt-bindings: cpufreq: qcom-hw: add a 4th frequency domain
cpufreq: amd-pstate-ut: Fix kernel panic when loading the driver
cpufreq: amd-pstate-ut: Remove module parameter access
cpufreq: Use clamp() helper macro to improve the code readability
PM: sleep: Add helpers to allow a device to remain powered-on
PM: QoS: Add check to make sure CPU latency is non-negative
PM: runtime: Remove unsued extern declaration of pm_runtime_update_max_time_suspended()
cpufreq: intel_pstate: set stale CPU frequency to minimum
cpufreq: stats: Improve the performance of cpufreq_stats_create_table()
dt-bindings: cpufreq: Convert ti-cpufreq to json schema
dt-bindings: opp: Convert ti-omap5-opp-supply to json schema
OPP: Fix argument name in doc comment
cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases
cpufreq: cppc: Set fie_disabled to FIE_DISABLED if fails to create kworker_fie
cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.
...
The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz
45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W
hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP
1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7
2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79
hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD
uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3
9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT
2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP
auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1
CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B
vfaPs1Dz46Q=
=kiAA
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Ingo Molnar:
"The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly"
[ This also effectively undoes the amd_check_microcode() microcode
declaration change I had done in my microcode loader merge in commit
42a7f6e3ff ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]").
I picked the declaration change by Arnd from this branch instead,
which put it in <asm/processor.h> instead of <asm/microcode.h> like I
had done in my merge resolution - Linus ]
* tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy()
x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy()
x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy()
x86/qspinlock-paravirt: Fix missing-prototype warning
x86/paravirt: Silence unused native_pv_lock_init() function warning
x86/alternative: Add a __alt_reloc_selftest() prototype
x86/purgatory: Include header for warn() declaration
x86/asm: Avoid unneeded __div64_32 function definition
Revert "sched/fair: Move unused stub functions to header"
x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP
x86/cpu: Fix amd_check_microcode() declaration
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler.
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only.
It completely reworks the base scheduler: placement, preemption,
picking -- everything.
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against
a fresh pick. Because the tree is now effectively an interval
tree, and the selection is no longer the 'leftmost' task,
over-scheduling is less of a problem. A lot of the CFS
heuristics are removed or replaced by more natural latency-space
parameters & constructs.
In terms of expected performance regressions: we'll and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases,
but in some cases it won't, especially in adversarial loads that
got lucky with the previous code, such as some variants of hackbench.
We are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations
of that process.
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again).
- Improve & fix bandwidth-scheduling on nohz systems.
- Improve bandwidth-throttling.
- Use lock guards to simplify and de-goto-ify control flow.
- Misc improvements, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDOgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iS4g//b9yewVW9OPxetKoN8zIJA0TjFYuuOVHK
BlCJi5dbzXeCTrtENI65BRA7kPbTQ3AjwLRQ2BallAZ4dJceK0RhlZJvcrMNsm4e
Adcpoch/FbqPKCrtAJQY04Ln1B244n/KyVifYett9220dMgTFQGJJYxrTc2G2+Kp
F44vdUHzRczIE+KeOgBild1CwfKv5Zn5xgaXgtuoPLZtWBE0C1fSSzbK/PTINcUx
bS4NVxK0CpOqSiNjnugV8KsYb71/0U6IgShBVjfHsrlBYigOH2NbVTH5xyjF8f83
WxiGstlhxj+N6Kv4L6FOJIAr2BIggH82j3FaPACmv4c8pzEoBBbvlAJkfinLEgbn
Povg3OF2t6uZ8NoHjeu3WxOjBsphbpkFz7H5nno1ibXSIR/JyUH5MdBPSx93QITB
QoUKQpr/L8zWauWDOEzSaJjEsZbl8rkcIVq5Bk0bR3qn2xkZsIeVte+vCEu3+tBc
b4JOZjq7AuPDqPnsBLvuyiFZ7zwsAfm+pOD5UF3/zbLjPn1N/7wTNQZ29zjc04jl
SifpCZGgF1KlG8m8wNTlSfVvq0ksppCzJt+C6VFuejZ191IGpirQHn4Vp0sluMhC
WRzXhb7v37Bq5JY10GMfeKb/jAiRs68kozhzqVPsBSAPS6I6jJssONgedq+LbQdC
tFsmE9n09do=
=XtCD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only. It
completely reworks the base scheduler: placement, preemption, picking
-- everything
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against a
fresh pick. Because the tree is now effectively an interval tree, and
the selection is no longer the 'leftmost' task, over-scheduling is
less of a problem. A lot of the CFS heuristics are removed or
replaced by more natural latency-space parameters & constructs
In terms of expected performance regressions: we will and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases, but
in some cases it won't, especially in adversarial loads that got
lucky with the previous code, such as some variants of hackbench. We
are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations of
that process
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again)
- Improve & fix bandwidth-scheduling on nohz systems
- Improve bandwidth-throttling
- Use lock guards to simplify and de-goto-ify control flow
- Misc improvements, cleanups and fixes
* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
sched/eevdf: Curb wakeup-preemption
sched: Simplify sched_core_cpu_{starting,deactivate}()
sched: Simplify try_steal_cookie()
sched: Simplify sched_tick_remote()
sched: Simplify sched_exec()
sched: Simplify ttwu()
sched: Simplify wake_up_if_idle()
sched: Simplify: migrate_swap_stop()
sched: Simplify sysctl_sched_uclamp_handler()
sched: Simplify get_nohz_timer_target()
sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
sched/rt: Fix sysctl_sched_rr_timeslice intial value
sched/fair: Block nohz tick_stop when cfs bandwidth in use
sched, cgroup: Restore meaning to hierarchical_quota
MAINTAINERS: Add Peter explicitly to the psi section
sched/psi: Select KERNFS as needed
sched/topology: Align group flags when removing degenerate domain
sched/fair: remove util_est boosting
sched/fair: Propagate enqueue flags into place_entity()
...
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on and
off. That's sufficient for x86 which usually has at max two threads
except for the Xeon PHI platform which has four threads per core.
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime, but
that lacks enforcement and the usability is awkward.
This update expands the sysfs interface and the core infrastructure to
accept numerical values so PowerPC can build SMT runtime control for
partial SMT enablement on top.
The core support has also been provided to the PowerPC maintainers who
added the PowerPC related changes on top.
- Minor cleanups and documentation updates.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsj4wTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaszEADKMd/6m7/Bq7RU2OJ+IXw8yfMEF9nS
6HPrFu71a4cDufb/G8UckQOvkwdTFWD7bZ0snJe2sBDFTOtzK/inYkgPZTxlm7si
JcJmFnHKUM7OTwNZb7Tv1bd9Csz4JhggAYUw6P8CqsCmhQ+p6ECemx3bHDlYiywm
5eW2yzI9EM4dbsHPwUOvjI0WazGvAf0esSDAS8JTnhBXbd8FAckbMV+xuRPcCUK+
dBqbqr+3Nf4/wcXTro/gZIc7sEATAHH6m7zHlLVBSyVPnBxre8NLz6KciW4SezyJ
GWFnDV03mmG2KxQ2ugwI8n6M3zDJQtfEJFwW/x4t2M5RK+ka2a6G6GtCLHYOXLWR
akIuBXtTAC57BgpqzBihGej9eiC1BJ1QMa9ZK+6WDXSZtMTFOLlbwdY2/qyfxpfw
LfepWb+UMtFy5YyW84S1O5/AqpOtKD2kPTqfDjvDxWIAigispU+qwAKxcMzMjtwz
aAlf2Z/iX0R9DkRzGD2gaFG5AUsRich8RtVO7u+WDwYSsi8ywrvryiPlZrDDBkSQ
sRzdoHeXNGVY/FgkbZmEyBj4udrypymkR6ivqn6C2OrysgznSiv5NC983uS6TfJX
cVqdUv6CNYYNiNu0x0Qf0MluYT2s5c1Fa4bjCBJL+KwORwjM3+TCN9RA1KtFrW2T
G3Ta1KqI6wRonA==
=JQRJ
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
"Updates for the CPU hotplug core:
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on
and off. That's sufficient for x86 which usually has at max two
threads except for the Xeon PHI platform which has four threads per
core
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime,
but that lacks enforcement and the usability is awkward
This update expands the sysfs interface and the core infrastructure
to accept numerical values so PowerPC can build SMT runtime control
for partial SMT enablement on top
The core support has also been provided to the PowerPC maintainers
who added the PowerPC related changes on top
- Minor cleanups and documentation updates"
* tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: core-api/cpuhotplug: Fix state names
cpu/hotplug: Remove unused function declaration cpu_set_state_online()
cpu/SMT: Fix cpu_smt_possible() comment
cpu/SMT: Allow enabling partial SMT states via sysfs
cpu/SMT: Create topology_smt_thread_allowed()
cpu/SMT: Remove topology_smt_supported()
cpu/SMT: Store the current/max number of threads
cpu/SMT: Move smt/control simple exit cases earlier
cpu/SMT: Move SMT prototypes into cpu_smt.h
cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
Core:
- Prevent a deadlock of nested interrupt threads vs.
synchronize_hard()
- Removal of a stale extern declaration
Drivers:
- The first new driver since v6.2 for Amlogic-C3 SoCs
- The usual small fixes, cleanups and improvements all over
the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsjR0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofmLEAC5anouyAUbGjl/cL//+2GkvWB2YgO2
+D7q8tx3Tt5US/vpiYaDvFs+at+2lAyB6M/KUkvYSCne9bm80+YqaL+73iM6YQKH
yNDrAnLR1FA4+fHIvvhmk23U1uUjWgSTL7iKufgNWf8I0aYsWLTIX3N6m0606ZLE
eUNIf7w+aZRr/axHdadRQpib6l1fvfA3C72urPRBnZDA56ZDAgE9tS0kfk9D+3sW
BgXRp4knvHBf6I4RdA10hHDTa1RuX9xkDeAC1a/ljWpbCEgEDPJ+5JI+TD+fU/d5
TCVGa7GwqJc2srRFwy76/t0jQrG7DnwW56SsMomjS+vjIu4exNFwXJ6LqZSJacwa
Z3HB0Py3awQWPfHdFqdF9LHyum+a58RHX96RenlL8Q/42qe5K6RmAIfcAaiy2OpL
xAGy9+nplMWh+qde9q1o30WPr08GhhDEXrdHZdAAODjBeoUDGmFooH5NHAFjw2+Q
ba15/f7Nl8KIl854OUJv4cftNEv5klpueLR/YUviivoO55vydRae/k/CSPhvt7TN
VIQ+vgiaiOCEwAAx2kP7Au0ADeEMCYiEqH9KWBp33dvjNZMt2DbAGLDWagcy8N9y
R8ms4c5e7Z2MvN9Z6YDihQ1XvkQsdX/dWwJq3weH3c/tP1MBFFHZYdeQhIVKTIKR
4zFKi4jrlmn0vQ==
=jiUK
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Boring updates for the interrupt subsystem:
Core:
- Prevent a deadlock of nested interrupt threads vs.
synchronize_hard()
- Removal of a stale extern declaration
Drivers:
- The first new driver since v6.2 for Amlogic-C3 SoCs
- The usual small fixes, cleanups and improvements all over the
place"
* tag 'irq-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Add support for Amlogic-C3 SoCs
dt-bindings: interrupt-controller: Add support for Amlogic-C3 SoCs
irqchip/irq-mvebu-sei: Use devm_platform_get_and_ioremap_resource()
irqchip/ls-scfg-msi: Use devm_platform_get_and_ioremap_resource()
irqchip: Explicitly include correct DT includes
irqchip/orion: Use of_address_count() helper
irqchip/irq-pruss-intc: Do not check for 0 return after calling platform_get_irq()
irqchip/imx-mu-msi: Do not check for 0 return after calling platform_get_irq()
irqchipr/i8259: Mark i8259_of_init() static
irqchip/mips-gic: Mark gic_irq_domain_free() static
irqchip/xtensa-pic: Include header for xtensa_pic_init_legacy()
irqchip/loongson-eiointc: Fix return value checking of eiointc_index
genirq: Remove unused extern declaration
genirq: Prevent nested thread vs synchronize_hardirq() deadlock
address limit check which is a leftover of the removed TIF_FSCHECK.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsi5ETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZRGEACDGWXaadhCYNpHftScC6X3z2d3DXZX
E5PAiVtq/sQX8FV2KJgMxywxQ4QEJjtX9lgoxGMKc7zai2FgKbWzPnlz6qz3przh
P5Ji+sBl0DbpVnLMaAAVzikvWpjFfSbf/b8lHA3EVs/9HkPfZY7rwByONsVkxX2e
6AS8tOv0CJP3lMaZ02tDs48PWOeF1CEpub9Eg5JfYG+CTU0gy+wFMnIUCkN/eP2E
CYNo6wTFjBQ43S7GWrqA6eYgbHLBBvOuHLHM3RlLOm2Rexct/umf84At9K9wUJvJ
mGSrZKsgD3UZJi7HpF5RXsY88+4uV38vhkN6LGRdHrarLz3WMvnc191WP7iwCBmo
HGIgWWxm9+bGAxiw9wTNgmERvwKBeMNNQEDu/58An637VDucrYOlRi2Mh0CE5QiG
i1R+KiKBUZw3Blogx+O65m0PyXpJQqHfr2WkfT+uKJCs7wRBdupmWv+ZAcSj6tys
ILqCHRmI4n46T2qp67/M6FbYTrk0DNWsjgjtUgLBquEsj6z00favxAug5NrJV6+c
5/kf7C97h1TmtqqNjtL4uwfWGm2bqc6AZyMpsk0KqnirywmnkgIKOWHu//TwQVJs
jpwRvsAv3UNnUrO6qtqNzbNDQQ0MOLAAuDgardGWW7gEEhvaa+HdbwyjZSwDZvZy
b8PLikU7gRB9rQ==
=W9kd
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry code update from Thomas Gleixner:
"A single update to the core entry code, which removes the empty user
address limit check which is a leftover of the removed TIF_FSCHECK"
* tag 'core-entry-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Remove empty addr_limit_user_check()
This pull reqeust contains the following:
o Handle negative skews in "skew is too large" messages.
o Extend watchdog check exemption to 4-Socket platforms
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcF94THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jPeHEACXvYFMwno+1DPlt2PtFJglnkuYyOUQ
SwRRiOLWXdzsikzjWIvdrSOqslbdIjhZmSSyPDjyo4GPYFSSOyKNsqczwsX8R49u
O/2Yzkrx2OmIWcKiAkII8Iw4fFedoITtzC59wbZHoo/+upEpbZYP3u7AjJTird7y
du3WcWcGc1eEt5+7MNwbZfwpzo2t5Rb3Wqfgs6vnKTG7Abc/23uChsCBzPavX7X/
djNd1bA5YmEldKKxSoF5XSW/F1TWIA4fXMDkBwgRKHBx1Y7xU+nJMtam4ogAzN6a
4zgthgy5wQ7/VnTBv2rmQQb6ae3Blm09Yg1ac8zt95RLgmGkyX73lZGnRBTdYqyh
kb47Tfw3a+e+VPTD+W3rY1NOSOwbstdDHVckK+0bFvqNyXOoaEJL+EEOhm9rPXxv
le+T6Ct1VPAF9lHPUz7lVCVXN91vP4Gqrxjmeq5rqWNOvRW3jBcCLnEpFzWJtu50
JjQBi3HA0HW+Bxqov22W0llFAa0gVm8xyxXfNSSL7VoCinnS6/qyQvD1GoG0brk3
l5orOk38m2/acvTyvw2tnvAAuqmOr+oGlcQhJOOVl7jDz+sae6RMTwWCMaVxKUDo
YW7v5YFQePmzC9J4M5XFMdCCTb3cUCjVLMdsPgONM3kn9ALEDhhTBSw76N//bGLg
4/OEsT7Aq6LHkA==
=6/HI
-----END PGP SIGNATURE-----
Merge tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull clocksource watchdog updates from Paul McKenney:
- Handle negative skews in "skew is too large" messages
- Extend watchdog check exemption to 4-Socket platforms
* tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
x86/tsc: Extend watchdog check exemption to 4-Sockets platform
clocksource: Handle negative skews in "skew is too large" messages
This series reduces the number of stack traces dumped during CSD-lock
debugging. This helps to avoid console overrun on systems with large
numbers of CPUs.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmSxzCITHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jK3wD/9bNte4pbnG5CmJvu+bh1idbD+ZIyaU
13mTg1J35XurW/QlmbOKoNcdylvmE4QdzUxURXgv4FHO7wdiARBGgrz9ArKjgq+e
NfklTr4EY9UbRb27tO1iJDUiP6ZZB3fw+gYs2zmJMkn5CqK+rkUrTdcMa8EFHvlT
vf6OL8xeFjsrTCWfYTAYJU1Yp+0UOiO+BRwzq4u76Wzpex79EiMEE2lLeRZfXhz9
mF704EXn7VEkfRo50GlGOjVkezghlItXlaUCV2eQ4T6/LwXgreStCTKfhrDA5Qs2
mAQ5OMZJztlbUWcVrEPZMpQ6pXWaJx5qoMZ1uP8Obec89ocr+/DuL1Myyaau9g+H
rCYA9Om4XfAd2JURrxOIlKQ7SmvRJNZWpv0DHizTfWpSTOumtON2RyVlC2EYwx9Q
2ZL4Eo99VzYcAXWx8KiGpF6CtW67VXKZsHwTtJegu4Vkjk9wOt5Sa9svhiHv0Kz4
veYE9XuOH5+tIfN6tP4eikI+4VJOVhudsOKiXCjhoscy+1/gtXRH5WgYwvSiopWo
nEsj05V7U0hWdsPpu7niZ982vAU1eHC7EeQt+pc5f17NeNr51xG3Oj3xyy14yzFC
TbEyOft0MEsJ8NkC93FCbNqere7dk6k+bxVqvoWQ5tDfsEhIaVn+HzVhPdsluvfP
1JixcSuqZ42RqA==
=t4s8
-----END PGP SIGNATURE-----
Merge tag 'csd-lock.2023.07.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull CSD lock updates from Paul McKenney:
"This series reduces the number of stack traces dumped during CSD-lock
debugging. This helps to avoid console overrun on systems with large
numbers of CPUs"
* tag 'csd-lock.2023.07.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
smp: Reduce NMI traffic from CSD waiters to CSD destination
smp: Reduce logging due to dump_stack of CSD waiters
This pull request prevents some memory-exhaustion false-postitive failures
in scftorture testing.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcFWYTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jBPhD/9jqgTgPuF3bmRVpkggIXHN0wCTihS9
BNQPaUdLRmTymtZAecaPOdRvPPMUvqjOK5dS8sx7rnoyU+qr33mUkRzSFCIrsGHM
62FowQ4grokOkQnJYUpVuLhitYwwmWi7aKi5T2Xolc4ooSIpWZe/NPoiteGkm4lc
nuA84DcV51rRykjBjW3LIrffoi9fu3lU65FsAjQttG7OZwWmAjhhHl29loCPlG3F
+Ui+0p+cp8WAB/2J0B/6aHTqK6JJoV0t/gzKpzYvI/Gydz/7PaYjdBhPCSxHcsXd
LMf+OO5/LtGfw4kcYF/8O4Ir0t4F681iOXlz06op2P2OT90S0O31SGUWznKMVq3E
V307I9LnfT5Jo2aK9xD4ad8GM9rMKb9btc284QvaYAjCUD5RBoyA/S1d6e0u9rt3
oK7rJWIG9bzCbZ7R7xXCzpkCYw98npVeDxS9gdwWSCA0vBwmhF8BbVQODZ/u+YQ0
TQyTSankebeaoINeieto0ZAbK9iDSbsnTmKZ146hoLGFshDFN7qPOL4PggXPqw5B
CXILQH+SOjMO+JaIrd4iOr172REzp1/64K4szaheV4LxyEwC/QJBdxhajdpJOTOS
LowIG+LIIElr8dPIiiEIBVAaTehadgqA1+5zIcevt8OSMb7KOoB6FkXKj/9kWOfD
PwFfqskEYoY8xQ==
=8rLK
-----END PGP SIGNATURE-----
Merge tag 'scftorture.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull smp_call_function torture-test updates from Paul McKenney:
"This prevents some memory-exhaustion false-postitive failures in
scftorture testing"
* tag 'scftorture.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
scftorture: Add CONFIG_PREEMPT_DYNAMIC=n to NOPREEMPT scenario
scftorture: Pause testing after memory-allocation failure
scftorture: Forgive memory-allocation failure if KASAN
torture: Scale scftorture memory based on number of CPUs
doc.2023.07.14b: Documentation updates.
fixes.2023.08.16a: Miscellaneous fixes, perhaps most notably simplifying
SRCU_NOTIFIER_INIT() as suggested.
rcu-tasks.2023.07.24a: RCU Tasks updates, most notably treating
Tasks RCU callbacks as lazy while still treating synchronous
grace periods as urgent. Also fixes one bug that restores the
ability to apply debug-objects to RCU Tasks and another that
fixes a race condition that could result in false-positive
failures of the boot-time self-test code.
rcuscale.2023.07.14b: RCU-scalability performance-test updates,
most notably adding the ability to measure the RCU-Tasks's
grace-period kthread's CPU consumption. This proved
quite useful for the rcu-tasks.2023.07.24a work.
refscale.2023.07.14b: Reference-acquisition/release performance-test
updates, including a fix for an uninitialized wait_queue_head_t.
torture.2023.08.14a: Miscellaneous torture-test updates.
torturescripts.2023.07.20a: Torture-test scripting updates, including
removal of the non-longer-functional formal-verification scripts,
test builds of individual RCU Tasks flavors, better diagnostics
for loss of connectivity for distributed rcutorture tests,
disabling of reboot loops in qemu/KVM-based rcutorture testing,
and passing of init parameters to rcutorture's init program.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTjkssTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jITND/9zEqYNbeFrcBs/YaHdoAjsNgOt1IYN
csfF/KArVgdvmrwlV/nEaQMLaJcw9X7DVU5+7E2JbbDaB/2FSacseNyKk6mfgSVK
/0rnTOXpqI9/T1HiJObWZvDQFuKL12bfteXWGJg1sMt2JUGZ4nAWhdZ3xRjp2XkO
89qB5r0fF8gyGwvQ3M29ss8T9Oy0uUNJmDY/QyVxHM6dhkpSAezFffKzD7C4zkSV
WucRTpYJ7bs6otBGtVmwz3x60UAuLwcVfQyB+CTbnGLsps9yAYU+1DDVdm7olcr3
ARXMeboeodMvy9jWXhtbWRVAAob4lVUDXQN27kb4sBgroRQBfQXMuByRAU6s0VtX
frOl6rlbORuAetsC8wFL0IFVn4yTpvXKbYw7h1MXTs7gVVbl33O9FieGvWu0r79/
VR4Xw+JbmYWtyvFV8Zaq4iIEcOe+PeNH6u0bPx+htsHYd1+DUG2UY0MVmJQ3a4sb
ygejA6mguCk7KBzWab8wdDpgAfhNwg0T9a+LQYcaskuD5SSWjYqqg6i1ulqqqyiE
bOfRKDX4mWmAobWKHLssqUrjiLbxfygIaHjCrt7rWJKPIs1bK/WfWa4JbrE0NRwK
9IDd1lWc9C+zoUpjyZWSG3ahK5lWo2u4sPNoRtMQjowjobIz1cBhaEwmFe72bG7C
FCKb7Da2oUaLOw==
=EujZ
-----END PGP SIGNATURE-----
Merge tag 'rcu.2023.08.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes, perhaps most notably simplifying
SRCU_NOTIFIER_INIT() as suggested
- RCU Tasks updates, most notably treating Tasks RCU callbacks as lazy
while still treating synchronous grace periods as urgent. Also fixes
one bug that restores the ability to apply debug-objects to RCU Tasks
and another that fixes a race condition that could result in
false-positive failures of the boot-time self-test code
- RCU-scalability performance-test updates, most notably adding the
ability to measure the RCU-Tasks's grace-period kthread's CPU
consumption. This proved quite useful for the RCU Tasks work
- Reference-acquisition/release performance-test updates, including a
fix for an uninitialized wait_queue_head_t
- Miscellaneous torture-test updates
- Torture-test scripting updates, including removal of the
non-longer-functional formal-verification scripts, test builds of
individual RCU Tasks flavors, better diagnostics for loss of
connectivity for distributed rcutorture tests, disabling of reboot
loops in qemu/KVM-based rcutorture testing, and passing of init
parameters to rcutorture's init program
* tag 'rcu.2023.08.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (64 commits)
rcu: Use WRITE_ONCE() for assignments to ->next for rculist_nulls
rcu: Make the rcu_nocb_poll boot parameter usable via boot config
rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load
srcu,notifier: Remove #ifdefs in favor of SRCU Tiny srcu_usage
rcutorture: Stop right-shifting torture_random() return values
torture: Stop right-shifting torture_random() return values
torture: Move stutter_wait() timeouts to hrtimers
torture: Move torture_shuffle() timeouts to hrtimers
torture: Move torture_onoff() timeouts to hrtimers
torture: Make torture_hrtimeout_*() use TASK_IDLE
torture: Add lock_torture writer_fifo module parameter
torture: Add a kthread-creation callback to _torture_create_kthread()
rcu-tasks: Fix boot-time RCU tasks debug-only deadlock
rcu-tasks: Permit use of debug-objects with RCU Tasks flavors
checkpatch: Complain about unexpected uses of RCU Tasks Trace
torture: Cause mkinitrd.sh to indicate failure on compile errors
torture: Make init program dump command-line arguments
torture: Switch qemu from -nographic to -display none
torture: Add init-program support for loongarch
torture: Avoid torture-test reboot loops
...
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver).
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song).
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn).
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva).
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt).
- Convert nsproxy.count to refcount_t (Elena Reshetova).
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test.
- Fix build failure with gcc-plugins on GCC 14+.
- Fix selftests to respect SKIP for signal-delivery tests.
- Fix CFI warning for paravirt callback prototype.
- Clarify documentation for seq_show_option_n() usage.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs6ZAWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkpjD/9AeST5Imc2t0t71Qd+wPxW3jT3
kDZPlHH8wHmuxSpRscX82m21SozvEMvybo6Cp7FSH4qr863FnBWMlo8acr7rKxUf
0f7Y9qgY/hKADiVx5p0pbnCgcy+l4pwsxIqVCGuhjvNCbWHrdGqLM4UjIfaVz5Ws
+55a/C3S1KVwB1s1+6to43jtKqQAx6yrqYWOaT3wEfCzHC87f9PUHhIGnFQVwPGP
WpjQI/BQKpH7+MDCoJOPrZqXaE/4lWALxR6+5BBheGbvLoWifpJEYHX6bDUzkgBz
liQDkgr4eAw5EXSOS7mX3EApfeMKakznJt9Mcmn0h3pPRlM3ZSVD64Xrou2Brpje
exS2JRuh6HwIiXY9nTHc6YMGcAWG1syAR/hM2fQdujM0CWtBUk9+kkuYWsqF6nIK
3tOxYLB/Ph4p+tShd+v5R3mEmp/6snYRKJoUk+9Fk67i54NnK4huyxaCO4zui+ML
3vHuGp8KgFHUjJaYmYXHs3TRZnKSFUkPGc4MbpiGtmJ9zhfSwlhhF+yfBJCsvmTf
ZajA+sPupT4OjLxU6vUD/ZNkXAEjWzktyX2v9YBA7FHh7SqPtX9ARRIxh417AjEJ
tBPHhW/iRw9ftBIAKDmI7gPLynngd/zvjhvk6O5egHYjjgRM1/WAJZ4V26XR6+hf
TWfQb7VRzdZIqwOEUA==
=9ZWP
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As has become normal, changes are scattered around the tree (either
explicitly maintainer Acked or for trivial stuff that went ignored):
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver)
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song)
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn)
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva)
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt)
- Convert nsproxy.count to refcount_t (Elena Reshetova)
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test
- Fix build failure with gcc-plugins on GCC 14+
- Fix selftests to respect SKIP for signal-delivery tests
- Fix CFI warning for paravirt callback prototype
- Clarify documentation for seq_show_option_n() usage"
* tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
LoadPin: Annotate struct dm_verity_loadpin_trusted_root_digest with __counted_by
kallsyms: Change func signature for cleanup_symbol_name()
kallsyms: Fix kallsyms_selftest failure
nsproxy: Convert nsproxy.count to refcount_t
integrity: Annotate struct ima_rule_opt_list with __counted_by
lkdtm: Add FAM_BOUNDS test for __counted_by
Compiler Attributes: counted_by: Adjust name and identifier expansion
um: refactor deprecated strncpy to memcpy
um: vector: refactor deprecated strncpy
alpha: Replace one-element array with flexible-array member
hardening: Move BUG_ON_DATA_CORRUPTION to hardening options
list: Introduce CONFIG_LIST_HARDENED
list_debug: Introduce inline wrappers for debug checks
compiler_types: Introduce the Clang __preserve_most function attribute
gcc-plugins: Rename last_stmt() for GCC 14+
selftests/harness: Actually report SKIP for signal tests
x86/paravirt: Fix tlb_remove_table function callback prototype warning
EISA: Replace all non-returning strlcpy with strscpy
perf: Replace strlcpy with strscpy
um: Remove strlcpy declaration
...
- Provide USER_NOTIFY flag for synchronous mode (Andrei Vagin, Peter
Oskolkov). This touches the scheduler and perf but has been Acked by
Peter Zijlstra.
- Fix regression in syscall skipping and restart tracing on arm32.
This touches arch/arm/ but has been Acked by Arnd Bergmann.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs418WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJohpD/4tEfRdnb/KDgwQ7uvqBonUJXcx
wqw17LZCGTpBV3/Tp3+aEseD1NezOxiMJL88VyUHSy7nfDJShbL6QtyoenwEOeXJ
HmBUfcIH3cqRutHEJ3drYBzBetpeeK2G+gTYVj+JoEfPWyPf+Egj+1JE2n1xLi92
WC1miBAyBZ59kN+D1hcDzJu24CkAwbcUYlEzGejN5lBOwxYV3/fjARBVRvefOO5m
jljSCIVJOFgCiybKhJ7Zw1+lkFc3cIlcOgr4/ZegSc8PxFVebnuImTHHp/gvoo6F
7d1xe5Hk+PSfNvVq41MAeRB2vK2tY5efwjXRarThUaydPTO43KiQm0dzP0EYWK9a
LcOg8zAXZnpvuWU5O2SqUKADcxe2TjS1WuQ/Q4ixxgKz2kJKDwrNU8Frf327eLSR
acfZgMMiUfEXyXDV9B3LzNAtwdvwyxYrzEzxgKywhThIhZmQDat0rI2IaTV5QIc5
pkxiFEe0TPwpzyUVO9dSzE+ughTmNQOKk5uAM9e2NwRwVdhEmlZAxo0kStJ1NoaA
yDjYIKfaNBElchL4v2931KJFJseI+uRaWdW10JEV+1M69+gEAEs6wbmAxtcYS776
xWsYp3slXzlmeVyvQp/ah8p0y55r+qTbcnhkvIdiwLYei4Bh3KOoJUlVmW0V5dKq
b+7qspIvBA0kKRAqPw==
=DI8R
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
- Provide USER_NOTIFY flag for synchronous mode (Andrei Vagin, Peter
Oskolkov). This touches the scheduler and perf but has been Acked by
Peter Zijlstra.
- Fix regression in syscall skipping and restart tracing on arm32. This
touches arch/arm/ but has been Acked by Arnd Bergmann.
* tag 'seccomp-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Add missing kerndoc notations
ARM: ptrace: Restore syscall skipping for tracers
ARM: ptrace: Restore syscall restart tracing
selftests/seccomp: Handle arm32 corner cases better
perf/benchmark: add a new benchmark for seccom_unotify
selftest/seccomp: add a new test for the sync mode of seccomp_user_notify
seccomp: add the synchronous mode for seccomp_unotify
sched: add a few helpers to wake up tasks on the current cpu
sched: add WF_CURRENT_CPU and externise ttwu
seccomp: don't use semaphore and wait_queue together
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
conversion of the software based interrupt resend mechanism to hlist missed
to add a check whether the descriptor is already enqueued and dropped the
interrupt descriptor lookup for nested interrupts.
The missing check whether the descriptor is already queued causes hlist
corruption and can be observed in the wild. The dropped parent descriptor
lookup has not yet caused problems, but it would result in stale interrupt
line in the worst case.
Add the missing enqueued check and bring the descriptor lookup back to cure
this.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTqNLQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTeBD/0b8zbNhmO5TXhP6GrCPXahFM6aTmyK
NveZMzh1c7tQZzMBNNEnRoaYvmcgPOviZ1Yi3+/Hs3oaR/b6nLt36K8+MRC7J+15
j6cIylmpTp9eH5Na3IT1wmTNfCVAdoejoZVYq4PPHAHUrzqu7ESOTLzHbPmWS97E
VGdvUrKnQ7J4ajOZn7bXWaia+qCuIij87CYAKH++c9JVMIc0iTs2Zd7FG2sncgLm
OJdvjmMy/qN9a1jYdM4DrGOS8HBdvuYb9EEDuZB4NEY3nBR+svQqBHsD462LgxNe
+OTzLBVMoP9heKbyTU9357PUq2qz6OmpC0vE1n5XgkSEdrvm9x1UjYcPQnagRm25
JZp/pEI/ryD8oGQNWzsPe7PDyyKHV5F0Q1KPHGUvvEJxwF+USVe9Zm6damfZvGeA
dp34zYg0mFCH0hmqdYs6+cc8sJcEy8aR8FFUgI1Uj5nr9zZ3vV7WTsOjJ12NDFo/
L+oDKz6/sdL2X/EKddP3ffQrImPF8DdSYfEPmoukTMhihfgXewBlgvg3b9HekVVm
9j7UhqsQw/mdPcTpkM6cd5ngxB71X64gMjAfotwsproJg/EUw978CM++9sGKmKy8
jU7hlgZQ3DniSCyCpXB/7vZxAFej8TKTWmTc4KZYKiMfej2vqI3FjA3KLGY6GzK+
ls/Rm57EOhKZlw==
=Snax
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A last minute fix for a regression introduced in the v6.5 merge
window.
The conversion of the software based interrupt resend mechanism to
hlist missed to add a check whether the descriptor is already enqueued
and dropped the interrupt descriptor lookup for nested interrupts.
The missing check whether the descriptor is already queued causes
hlist corruption and can be observed in the wild. The dropped parent
descriptor lookup has not yet caused problems, but it would result in
stale interrupt line in the worst case.
Add the missing enqueued check and bring the descriptor lookup back to
cure this"
* tag 'irq-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix software resend lockup and nested resend
The switch to using hlist for managing software resend of interrupts
broke resend in at least two ways:
First, unconditionally adding interrupt descriptors to the resend list can
corrupt the list when the descriptor in question has already been
added. This causes the resend tasklet to loop indefinitely with interrupts
disabled as was recently reported with the Lenovo ThinkPad X13s after
threaded NAPI was disabled in the ath11k WiFi driver.
This bug is easily fixed by restoring the old semantics of irq_sw_resend()
so that it can be called also for descriptors that have already been marked
for resend.
Second, the offending commit also broke software resend of nested
interrupts by simply discarding the code that made sure that such
interrupts are retriggered using the parent interrupt.
Add back the corresponding code that adds the parent descriptor to the
resend list.
Fixes: bc06a9e087 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/
Link: https://lore.kernel.org/r/20230826154004.1417-1-johan+linaro@kernel.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZOjkTAAKCRDbK58LschI
gx32AP9gaaHFBtOYBfoenKTJfMgv1WhtQHIBas+WN9ItmBx9MAEA4gm/VyQ6oD7O
EBjJKJQ2CZ/QKw7cNacXw+l5jF7/+Q0=
=8P7g
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-08-25
We've added 87 non-merge commits during the last 8 day(s) which contain
a total of 104 files changed, 3719 insertions(+), 4212 deletions(-).
The main changes are:
1) Add multi uprobe BPF links for attaching multiple uprobes
and usdt probes, which is significantly faster and saves extra fds,
from Jiri Olsa.
2) Add support BPF cpu v4 instructions for arm64 JIT compiler,
from Xu Kuohai.
3) Add support BPF cpu v4 instructions for riscv64 JIT compiler,
from Pu Lehui.
4) Fix LWT BPF xmit hooks wrt their return values where propagating
the result from skb_do_redirect() would trigger a use-after-free,
from Yan Zhai.
5) Fix a BPF verifier issue related to bpf_kptr_xchg() with local kptr
where the map's value kptr type and locally allocated obj type
mismatch, from Yonghong Song.
6) Fix BPF verifier's check_func_arg_reg_off() function wrt graph
root/node which bypassed reg->off == 0 enforcement,
from Kumar Kartikeya Dwivedi.
7) Lift BPF verifier restriction in networking BPF programs to treat
comparison of packet pointers not as a pointer leak,
from Yafang Shao.
8) Remove unmaintained XDP BPF samples as they are maintained
in xdp-tools repository out of tree, from Toke Høiland-Jørgensen.
9) Batch of fixes for the tracing programs from BPF samples in order
to make them more libbpf-aware, from Daniel T. Lee.
10) Fix a libbpf signedness determination bug in the CO-RE relocation
handling logic, from Andrii Nakryiko.
11) Extend libbpf to support CO-RE kfunc relocations. Also follow-up
fixes for bpf_refcount shared ownership implementation,
both from Dave Marchevsky.
12) Add a new bpf_object__unpin() API function to libbpf,
from Daniel Xu.
13) Fix a memory leak in libbpf to also free btf_vmlinux
when the bpf_object gets closed, from Hao Luo.
14) Small error output improvements to test_bpf module, from Helge Deller.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (87 commits)
selftests/bpf: Add tests for rbtree API interaction in sleepable progs
bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
bpf: Consider non-owning refs to refcounted nodes RCU protected
bpf: Reenable bpf_refcount_acquire
bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
bpf: Consider non-owning refs trusted
bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
selftests/bpf: Enable cpu v4 tests for RV64
riscv, bpf: Support unconditional bswap insn
riscv, bpf: Support signed div/mod insns
riscv, bpf: Support 32-bit offset jmp insn
riscv, bpf: Support sign-extension mov insns
riscv, bpf: Support sign-extension load insns
riscv, bpf: Fix missing exception handling and redundant zext for LDX_B/H/W
samples/bpf: Add note to README about the XDP utilities moved to xdp-tools
samples/bpf: Cleanup .gitignore
samples/bpf: Remove the xdp_sample_pkts utility
samples/bpf: Remove the xdp1 and xdp2 utilities
samples/bpf: Remove the xdp_rxq_info utility
samples/bpf: Remove the xdp_redirect* utilities
...
====================
Link: https://lore.kernel.org/r/20230825194319.12727-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All users of cleanup_symbol_name() do not use the return value.
So let us change the return value of cleanup_symbol_name() to
'void' to reflect its usage pattern.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825202036.441212-1-yonghong.song@linux.dev
Signed-off-by: Kees Cook <keescook@chromium.org>
Merge system-wide power management changes and power capping updates
for 6.6-rc1:
- Add device PM helpers to allow a device to remain powered-on during
system-wide transitions (Ulf Hansson).
- Rework hibernation memory snapshotting to avoid storing pages filled
with zeros in hibernation image files (Brian Geffon).
- Add check to make sure that CPU latency QoS constraints do not use
negative values (Clive Lin).
- Optimize rp->domains memory allocation in the Intel RAPL power
capping driver (xiongxin).
- Remove recursion while parsing zones in the arm_scmi power capping
driver (Cristian Marussi).
* pm-sleep:
PM: sleep: Add helpers to allow a device to remain powered-on
PM: hibernate: don't store zero pages in the image file
* pm-qos:
PM: QoS: Add check to make sure CPU latency is non-negative
* powercap:
powercap: intel_rapl: Optimize rp->domains memory allocation
powercap: arm_scmi: Remove recursion while parsing zones
Kernel test robot reported a kallsyms_test failure when clang lto is
enabled (thin or full) and CONFIG_KALLSYMS_SELFTEST is also enabled.
I can reproduce in my local environment with the following error message
with thin lto:
[ 1.877897] kallsyms_selftest: Test for 1750th symbol failed: (tsc_cs_mark_unstable) addr=ffffffff81038090
[ 1.877901] kallsyms_selftest: abort
It appears that commit 8cc32a9bbf ("kallsyms: strip LTO-only suffixes
from promoted global functions") caused the failure. Commit 8cc32a9bbf
changed cleanup_symbol_name() based on ".llvm." instead of '.' where
".llvm." is appended to a before-lto-optimization local symbol name.
We need to propagate such knowledge in kallsyms_selftest.c as well.
Further more, compare_symbol_name() in kallsyms.c needs change as well.
In scripts/kallsyms.c, kallsyms_names and kallsyms_seqs_of_names are used
to record symbol names themselves and index to symbol names respectively.
For example:
kallsyms_names:
...
__amd_smn_rw._entry <== seq 1000
__amd_smn_rw._entry.5 <== seq 1001
__amd_smn_rw.llvm.<hash> <== seq 1002
...
kallsyms_seqs_of_names are sorted based on cleanup_symbol_name() through, so
the order in kallsyms_seqs_of_names actually has
index 1000: seq 1002 <== __amd_smn_rw.llvm.<hash> (actual symbol comparison using '__amd_smn_rw')
index 1001: seq 1000 <== __amd_smn_rw._entry
index 1002: seq 1001 <== __amd_smn_rw._entry.5
Let us say at a particular point, at index 1000, symbol '__amd_smn_rw.llvm.<hash>'
is comparing to '__amd_smn_rw._entry' where '__amd_smn_rw._entry' is the one to
search e.g., with function kallsyms_on_each_match_symbol(). The current implementation
will find out '__amd_smn_rw._entry' is less than '__amd_smn_rw.llvm.<hash>' and
then continue to search e.g., index 999 and never found a match although the actual
index 1001 is a match.
To fix this issue, let us do cleanup_symbol_name() first and then do comparison.
In the above case, comparing '__amd_smn_rw' vs '__amd_smn_rw._entry' and
'__amd_smn_rw._entry' being greater than '__amd_smn_rw', the next comparison will
be > index 1000 and eventually index 1001 will be hit an a match is found.
For any symbols not having '.llvm.' substr, there is no functionality change
for compare_symbol_name().
Fixes: 8cc32a9bbf ("kallsyms: strip LTO-only suffixes from promoted global functions")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308232200.1c932a90-oliver.sang@intel.com
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Song Liu <song@kernel.org>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/r/20230825034659.1037627-1-yonghong.song@linux.dev
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Commit 9e7a4d9831 ("bpf: Allow LSM programs to use bpf spin locks")
disabled bpf_spin_lock usage in sleepable progs, stating:
Sleepable LSM programs can be preempted which means that allowng spin
locks will need more work (disabling preemption and the verifier
ensuring that no sleepable helpers are called when a spin lock is
held).
This patch disables preemption before grabbing bpf_spin_lock. The second
requirement above "no sleepable helpers are called when a spin lock is
held" is implicitly enforced by current verifier logic due to helper
calls in spin_lock CS being disabled except for a few exceptions, none
of which sleep.
Due to above preemption changes, bpf_spin_lock CS can also be considered
a RCU CS, so verifier's in_rcu_cs check is modified to account for this.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
An earlier patch in the series ensures that the underlying memory of
nodes with bpf_refcount - which can have multiple owners - is not reused
until RCU grace period has elapsed. This prevents
use-after-free with non-owning references that may point to
recently-freed memory. While RCU read lock is held, it's safe to
dereference such a non-owning ref, as by definition RCU GP couldn't have
elapsed and therefore underlying memory couldn't have been reused.
From the perspective of verifier "trustedness" non-owning refs to
refcounted nodes are now trusted only in RCU CS and therefore should no
longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them
MEM_RCU in order to reflect this new state.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that all reported issues are fixed, bpf_refcount_acquire can be
turned back on. Also reenable all bpf_refcount-related tests which were
disabled.
This a revert of:
* commit f3514a5d67 ("selftests/bpf: Disable newly-added 'owner' field test until refcount re-enabled")
* commit 7deca5eae8 ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is the final fix for the use-after-free scenario described in
commit 7793fc3bab ("bpf: Make bpf_refcount_acquire fallible for
non-owning refs"). That commit, by virtue of changing
bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
the "refcount incr on 0" splat. The not_zero check in
refcount_inc_not_zero, though, still occurs on memory that could have
been free'd and reused, so the commit didn't properly fix the root
cause.
This patch actually fixes the issue by free'ing using the recently-added
bpf_mem_free_rcu, which ensures that the memory is not reused until
RCU grace period has elapsed. If that has happened then
there are no non-owning references alive that point to the
recently-free'd memory, so it can be safely reused.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's straightforward to prove that kptr_struct_meta must be non-NULL for
any valid call to these kfuncs:
* btf_parse_struct_metas in btf.c creates a btf_struct_meta for any
struct in user BTF with a special field (e.g. bpf_refcount,
{rb,list}_node). These are stored in that BTF's struct_meta_tab.
* __process_kf_arg_ptr_to_graph_node in verifier.c ensures that nodes
have {rb,list}_node field and that it's at the correct offset.
Similarly, check_kfunc_args ensures bpf_refcount field existence for
node param to bpf_refcount_acquire.
* So a btf_struct_meta must have been created for the struct type of
node param to these kfuncs
* That BTF and its struct_meta_tab are guaranteed to still be around.
Any arbitrary {rb,list} node the BPF program interacts with either:
came from bpf_obj_new or a collection removal kfunc in the same
program, in which case the BTF is associated with the program and
still around; or came from bpf_kptr_xchg, in which case the BTF was
associated with the map and is still around
Instead of silently continuing with NULL struct_meta, which caused
confusing bugs such as those addressed by commit 2140a6e342 ("bpf: Set
kptr_struct_meta for node param to list and rbtree insert funcs"), let's
error out. Then, at runtime, we can confidently say that the
implementations of these kfuncs were given a non-NULL kptr_struct_meta,
meaning that special-field-specific functionality like
bpf_obj_free_fields and the bpf_obj_drop change introduced later in this
series are guaranteed to execute.
This patch doesn't change functionality, just makes it easier to reason
about existing functionality.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A trivial execve scalability test which tries to be very friendly
(statically linked binaries, all separate) is predominantly bottlenecked
by back-to-back per-cpu counter allocations which serialize on global
locks.
Ease the pain by allocating and freeing them in one go.
Bench can be found here:
http://apollo.backplane.com/DFlyMisc/doexec.c
$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)
Even at a very modest scale of 26 cores (ops/s):
before: 133543.63
after: 186061.81 (+39%)
While with the patch these allocations remain a significant problem,
the primary bottleneck shifts to page release handling.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20230823050609.2228718-3-mjguzik@gmail.com
[Dennis: reflowed 1 line]
Signed-off-by: Dennis Zhou <dennis@kernel.org>
The function crash_prepare_elf64_headers() generates the elfcorehdr which
describes the CPUs and memory in the system for the crash kernel. In
particular, it writes out ELF PT_NOTEs for memory regions and the CPUs in
the system.
With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed, the
elfcorehdr must again be updated to reflect the new set of CPUs.
The reasoning behind the move to use for_each_possible_cpu(), is:
- At kernel boot time, all percpu crash_notes are allocated for all
possible CPUs; that is, crash_notes are not allocated dynamically
when CPUs are plugged/unplugged. Thus the crash_notes for each
possible CPU are always available.
- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
Changing to for_each_possible_cpu() is valid as the crash_notes
pointed to by each CPU PT_NOTE are present and always valid.
Furthermore, examining a common crash processing path of:
kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
elfcorehdr /proc/vmcore vmcore
reveals how the ELF CPU PT_NOTEs are utilized:
- Upon panic, each CPU is sent an IPI and shuts itself down, recording
its state in its crash_notes. When all CPUs are shutdown, the
crash kernel is launched with a pointer to the elfcorehdr.
- The crash kernel via linux/fs/proc/vmcore.c does not examine or
use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.
- The makedumpfile utility uses /proc/vmcore and reads the CPU
PT_NOTEs to craft a nr_cpus variable, which is reported in a
header but otherwise generally unused. Makedumpfile creates the
vmcore.
- The 'crash' dump analyzer does not appear to reference the CPU
PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
symbols and directly examines those structure contents from vmcore
memory. From that information it is able to determine which CPUs
are present and online, and locate the corresponding crash_notes.
Said differently, it appears that 'crash' analyzer does not rely
on the ELF PT_NOTEs for CPUs; rather it obtains the information
directly via kernel symbols and the memory within the vmcore.
(There maybe other vmcore generating and analysis tools that do use these
PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most common
solution.)
This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr
on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE
for not-present-but-possible CPUs.
On systems where kexec_file_load() syscall is utilized, all the above is
valid. On systems where kexec_load() syscall is utilized, there may be
the need for the elfcorehdr to be regenerated once. The reason being that
some archs only populate the 'present' CPUs from the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility uses
to generate the userspace-supplied elfcorehdr. In this situation, one
memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will be
described, just as with kexec_file_load() syscall.
Link: https://lkml.kernel.org/r/20230814214446.6659-8-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hotplug support for kexec_load() requires changes to the userspace
kexec-tools and a little extra help from the kernel.
Given a kdump capture kernel loaded via kexec_load(), and a subsequent
hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites
it to reflect the hotplug change. That is the desired outcome, however,
at kernel panic time, the purgatory integrity check fails (because the
elfcorehdr changed), and the capture kernel does not boot and no vmcore is
generated.
Therefore, the userspace kexec-tools/kexec must indicate to the kernel
that the elfcorehdr can be modified (because the kexec excluded the
elfcorehdr from the digest, and sized the elfcorehdr memory buffer
appropriately).
To facilitate hotplug support with kexec_load():
- a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is
safe for the kernel to modify the kexec_load()'d elfcorehdr
- the /sys/kernel/crash_elfcorehdr_size node communicates the
preferred size of the elfcorehdr memory buffer
- The sysfs crash_hotplug nodes (ie.
/sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
take into account kexec_file_load() vs kexec_load() and
KEXEC_UPDATE_ELFCOREHDR.
This is critical so that the udev rule processing of crash_hotplug
is all that is needed to determine if the userspace unload-then-load
of the kdump image is to be skipped, or not. The proposed udev
rule change looks like:
# The kernel updates the crash elfcorehdr for CPU and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):
Kernel |Kexec
-------+-----+----
Old |Old |New
| a | a
-------+-----+----
New | a | b
-------+-----+----
where kexec 'old' and 'new' delineate kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and 'new'
delineate the kernel supports this crash hotplug feature.
Behavior 'a' indicates the unload-then-reload of the entire kdump image.
For the kexec 'old' column, the unload-then-reload occurs due to the
missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec)
does not present the crash_hotplug sysfs node, which leads to the
unload-then-reload of the kdump image.
Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload of
the kdump image.
If the udev rule is not updated with crash_hotplug node check, then no
matter any combination of kernel or kexec is new or old, the kdump image
continues to be unload-then-reload on hotplug changes.
To fully support crash hotplug feature, there needs to be a rollout of
kernel, kexec-tools and udev rule changes. However, the order of the
rollout of these pieces does not matter; kexec_load()'d kdump images still
function for hotplug as-is.
Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (ie crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments is
calculated for integrity checking. The digest is embedded into the
purgatory image prior to placing in memory.
Updates to the elfcorehdr in response to CPU and memory changes would
cause the purgatory integrity checking to fail (at crash time, and no
vmcore created). Therefore, the elfcorehdr segment is explicitly excluded
from the purgatory digest, enabling updates to the elfcorehdr while also
avoiding the need to recompute the hash digest and reload purgatory.
Link: https://lkml.kernel.org/r/20230814214446.6659-4-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/ onlining).
The crash elfcorehdr describes the CPUs and memory to be written into the
vmcore.
To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to other
cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN.
CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a
new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state
in the PREPARE group, just prior to the STARTING group, which is very
close to the CPU starting up in a plug/online situation, or stopping in a
unplug/ offline situation. This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the elfcorehdr
would be inaccurate. Note that for a CPU being unplugged or offlined, the
CPU will still be present in the list of CPUs generated by
crash_prepare_elf64_headers(). However, there is no need to explicitly
omit the CPU, see justification in 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()'.
To track memory changes, a notifier is registered to capture the memblock
MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the process,
the kexec_lock is held.
Link: https://lkml.kernel.org/r/20230814214446.6659-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.
Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also be
updated.
The elfcorehdr describes to kdump the CPUs and memory in the system, and
any inaccuracies can result in a vmcore with missing CPU context or memory
regions.
The current solution utilizes udev to initiate an unload-then-reload of
the kdump image (eg. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In the original post I
outlined the significant performance problems related to offloading this
activity to userspace.
This patchset introduces a generic crash handler that registers with the
CPU and memory notifiers. Upon CPU or memory changes, from either hot
un/plug or off/onlining, this generic handler is invoked and performs
important housekeeping, for example obtaining the appropriate lock, and
then invokes an architecture specific handler to do the appropriate
elfcorehdr update.
Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.
In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement with
userspace needed.
To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:
- Prevent udev from updating kdump crash kernel on hot un/plug changes.
Add the following as the first lines to the RHEL udev rule file
/usr/lib/udev/rules.d/98-kexec.rules:
# The kernel updates the crash elfcorehdr for CPU and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
With this changeset applied, the two rules evaluate to false for
CPU and memory change events and thus skip the userspace
unload-then-reload of kdump.
- Change to the kexec_file_load for loading the kdump kernel:
Eg. on RHEL: in /usr/bin/kdumpctl, change to:
standard_kexec_args="-p -d -s"
which adds the -s to select kexec_file_load() syscall.
This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility is
posted to the kexec-tools mailing list here:
http://lists.infradead.org/pipermail/kexec/2023-May/027049.html
To use the kexec-tools patch, apply, build and install kexec-tools, then
change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug. The removal of -s reverts to the kexec_load syscall and the
addition of --hotplug invokes the changes put forth in the kexec-tools
patch.
This patch (of 8):
The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be move outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.
In addition, struct crash_mem and crash_notes were moved to new locales so
that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.
No functionality change intended.
Link: https://lkml.kernel.org/r/20230814214446.6659-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While debugging a recent kallsyms_selftest failure[1], I needed more
details on what specifically was failing. This adds those details for
each failure state that is checked.
[1] https://lore.kernel.org/all/202308232200.1c932a90-oliver.sang@intel.com/
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Yonghong Song <yhs@meta.com>
Cc: "Erhard F." <erhard_f@mailbox.org>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Currently, in function bpf_obj_free_fields(), for local kptr,
a warning will be issued if the struct does not contain any
special fields. But actually the kernel seems totally okay
with a local kptr without any special fields. Permitting
no special fields also aligns with future percpu kptr which
also allows no special fields.
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230824063417.201925-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After we converted the capabilities of our networking-bpf program from
cap_sys_admin to cap_net_admin+cap_bpf, our networking-bpf program
failed to start. Because it failed the bpf verifier, and the error log
is "R3 pointer comparison prohibited".
A simple reproducer as follows,
SEC("cls-ingress")
int ingress(struct __sk_buff *skb)
{
struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);
if ((long)(iph + 1) > (long)skb->data_end)
return TC_ACT_STOLEN;
return TC_ACT_OK;
}
Per discussion with Yonghong and Alexei [1], comparison of two packet
pointers is not a pointer leak. This patch fixes it.
Our local kernel is 6.1.y and we expect this fix to be backported to
6.1.y, so stable is CCed.
[1]. https://lore.kernel.org/bpf/CAADnVQ+Nmspr7Si+pxWn8zkE7hX-7s93ugwC+94aXSy4uQ9vBg@mail.gmail.com/
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230823020703.3790-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Back when set_fs() was a generic API for altering the address limit,
addr_limit_user_check() was a safety measure to prevent userspace being
able to issue syscalls with an unbound limit.
With the the removal of set_fs() as a generic API, the last user of
addr_limit_user_check() was removed in commit:
b5a5a01d8e ("arm64: uaccess: remove addr_limit_user_check()")
... as since that commit, no architecture defines TIF_FSCHECK, and hence
addr_limit_user_check() always expands to nothing.
Remove addr_limit_user_check(), updating the comment in
exit_to_user_mode_prepare() to no longer refer to it. At the same time,
the comment is reworded to be a little more generic so as to cover
kmap_assert_nomap() in addition to lockdep_sys_exit().
No functional change.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230821163526.2319443-1-mark.rutland@arm.com
Assume the fprobe event is a return event if there is $retval is
used in the probe's argument without %return. e.g.
echo 'f:myevent vfs_read $retval' >> dynamic_events
then 'myevent' is a return probe event.
Link: https://lore.kernel.org/all/169272160261.160970.13613040161560998787.stgit@devnote2/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a string type checking with BTF information if possible.
This will check whether the given BTF argument (and field) is
signed char array or pointer to signed char. If not, it reject
the 'string' type. If it is pointer to signed char, it adds
a dereference opration so that it can correctly fetch the
string data from memory.
# echo 'f getname_flags%return retval->name:string' >> dynamic_events
# echo 't sched_switch next->comm:string' >> dynamic_events
The above cases, 'struct filename::name' is 'char *' and
'struct task_struct::comm' is 'char []'. But in both case,
user can specify ':string' to fetch the string data.
Link: https://lore.kernel.org/all/169272159250.160970.1881112937198526188.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add btf_find_struct_member() API to search a member of a given data structure
or union from the member's name.
Link: https://lore.kernel.org/all/169272156248.160970.8868479822371129043.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Move generic function-proto find API and getting function parameter API
to BTF library code from trace_probe.c. This will avoid redundant efforts
on different feature.
Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in
the vmlinux, BTF argument is not available on the functions in the modules.
Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind()
so that BTF argument can find the correct struct btf and btf_type in it.
With this fix, fprobe events can use `$arg*` on module functions as below
# grep nf_log_ip_packet /proc/kallsyms
ffffffffa0005c00 t nf_log_ip_packet [nf_log_syslog]
ffffffffa0005bf0 t __pfx_nf_log_ip_packet [nf_log_syslog]
# echo 'f nf_log_ip_packet $arg*' > dynamic_events
# cat dynamic_events
f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix
To support the module's btf which is removable, the struct btf needs to be
ref-counted. So this also records the btf in the traceprobe_parse_context
and returns the refcount when the parse has done.
Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use struct_size() instead of hand-writing it, when allocating a structure
with a flex array.
This is less verbose.
Link: https://lore.kernel.org/all/20230725195424.3469242-1-ruanjinjie@huawei.com/
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The commit being fixed introduced a hunk into check_func_arg_reg_off
that bypasses reg->off == 0 enforcement when offset points to a graph
node or root. This might possibly be done for treating bpf_rbtree_remove
and others as KF_RELEASE and then later check correct reg->off in helper
argument checks.
But this is not the case, those helpers are already not KF_RELEASE and
permit non-zero reg->off and verify it later to match the subobject in
BTF type.
However, this logic leads to bpf_obj_drop permitting free of register
arguments with non-zero offset when they point to a graph root or node
within them, which is not ok.
For instance:
struct foo {
int i;
int j;
struct bpf_rb_node node;
};
struct foo *f = bpf_obj_new(typeof(*f));
if (!f) ...
bpf_obj_drop(f); // OK
bpf_obj_drop(&f->i); // still ok from verifier PoV
bpf_obj_drop(&f->node); // Not OK, but permitted right now
Fix this by dropping the whole part of code altogether.
Fixes: 6a3cd3318f ("bpf: Migrate release_on_unlock logic to non-owning ref semantics")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822175140.1317749-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
CPU latency should never be negative, which will be incorrectly high
when converted to unsigned data type.
Commit 8d36694245 ("PM: QoS: Add check to make sure CPU freq is
non-negative") makes sure CPU frequency is non-negative to fix incorrect
behavior in freqency QoS.
Add an analogous check to make sure CPU latency is non-negative so as to
prevent this problem from happening in CPU latency QoS.
Signed-off-by: Clive Lin <clive.lin@mediatek.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When reviewing local percpu kptr support, Alexei discovered a bug
wherea bpf_kptr_xchg() may succeed even if the map value kptr type and
locally allocated obj type do not match ([1]). Missed struct btf_id
comparison is the reason for the bug. This patch added such struct btf_id
comparison and will flag verification failure if types do not match.
[1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t
Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 738c96d5e2 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822050053.2886960-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Several of the list traversals in the user_events facility use safe list
traversals where they could be using the unsafe versions instead.
Replace these safe traversals with their unsafe counterparts in the
interest of optimization.
Link: https://lore.kernel.org/linux-trace-kernel/20230810194337.695983-1-ervaughn@linux.microsoft.com
Suggested-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Eric Vaughn <ervaughn@linux.microsoft.com>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit 9457158bbc ("tracing: Fix reset of time stamps during trace_clock changes")
left behind tracing_reset_current() declaration.
Also commit 6954e41526 ("tracing: Place trace_pid_list logic into abstract functions")
removed trace_free_pid_list() implementation but leave declaration.
Link: https://lore.kernel.org/linux-trace-kernel/20230803144028.25492-1-yuehaibing@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Per the previous commits, we now only enter do_filter_scalar_cpumask() with
a mask of weight greater than one. Optimise the equality checks.
Link: https://lkml.kernel.org/r/20230707172155.70873-9-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.
In this case we can directly re-use filter_pred_cpu(), we just need to
transform '&' into '==' before executing it.
Link: https://lkml.kernel.org/r/20230707172155.70873-8-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.
When the mask contains a single CPU, directly re-use the unsigned field
predicate functions. Transform '&' into '==' beforehand.
Link: https://lkml.kernel.org/r/20230707172155.70873-7-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.
Reuse do_filter_scalar_cpumask() when the input mask has a weight of one.
Link: https://lkml.kernel.org/r/20230707172155.70873-6-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).
A previous commit added filtering scalar fields by a user-given cpumask,
make this work with the CPU common field as well.
This enables doing things like
$ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
-e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'
Link: https://lkml.kernel.org/r/20230707172155.70873-5-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...
Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.
Allow these to be filtered by a user-provided cpumask, which enables e.g.:
$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'
Link: https://lkml.kernel.org/r/20230707172155.70873-4-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.
Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.
The user-provided cpumask is to be given in cpulist format and wrapped as:
"CPUS{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of CPUS{...} as a
full-fledged predicate subexpression.
This enables e.g.:
$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'
Link: https://lkml.kernel.org/r/20230707172155.70873-3-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.
A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.
While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.
No change in behaviour intended.
The kfree()'s were patched via Coccinelle:
@@
struct filter_pred *pred;
@@
-kfree(pred);
+free_predicate(pred);
Link: https://lkml.kernel.org/r/20230707172155.70873-2-vschneid@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Adding support for bpf_get_func_ip helper being called from
ebpf program attached by uprobe_multi link.
It returns the ip of the uprobe.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.
Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.
We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding support to specify cookies array for uprobe_multi link.
The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).
The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.
Uprobes to attach are specified via new link_create uprobe_multi
union:
struct {
__aligned_u64 path;
__aligned_u64 offsets;
__aligned_u64 ref_ctr_offsets;
__u32 cnt;
__u32 flags;
} uprobe_multi;
Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.
The 'flags' supports single bit for now that marks the uprobe as
return probe.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After synchronous_rcu(), both the dettached XDP program and
xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry
will be cpu_map_kthread_run(), so instead of calling
__cpu_map_entry_replace() to stop kthread and cleanup entry after a RCU
grace period, do these things directly.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As for now __cpu_map_entry_replace() uses call_rcu() to wait for the
inflight xdp program to exit the RCU read critical section, and then
launch kworker cpu_map_kthread_stop() to call kthread_stop() to flush
all pending xdp frames or skbs.
But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to
wait for the completion of __cpu_map_entry_free(), because rcu_barrier()
will wait for all pending RCU callbacks and cpu_map_kthread_stop() only
needs to wait for the completion of a specific __cpu_map_entry_free().
So use queue_rcu_work() to replace call_rcu(), schedule_work() and
rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free()
kworker after a RCU grace period. Because __cpu_map_entry_free() is
running in a kworker context, so it is OK to do all of these freeing
procedures include kthread_stop() in it.
After the update, there is no need to do reference-counting for
bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in
__cpu_map_entry_free(), so just remove it.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Store the folio order in the low byte of the flags word in the first tail
page. This frees up the word that was being used to store the order and
dtor bytes previously.
Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stored in the first tail page's flags, this flag replaces the destructor.
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.
Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can use a bit in page[1].flags to indicate that this folio belongs to
hugetlb instead of using a value in page[1].dtors. That lets
folio_test_hugetlb() become an inline function like it should be. We can
also get rid of NULL_COMPOUND_DTOR.
Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is only one Kconfig user of CONFIG_EMBEDDED and it can be switched
to EXPERT or "if !ARCH_MULTIPLATFORM" (suggested by Arnd).
Link: https://lkml.kernel.org/r/20230816055010.31534-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [RISC-V]
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Russell King <linux@armlinux.org.uk>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On the parisc architecture, lockdep reports for all static objects which
are in the __initdata section (e.g. "setup_done" in devtmpfs,
"kthreadd_done" in init/main.c) this warning:
INFO: trying to register non-static key.
The warning itself is wrong, because those objects are in the __initdata
section, but the section itself is on parisc outside of range from
_stext to _end, which is why the static_obj() functions returns a wrong
answer.
While fixing this issue, I noticed that the whole existing check can
be simplified a lot.
Instead of checking against the _stext and _end symbols (which include
code areas too) just check for the .data and .bss segments (since we check a
data object). This can be done with the existing is_kernel_core_data()
macro.
In addition objects in the __initdata section can be checked with
init_section_contains(), and is_kernel_rodata() allows keys to be in the
_ro_after_init section.
This partly reverts and simplifies commit bac59d18c7 ("x86/setup: Fix static
memory detection").
Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100
Fixes: bac59d18c7 ("x86/setup: Fix static memory detection")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xchg originated in 6e399cd144 ("prctl: avoid using mmap_sem for exe_file
serialization"). While the commit message does not explain *why* the
change, I found the original submission [1] which ultimately claims it
cleans things up by removing dependency of exe_file on the semaphore.
However, fe69d560b5 ("kernel/fork: always deny write access to current
MM exe_file") added a semaphore up/down cycle to synchronize the state of
exe_file against fork, defeating the point of the original change.
This is on top of semaphore trips already present both in the replacing
function and prctl (the only consumer).
Normally replacing exe_file does not happen for busy processes, thus
write-locking is not an impediment to performance in the intended use
case. If someone keeps invoking the routine for a busy processes they are
trying to play dirty and that's another reason to avoid any trickery.
As such I think the atomic here only adds complexity for no benefit.
Just write-lock around the replacement.
I also note that replacement races against the mapping check loop as
nothing synchronizes actual assignment with with said checks but I am not
addressing it in this patch. (Is the loop of any use to begin with?)
Link: https://lore.kernel.org/linux-mm/1424979417.10344.14.camel@stgolabs.net/ [1]
Link: https://lkml.kernel.org/r/20230814172140.1777161-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.
The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):
% cat mount-memfd.c
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/mount.h>
#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
int main(void)
{
int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert(fsfd >= 0);
assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));
int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert(dfd >= 0);
int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
assert(execfd >= 0);
assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
assert(!close(execfd));
char *execpath = NULL;
char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
assert(execfd >= 0);
assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%
Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.
/* New API */
The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.
So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.
Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).
/* Backwards Compatibility */
As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.
However it should be noted that now that the setting is completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:
* The sysctl is very new, having been merged in 6.3.
* Several aspects of the sysctl were broken up until this patchset and
the other patchset by Jeff Xu last month.
And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.
[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
Fixes: 105ff5339f ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the helpers to simplify code, also kill unneeded goto cpy_name.
Link: https://lkml.kernel.org/r/20230728050043.59880-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Göttsche <cgzones@googlemail.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No portable code calls into this function any more, and on architectures
that don't use or define their own, it causes a warning:
kernel/iomem.c:10:22: warning: no previous prototype for 'ioremap_cache' [-Wmissing-prototypes]
10 | __weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
Fold it into the only caller that uses it on architectures
without the #define.
Note that the fallback to ioremap is probably still wrong on
those architectures, but this is what it's always done there.
Link: https://lkml.kernel.org/r/20230726145432.1617809-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
atomic_t variables are currently used to implement reference counters
with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows and
underflows. This is important since overflows and underflows can lead
to use-after-free situation and be exploitable.
The variable nsproxy.count is used as pure reference counter. Convert it
to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in refcount.h have different
memory ordering guarantees than their atomic counterparts. Please check
Documentation/core-api/refcount-vs-atomic.rst for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in some
rare cases it might matter. Please double check that you don't have
some undocumented memory guarantees for this variable usage.
For the nsproxy.count it might make a difference in following places:
- put_nsproxy() and switch_task_namespaces(): decrement in
refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE
ordering on success vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230818041327.gonna.210-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
There is race issue when concurrently splice_read main trace_pipe and
per_cpu trace_pipes which will result in data read out being different
from what actually writen.
As suggested by Steven:
> I believe we should add a ref count to trace_pipe and the per_cpu
> trace_pipes, where if they are opened, nothing else can read it.
>
> Opening trace_pipe locks all per_cpu ref counts, if any of them are
> open, then the trace_pipe open will fail (and releases any ref counts
> it had taken).
>
> Opening a per_cpu trace_pipe will up the ref count for just that
> CPU buffer. This will allow multiple tasks to read different per_cpu
> trace_pipe files, but will prevent the main trace_pipe file from
> being opened.
But because we only need to know whether per_cpu trace_pipe is open or
not, using a cpumask instead of using ref count may be easier.
After this patch, users will find that:
- Main trace_pipe can be opened by only one user, and if it is
opened, all per_cpu trace_pipes cannot be opened;
- Per_cpu trace_pipes can be opened by multiple users, but each per_cpu
trace_pipe can only be opened by one user. And if one of them is
opened, main trace_pipe cannot be opened.
Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 77c12fc959 ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") we started storing a `struct cpumask` on the
stack in watchdog_hardlockup_check(). On systems with CONFIG_NR_CPUS set
to 8192 this takes up 1K on the stack. That triggers warnings with
`CONFIG_FRAME_WARN` set to 1024.
We'll use the new trigger_allbutcpu_cpu_backtrace() to avoid needing to
use a CPU mask at all.
Link: https://lkml.kernel.org/r/20230804065935.v4.2.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid
Fixes: 77c12fc959 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU. This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.
Let's extend the API to take a CPU ID to exclude instead of just a
boolean. This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.
Arguably, this new API also encourages safer behavior. Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.
[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are no in-kernel users of __kthread_should_park() so mark it as
static and do not export it.
Link: https://lkml.kernel.org/r/2023080450-handcuff-stump-1d6e@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: John Stultz <jstultz@google.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Zqiang <qiang1.zhang@intel.com>
Cc: Prathu Baronia <quic_pbaronia@quicinc.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gcov uses global functions that are called from generated code, but these
have no prototype in a header, which causes a W=1 build warning:
kernel/gcov/gcc_base.c:12:6: error: no previous prototype for '__gcov_init' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:40:6: error: no previous prototype for '__gcov_flush' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:46:6: error: no previous prototype for '__gcov_merge_add' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:52:6: error: no previous prototype for '__gcov_merge_single' [-Werror=missing-prototypes]
Just turn off these warnings unconditionally for the two files that
contain them.
Link: https://lore.kernel.org/all/0820010f-e9dc-779d-7924-49c7df446bce@linux.ibm.com/
Link: https://lkml.kernel.org/r/20230725123042.2269077-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
buf is assigned first, so it does not need to initialize the assignment.
Link: https://lkml.kernel.org/r/20230713234459.2908-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch is a minor cleanup to the series "refactor Kconfig to
consolidate KEXEC and CRASH options".
In that series, a new option ARCH_DEFAULT_KEXEC was introduced in order to
obtain the equivalent behavior of s390 original Kconfig settings for
KEXEC. As it turns out, this new option did not fully provide the
equivalent behavior, rather a "select KEXEC" did.
As such, the ARCH_DEFAULT_KEXEC is not needed anymore, so remove it.
Link: https://lkml.kernel.org/r/20230802161750.2215-1-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Kconfig refactor to consolidate KEXEC and CRASH options utilized
option names of the form ARCH_SUPPORTS_<option>. Thus rename the
ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow
the same.
Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "refactor Kconfig to consolidate KEXEC and CRASH options", v6.
The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.
The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".
The following options are impacted:
- KEXEC
- KEXEC_FILE
- KEXEC_SIG
- KEXEC_SIG_FORCE
- KEXEC_IMAGE_VERIFY_SIG
- KEXEC_BZIMAGE_VERIFY_SIG
- KEXEC_JUMP
- CRASH_DUMP
Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.
The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):
- arm
- arm64
- ia64
- loongarch
- m68k
- mips
- parisc
- powerpc
- riscv
- s390
- sh
- x86
More information:
In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"
https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devolder@oracle.com/
the new kernel feature introduces the config option CRASH_HOTPLUG.
In reviewing, Thomas Gleixner requested that the new config option
not be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoid/stopped.
https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/
To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options. Generally speaking, this work has
the following themes:
- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
- These items from arch/Kconfig:
CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC
- These items from arch/x86/Kconfig form the common options:
KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE
KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP
- These items from arch/arm64/Kconfig form the common options:
KEXEC_IMAGE_VERIFY_SIG
- The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
and is now listed in "General Setup" submenu from init/Kconfig.
- To control the common options, each has a new ARCH_SUPPORTS_<option>
option. These gateway options determine whether the common options
options are valid for the architecture.
- To account for the slight differences in the original architecture
coding of the common options, each now has a corresponding
ARCH_SELECTS_<option> which are used to elicit the same side effects
as the original arch/<arch>/Kconfig files for KEXEC and CRASH options.
An example, 'make menuconfig' illustrating the submenu:
> General setup > Kexec and crash features
[*] Enable kexec system call
[*] Enable kexec file based system call
[*] Verify kernel signature during kexec_file_load() syscall
[ ] Require a valid signature in kexec_file_load() syscall
[ ] Enable bzImage signature verification support
[*] kexec jump
[*] kernel crash dumps
[*] Update the crash elfcorehdr on system configuration changes
In the process of consolidating the common options, I encountered
slight differences in the coding of these options in several of the
architectures. As a result, I settled on the following solution:
- Each of the common options has a 'depends on ARCH_SUPPORTS_<option>'
statement. For example, the KEXEC_FILE option has a 'depends on
ARCH_SUPPORTS_KEXEC_FILE' statement.
This approach is needed on all common options so as to prevent
options from appearing for architectures which previously did
not allow/enable them. For example, arm supports KEXEC but not
KEXEC_FILE. The arch/arm/Kconfig does not provide
ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options
are not available to arm.
- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
determine when the feature is allowed. Archs which don't have the
feature simply do not provide the corresponding ARCH_SUPPORTS_<option>.
For each arch, where there previously were KEXEC and/or CRASH
options, these have been replaced with the corresponding boolean
ARCH_SUPPORTS_<option>, and an appropriate def_bool statement.
For example, if the arch supports KEXEC_FILE, then the
ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits
the KEXEC_FILE option to be available.
If the arch has a 'depends on' statement in its original coding
of the option, then that expression becomes part of the def_bool
expression. For example, arm64 had:
config KEXEC
depends on PM_SLEEP_SMP
and in this solution, this converts to:
config ARCH_SUPPORTS_KEXEC
def_bool PM_SLEEP_SMP
- In order to account for the architecture differences in the
coding for the common options, the ARCH_SELECTS_<option> in the
arch/<arch>/Kconfig is used. This option has a 'depends on
<option>' statement to couple it to the main option, and from
there can insert the differences from the common option and the
arch original coding of that option.
For example, a few archs enable CRYPTO and CRYTPO_SHA256 for
KEXEC_FILE. These require a ARCH_SELECTS_KEXEC_FILE and
'select CRYPTO' and 'select CRYPTO_SHA256' statements.
Illustrating the option relationships:
For each of the common KEXEC and CRASH options:
ARCH_SUPPORTS_<option> <- <option> <- ARCH_SELECTS_<option>
<option> # in Kconfig.kexec
ARCH_SUPPORTS_<option> # in arch/<arch>/Kconfig, as needed
ARCH_SELECTS_<option> # in arch/<arch>/Kconfig, as needed
For example, KEXEC:
ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC
KEXEC # in Kconfig.kexec
ARCH_SUPPORTS_KEXEC # in arch/<arch>/Kconfig, as needed
ARCH_SELECTS_KEXEC # in arch/<arch>/Kconfig, as needed
To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).
Examples:
A few examples to show the new strategy in action:
===== x86 (minus the help section) =====
Original:
config KEXEC
bool "kexec system call"
select KEXEC_CORE
config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
select HAVE_IMA_KEXEC if IMA
depends on X86_64
depends on CRYPTO=y
depends on CRYPTO_SHA256=y
config ARCH_HAS_KEXEC_PURGATORY
def_bool KEXEC_FILE
config KEXEC_SIG
bool "Verify kernel signature during kexec_file_load() syscall"
depends on KEXEC_FILE
config KEXEC_SIG_FORCE
bool "Require a valid signature in kexec_file_load() syscall"
depends on KEXEC_SIG
config KEXEC_BZIMAGE_VERIFY_SIG
bool "Enable bzImage signature verification support"
depends on KEXEC_SIG
depends on SIGNED_PE_FILE_VERIFICATION
select SYSTEM_TRUSTED_KEYRING
config CRASH_DUMP
bool "kernel crash dumps"
depends on X86_64 || (X86_32 && HIGHMEM)
config KEXEC_JUMP
bool "kexec jump"
depends on KEXEC && HIBERNATION
help
becomes...
New:
config ARCH_SUPPORTS_KEXEC
def_bool y
config ARCH_SUPPORTS_KEXEC_FILE
def_bool X86_64 && CRYPTO && CRYPTO_SHA256
config ARCH_SELECTS_KEXEC_FILE
def_bool y
depends on KEXEC_FILE
select HAVE_IMA_KEXEC if IMA
config ARCH_SUPPORTS_KEXEC_PURGATORY
def_bool KEXEC_FILE
config ARCH_SUPPORTS_KEXEC_SIG
def_bool y
config ARCH_SUPPORTS_KEXEC_SIG_FORCE
def_bool y
config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
def_bool y
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
===== powerpc (minus the help section) =====
Original:
config KEXEC
bool "kexec system call"
depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
select KEXEC_CORE
config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
select HAVE_IMA_KEXEC if IMA
select KEXEC_ELF
depends on PPC64
depends on CRYPTO=y
depends on CRYPTO_SHA256=y
config ARCH_HAS_KEXEC_PURGATORY
def_bool KEXEC_FILE
config CRASH_DUMP
bool "Build a dump capture kernel"
depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
select RELOCATABLE if PPC64 || 44x || PPC_85xx
becomes...
New:
config ARCH_SUPPORTS_KEXEC
def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
config ARCH_SUPPORTS_KEXEC_FILE
def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y
config ARCH_SUPPORTS_KEXEC_PURGATORY
def_bool KEXEC_FILE
config ARCH_SELECTS_KEXEC_FILE
def_bool y
depends on KEXEC_FILE
select KEXEC_ELF
select HAVE_IMA_KEXEC if IMA
config ARCH_SUPPORTS_CRASH_DUMP
def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
config ARCH_SELECTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
select RELOCATABLE if PPC64 || 44x || PPC_85xx
Testing Approach and Results
There are 388 config files in the arch/<arch>/configs directories.
For each of these config files, a .config is generated both before and
after this Kconfig series, and checked for equivalence. This approach
allows for a rather rapid check of all architectures and a wide
variety of configs wrt/ KEXEC and CRASH, and avoids requiring
compiling for all architectures and running kernels and run-time
testing.
For each config file, the olddefconfig, allnoconfig and allyesconfig
targets are utilized. In testing the randconfig has revealed problems
as well, but is not used in the before and after equivalence check
since one can not generate the "same" .config for before and after,
even if using the same KCONFIG_SEED since the option list is
different.
As such, the following script steps compare the before and after
of 'make olddefconfig'. The new symbols introduced by this series
are filtered out, but otherwise the config files are PASS only if
they were equivalent, and FAIL otherwise.
The script performs the test by doing the following:
# Obtain the "golden" .config output for given config file
# Reset test sandbox
git checkout master
git branch -D test_Kconfig
git checkout -B test_Kconfig master
make distclean
# Write out updated config
cp -f <config file> .config
make ARCH=<arch> olddefconfig
# Track each item in .config, LHSB is "golden"
scoreboard .config
# Obtain the "changed" .config output for given config file
# Reset test sandbox
make distclean
# Apply this Kconfig series
git am <this Kconfig series>
# Write out updated config
cp -f <config file> .config
make ARCH=<arch> olddefconfig
# Track each item in .config, RHSB is "changed"
scoreboard .config
# Determine test result
# Filter-out new symbols introduced by this series
# Filter-out symbol=n which not in either scoreboard
# Compare LHSB "golden" and RHSB "changed" scoreboards and issue PASS/FAIL
The script was instrumental during the refactoring of Kconfig as it
continually revealed problems. The end result being that the solution
presented in this series passes all configs as checked by the script,
with the following exceptions:
- arch/ia64/configs/zx1_config with olddefconfig
This config file has:
# CONFIG_KEXEC is not set
CONFIG_CRASH_DUMP=y
and this refactor now couples KEXEC to CRASH_DUMP, so it is not
possible to enable CRASH_DUMP without KEXEC.
- arch/sh/configs/* with allyesconfig
The arch/sh/Kconfig codes CRASH_DUMP as dependent upon BROKEN_ON_MMU
(which clearly is not meant to be set). This symbol is not provided
but with the allyesconfig it is set to yes which enables CRASH_DUMP.
But KEXEC is coded as dependent upon MMU, and is set to no in
arch/sh/mm/Kconfig, so KEXEC is not enabled.
This refactor now couples KEXEC to CRASH_DUMP, so it is not
possible to enable CRASH_DUMP without KEXEC.
While the above exceptions are not equivalent to their original,
the config file produced is valid (and in fact better wrt/ CRASH_DUMP
handling).
This patch (of 14)
The config options for kexec and crash features are consolidated
into new file kernel/Kconfig.kexec. Under the "General Setup" submenu
is a new submenu "Kexec and crash handling". All the kexec and
crash options that were once in the arch-dependent submenu "Processor
type and features" are now consolidated in the new submenu.
The following options are impacted:
- KEXEC
- KEXEC_FILE
- KEXEC_SIG
- KEXEC_SIG_FORCE
- KEXEC_BZIMAGE_VERIFY_SIG
- KEXEC_JUMP
- CRASH_DUMP
The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.
Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.
Architectures can utilize the new ARCH_SELECTS_<option> config
options to specify additional components when <option> is enabled.
To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).
Link: https://lkml.kernel.org/r/20230712161545.87870-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230712161545.87870-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Cc. "H. Peter Anvin" <hpa@zytor.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marc Aurèle La France <tsi@tuyoix.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Li <xin3.li@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make the print-fatal-signals message more useful by printing the comm
and the exe name for the process which received the fatal signal:
Before:
potentially unexpected fatal signal 4
potentially unexpected fatal signal 11
After:
buggy-program: pool: potentially unexpected fatal signal 4
some-daemon: gdbus: potentially unexpected fatal signal 11
comm used to be present but was removed in commit 681a90ffe8
("arc, print-fatal-signals: reduce duplicated information") because it's
also included as part of the later stack trace. Having the comm as part
of the main "unexpected fatal..." print is rather useful though when
analysing logs, and the exe name is also valuable as shown in the
examples above where the comm ends up having some generic name like
"pool".
[akpm@linux-foundation.org: don't include linux/file.h twice]
Link: https://lkml.kernel.org/r/20230707-fatal-comm-v1-1-400363905d5e@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions. Therefore there is no need to explicitly notify
or invalidate as part of the range end functions. This means we can
remove mmu_notifier_invalidate_range_end_only() and some of the
ptep_*_notify() functions.
Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in
mm/ioremap.c. Move it into its own new header file linux/ioremap.h.
Link: https://lkml.kernel.org/r/20230706154520.11257-17-bhe@redhat.com
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
HASH_SMALL only works when parameter numentries is 0. But the sole caller
futex_init() never calls alloc_large_system_hash() with numentries set to
0. So HASH_SMALL is obsolete and remove it.
Link: https://lkml.kernel.org/r/20230625021323.849147-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All callers of show_mem() pass 0 and NULL, so we can remove the two
arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in
show_mem().
Link: https://lkml.kernel.org/r/20230630062253.189440-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change the notation from pointer-to-array to pointer-to-pointer.
With this, we avoid the compiler complaining about trying
to access a region of size zero as an argument during function
calls.
This is a workaround to prevent the compiler complaining about
accessing an array of size zero when evaluating the arguments
of a couple of function calls. See below:
kernel/cgroup/cgroup.c: In function 'find_css_set':
kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
1206 | cset = find_existing_css_set(old_cset, cgrp, template);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state *[0]'
kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set'
1071 | static struct css_set *find_existing_css_set(struct css_set *old_cset,
| ^~~~~~~~~~~~~~~~~~~~~
With the change to pointer-to-pointer, the functions are not prevented
from being executed, and they will do what they have to do when
CGROUP_SUBSYS_COUNT == 0.
Address the following -Wstringop-overflow warnings seen when
built with ARM architecture and aspeed_g4_defconfig configuration
(notice that under this configuration CGROUP_SUBSYS_COUNT == 0):
kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
This results in no differences in binary output.
Link: https://github.com/KSPP/linux/issues/316
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The kerndoc for some struct member and function arguments were missing.
Add them.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308171742.AncabIG1-lkp@intel.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
Kmemleak report a leak in graph_trace_open():
unreferenced object 0xffff0040b95f4a00 (size 128):
comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
hex dump (first 32 bytes):
e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
backtrace:
[<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
[<000000007df90faa>] graph_trace_open+0xb0/0x344
[<00000000737524cd>] __tracing_open+0x450/0xb10
[<0000000098043327>] tracing_open+0x1a0/0x2a0
[<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
[<000000004015bcd6>] vfs_open+0x98/0xd0
[<000000002b5f60c9>] do_open+0x520/0x8d0
[<00000000376c7820>] path_openat+0x1c0/0x3e0
[<00000000336a54b5>] do_filp_open+0x14c/0x324
[<000000002802df13>] do_sys_openat2+0x2c4/0x530
[<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
[<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
[<00000000313647bf>] do_el0_svc+0xac/0xec
[<000000002ef1c651>] el0_svc+0x20/0x30
[<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
[<000000000c309c35>] el0_sync+0x160/0x180
The root cause is descripted as follows:
__tracing_open() { // 1. File 'trace' is being opened;
...
*iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is
// currently set;
...
iter->trace->open(iter); // 3. Call graph_trace_open() here,
// and memory are allocated in it;
...
}
s_start() { // 4. The opened file is being read;
...
*iter->trace = *tr->current_trace; // 5. If tracer is switched to
// 'nop' or others, then memory
// in step 3 are leaked!!!
...
}
To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.
Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Mike and others noticed that EEVDF does like to over-schedule quite a
bit -- which does hurt performance of a number of benchmarks /
workloads.
In particular, what seems to cause over-scheduling is that when lag is
of the same order (or larger) than the request / slice then placement
will not only cause the task to be placed left of current, but also
with a smaller deadline than current, which causes immediate
preemption.
[ notably, lag bounds are relative to HZ ]
Mike suggested we stick to picking 'current' for as long as it's
eligible to run, giving it uninterrupted runtime until it reaches
parity with the pack.
Augment Mike's suggestion by only allowing it to exhaust it's initial
request.
One random data point:
echo NO_RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
3,723,554 context-switches ( +- 0.56% )
9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
echo RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
2,556,535 context-switches ( +- 0.51% )
9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )
Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
The rcu_nocb_poll kernel boot parameter is defined via early_param(),
whose parsing functions are invoked from parse_early_param() which
is in turn invoked by setup_arch(), which is very early indeed. It
is invoked so early that the console output timestamps read 0.000000,
in other words, before time begins.
This use of early_param() means that the rcu_nocb_poll kernel boot
parameter cannot usefully be embedded into the kernel image. Yes, you
can embed it, but setup_boot_config() is invoked from start_kernel()
too late for it to be parsed.
But it makes no sense to parse this parameter so early. After all,
it cannot do anything until the rcuog kthreads are created, which is
long after rcu_init() time, let alone setup_boot_config() time.
This commit therefore switches the rcu_nocb_poll kernel boot parameter
from early_param() to __setup(), which allows boot-config parsing of
this parameter, in turn allowing it to be embedded into the kernel image.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The rcu_request_urgent_qs_task() function does a cross-CPU store
to ->rcu_urgent_qs, so this commit therefore marks the load in
__rcu_irq_enter_check_tick() with READ_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
While debugging another issue I noticed that the stack trace contains one
invalid entry at the end:
<idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK:
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x150/0x2c0
=> kcompactd+0x9ca/0xc20
=> kthread+0x2f6/0x3d8
=> __ret_from_fork+0x8a/0xe8
=> 0x6b6b6b6b6b6b6b6b
This is because the code failed to add the one element containing the
number of entries to field_size.
Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
While debugging another issue I noticed that the stack trace output
contains the number of entries on top:
<idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK:
=> 0x10
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x242/0x2c0
=> __wait_for_common+0x434/0x680
=> __wait_rcu_gp+0x198/0x3e0
=> synchronize_rcu+0x112/0x138
=> ring_buffer_reset_online_cpus+0x140/0x2e0
=> tracing_reset_online_cpus+0x15c/0x1d0
=> tracing_set_clock+0x180/0x1d8
=> hist_register_trigger+0x486/0x670
=> event_hist_trigger_parse+0x494/0x1318
=> trigger_process_regex+0x1d4/0x258
=> event_trigger_write+0xb4/0x170
=> vfs_write+0x210/0xad0
=> ksys_write+0x122/0x208
Fix this by skipping the first element. Also replace the pointer
logic with an index variable which is easier to read.
Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The current code uses a lot of casts to access the fields member in struct
synth_trace_events with different sizes. This makes the code hard to
read, and had already introduced an endianness bug. Use a union and struct
instead.
Link: https://lkml.kernel.org/r/20230816154928.4171614-2-svens@linux.ibm.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Trace ring buffer can no longer record anything after executing
following commands at the shell prompt:
# cd /sys/kernel/tracing
# cat tracing_cpumask
fff
# echo 0 > tracing_cpumask
# echo 1 > snapshot
# echo fff > tracing_cpumask
# echo 1 > tracing_on
# echo "hello world" > trace_marker
-bash: echo: write error: Bad file descriptor
The root cause is that:
1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers
in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask());
2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped
with 'tr->max_buffer.buffer', then the 'record_disabled' became 0
(see update_max_tr());
3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1;
Then array_buffer and max_buffer are both unavailable due to value of
'record_disabled' is not 0.
To fix it, enable or disable both array_buffer and max_buffer at the same
time in tracing_set_cpumask().
Link: https://lkml.kernel.org/r/20230805033816.3284594-2-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Cc: <shuah@kernel.org>
Fixes: 71babb2705 ("tracing: change CPU ring buffer state from tracing_cpumask")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
the module is out-of-tree, it saves kernel logs when panic
Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230815020711.2604939-1-yunlong.xing@unisoc.com
The commit 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event") leads
to the following Smatch static checker warning:
kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe()
error: uninitialized symbol 'type'.
That can happens when uname is NULL. So fix it by verifying the uname when we
really need to fill it.
Fixes: 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain
Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com
PowerPC was the only user of these hooks, and has been refactored to no
longer require them. There is no need to keep them around, so remove
them to reduce complexity.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230801011744.153973-8-bgray@linux.ibm.com
This commit adds table_size to register_sysctl in preparation for the
removal of the sentinel elements in the ctl_table arrays (last empty
markers). And though we do *not* remove any sentinels in this commit, we
set things up by either passing the table_size explicitly or using
ARRAY_SIZE on the ctl_table arrays.
We replace the register_syctl function with a macro that will add the
ARRAY_SIZE to the new register_sysctl_sz function. In this way the
callers that are already using an array of ctl_table structs do not
change. For the callers that pass a ctl_table array pointer, we pass the
table_size to register_sysctl_sz instead of the macro.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.
We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.
The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allows us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Fixes following checkpatch.pl issue:
ERROR: trailing statements should be on next line
Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The patch fixes following checkpatch.pl issue:
ERROR: open brace '{' following function definitions go on the next line
ERROR: do not use assignment in if condition
Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Fixes following checkpatch.pl issue:
ERROR: space required before the open parenthesis '('
ERROR: spaces required around that '='
ERROR: spaces required around that '<'
ERROR: spaces required around that '=='
Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
define the .validate() and .update() callbacks in its corresponding
struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
in its own right to ensure that the map is unloaded if an application
crashes. For example, with sched_ext, we want to automatically unload
the host-wide scheduler if the application crashes. We would likely
never support updating elements of a sched_ext struct_ops map, so we'd
have to implement these callbacks showing that they _can't_ support
element updates just to benefit from the basic lifetime management of
struct_ops links.
Let's enable struct_ops maps to work with BPF_F_LINK even if they
haven't defined these callbacks, by assuming that a struct_ops map
element cannot be updated by default.
Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
cgroup_namspace_init() just return 0. Therefore, there is no need to
call it during start_kernel. Just remove it.
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Each CPU-specific and unbound kworker kthread conforms to a particular
naming scheme. However, this does not extend to the rescuer kworker.
At present, a rescuer kworker is simply named according to its
workqueue's name. This can be cryptic.
This patch modifies a rescuer to follow the kworker naming scheme.
The "R" is indicative of a rescuer and after "-" is its workqueue's
name e.g. "kworker/R-ext4-rsv-conver".
tj: Use "R" instead of "r" as the prefix to make it more distinctive and
consistent with how highpri pools are marked.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that torture_random() uses swahw32(), its callers no longer see
not-so-random low-order bits, as these are now swapped up into the upper
16 bits of the torture_random() function's return value. This commit
therefore removes the right-shifting of torture_random() return values.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that torture_random() uses swahw32(), its callers no longer see
not-so-random low-order bits, as these are now swapped up into the upper
16 bits of the torture_random() function's return value. This commit
therefore removes the right-shifting of torture_random() return values.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to gain better race coverage, move the test start/stop
waits in stutter_wait() to torture_hrtimeout_jiffies().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to gain better race coverage, move the CPU-migration timed
waits in torture_shuffle() to torture_hrtimeout_jiffies().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to gain better race coverage, move the CPU-hotplug-related
timed waits in torture_onoff() to torture_hrtimeout_jiffies().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Given that it is expected that more code will use torture_hrtimeout_*(),
including for longer timeouts, make it use TASK_IDLE instead of
TASK_UNINTERRUPTIBLE.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a module parameter that causes the locktorture writer
to run at real-time priority.
To use it:
insmod /lib/modules/torture.ko random_shuffle=1
insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3 writer_fifo=1
^^^^^^^^^^^^^
A predecessor to this patch has been helpful to uncover issues with the
proxy-execution series.
[ paulmck: Remove locktorture-specific code from kernel/torture.c. ]
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: kernel-team@android.com
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[jstultz: Include header change to build, reword commit message]
Signed-off-by: John Stultz <jstultz@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a kthread-creation callback to the
_torture_create_kthread() function, which allows callers of a new
torture_create_kthread_cb() macro to specify a function to be invoked
after the kthread is created but before it is awakened for the first time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <jstultz@google.com>
In kernels built with CONFIG_PROVE_RCU=y (for example, lockdep kernels),
the following sequence of events can occur:
o rcu_init_tasks_generic() is invoked just before init is spawned.
It invokes rcu_spawn_tasks_kthread() and friends.
o rcu_spawn_tasks_kthread() invokes rcu_spawn_tasks_kthread_generic(),
which uses kthread_run() to create the needed kthread.
o Control returns to rcu_init_tasks_generic(), which, because this
is a CONFIG_PROVE_RCU=y kernel, invokes the version of the
rcu_tasks_initiate_self_tests() function that actually does
something, including invoking synchronize_rcu_tasks(), which
in turn invokes synchronize_rcu_tasks_generic().
o synchronize_rcu_tasks_generic() sees that the ->kthread_ptr is
still NULL, because the newly spawned kthread has not yet
started.
o The new kthread starts, preempting synchronize_rcu_tasks_generic()
just after its check. This kthread invokes rcu_tasks_one_gp(),
which acquires ->tasks_gp_mutex, and, seeing no work, blocks
in rcuwait_wait_event(). Note that this step requires either
a preemptible kernel or a fault-injection-style sleep at the
beginning of mutex_lock().
o synchronize_rcu_tasks_generic() resumes and invokes rcu_tasks_one_gp().
o rcu_tasks_one_gp() attempts to acquire ->tasks_gp_mutex, which
is still held by the newly spawned kthread's rcu_tasks_one_gp()
function. Deadlock.
Because the only reason for ->tasks_gp_mutex is to handle pre-kthread
synchronous grace periods, this commit avoids this deadlock by having
rcu_tasks_one_gp() momentarily release ->tasks_gp_mutex while invoking
rcuwait_wait_event(). This allows the call to rcu_tasks_one_gp() from
synchronize_rcu_tasks_generic() proceed.
Note that it is not necessary to release the mutex anywhere else in
rcu_tasks_one_gp() because rcuwait_wait_event() is the only function
that can block indefinitely.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Roy Hopkins <rhopkins@suse.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Roy Hopkins <rhopkins@suse.de>
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org
The sched_rr_timeslice can be reset to default by writing value that is
<= 0. However after reading from this file we always got the last value
written, which is not useful at all.
$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1
Fix this by setting the variable that holds the sysctl file value to the
jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
There is a 10% rounding error in the intial value of the
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.
This was found with LTP test sched_rr_get_interval01:
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
What this test does is to compare the return value from the
sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and
fails if they do not match.
The problem it found is the intial sysctl file value which was computed as:
static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
which works fine as long as MSEC_PER_SEC is multiple of HZ, however it
introduces 10% rounding error for CONFIG_HZ_300:
(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)
(1000 / 300) * (100 * 300 / 1000)
3 * 30 = 90
This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:
(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ
(1000 * (100 * 300 / 1000)) / 300
(1000 * 30) / 300 = 100
Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
If an output buffer size exceeded U16_MAX, the min_t(u16, ...) cast in
copy_data() was causing writes to truncate. This manifested as output
bytes being skipped, seen as %NUL bytes in pstore dumps when the available
record size was larger than 65536. Fix the cast to no longer truncate
the calculation.
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: John Ogness <john.ogness@linutronix.de>
Reported-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/d8bb1ec7-a4c5-43a2-9de0-9643a70b899f@linux.microsoft.com/
Fixes: b6cf8b3f33 ("printk: add lockless ringbuffer")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Reviewed-by: Tyler Hicks (Microsoft) <code@tyhicks.com>
Tested-by: Tyler Hicks (Microsoft) <code@tyhicks.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230811054528.never.165-kees@kernel.org
- Make amd-pstate use device_attributes as expected by the CPU root
kobject (Thomas Weißschuh).
- Restore the previous behavior of resume_store() when hibernation is
not available which is to return the full number of bytes that were
to be written by user space (Vlastimil Babka).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTWgJ8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxGEgP/01+F+nmq0c5QebC3LWw4cyYuepeCJ86
jfIbJR+XHOoiTaQMORHKBEk8xlelL/R65tRhkB/Gq1uFzeIId+xYJJlsW4Lpj7bz
rx/FXOAW8mAyPe/kNitBtcjh4tqEiPBiVzn1tKTA4OOLm0CzOE5v9KML93U2vsOa
Y2I3Jp1N6HHC8oRzbYpQgvB6R2MXX/oRd5fCvrVyMidFFbgYz8sWssRe8eUTGFAj
U/bufaKM7N/qlavikSul1f4T3KpRN+xpu7+I3W6M5/w0EQt663u3TffY1Mo+qllB
uoIM7emwsR6J6WsJyWbHgZEh/fIPmPAhGtsUsam9dN4aoDXfac2Trqrf+xYYbAtS
7mafAyWa+NxQCy/90QxoTrqhj3U4/dIbne4l1ZqgZQ7vyzM/NA4Gi0VBDEpt1BZU
q6uvhS4PXvkRm/PezQSQCSMaP66F0erMCHxKTXTN1wYNob0AKjV6l1bmG5LdPcIh
Nsk+CDkAVGmbqfDrtek9FfJZWgH3/lPDg0oVVMi9WiE8CdhYfKoB+Eh/MFVGiiDg
69cogAHqTUeuB46NPNedeOacGc6F0+mnAwkgNkClCTCHZJ0QSDlh2yVR003ZhnUj
sHx6jf6rYodW+nBQydjUzVm+twH47tltY0ibzN3ZIXiMM0UlALHBF+Oj4hOtGxUa
jiiqkLyB/9kH
=0RaA
-----END PGP SIGNATURE-----
Merge tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an amd-pstate cpufreq driver issues and recently introduced
hibernation-related breakage.
Specifics:
- Make amd-pstate use device_attributes as expected by the CPU root
kobject (Thomas Weißschuh)
- Restore the previous behavior of resume_store() when hibernation is
not available which is to return the full number of bytes that were
to be written by user space (Vlastimil Babka)"
* tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: amd-pstate: fix global sysfs attribute type
PM: hibernate: fix resume_store() return value when hibernation not available
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRx8QAKCRBar9k/UBDW
46MBAQC3YDFsEfPzX4P7ZnlM5Lf1NynjNbso5bYW0TF/dp/Y+gD+M8wdM5Vj2Mb0
Zr56TnwCJei0kGBemiel4sStt3e4qwY=
=+0u+
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-09
We've added 19 non-merge commits during the last 6 day(s) which contain
a total of 25 files changed, 369 insertions(+), 141 deletions(-).
The main changes are:
1) Fix array-index-out-of-bounds access when detaching from an
already empty mprog entry from Daniel Borkmann.
2) Adjust bpf selftest because of a recent llvm change
related to the cpu-v4 ISA from Eduard Zingerman.
3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa.
4) Fix a KASAN splat due to the kernel incorrectly accepted
an invalid program using the recent cpu-v4 instruction from
Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
bpf: btf: Remove two unused function declarations
bpf: lru: Remove unused declaration bpf_lru_promote()
selftests/bpf: relax expected log messages to allow emitting BPF_ST
selftests/bpf: remove duplicated functions
bpf, docs: Fix small typo and define semantics of sign extension
selftests/bpf: Add bpf_get_func_ip test for uprobe inside function
selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry
bpf: Add support for bpf_get_func_ip helper for uprobe program
selftests/bpf: Add a movsx selftest for sign-extension of R10
bpf: Fix an incorrect verification success with movsx insn
bpf, docs: Formalize type notation and function semantics in ISA standard
bpf: change bpf_alu_sign_string and bpf_movsx_string to static
libbpf: Use local includes inside the library
bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
bpf: fix inconsistent return types of bpf_xdp_copy_buf().
selftests/bpf: fix the incorrect verification of port numbers.
selftests/bpf: Add test for detachment on empty mprog entry
bpf: Fix mprog detachment for empty mprog entry
bpf: bpf_struct_ops: Remove unnecessary initial values of variables
====================
Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pick up the EEVDF work into the main branch - it's looking good so far.
Conflicts:
kernel/sched/features.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Three LSMs register the implementations for the "capget" hook: AppArmor,
SELinux, and the normal capability code. Looking at the function
implementations we may observe that the first parameter "target" is not
changing.
Mark the first argument "target" of LSM hook security_capget() as
"const" since it will not be changing in the LSM hook.
cap_capget() LSM hook declaration exceeds the 80 characters per line
limit. Split the function declaration to multiple lines to decrease the
line length.
Signed-off-by: Khadija Kamran <kamrankhadijadj@gmail.com>
Acked-by: John Johansen <john.johansen@canonical.com>
[PM: align the cap_capget() declaration, spelling fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Tracefs or debugfs maybe cause hundreds to thousands of PATH records,
too many PATH records maybe cause soft lockup.
For example:
1. CONFIG_KASAN=y && CONFIG_PREEMPTION=n
2. auditctl -a exit,always -S open -k key
3. sysctl -w kernel.watchdog_thresh=5
4. mkdir /sys/kernel/debug/tracing/instances/test
There may be a soft lockup as follows:
watchdog: BUG: soft lockup - CPU#45 stuck for 7s! [mkdir:15498]
Kernel panic - not syncing: softlockup: hung tasks
Call trace:
dump_backtrace+0x0/0x30c
show_stack+0x20/0x30
dump_stack+0x11c/0x174
panic+0x27c/0x494
watchdog_timer_fn+0x2bc/0x390
__run_hrtimer+0x148/0x4fc
__hrtimer_run_queues+0x154/0x210
hrtimer_interrupt+0x2c4/0x760
arch_timer_handler_phys+0x48/0x60
handle_percpu_devid_irq+0xe0/0x340
__handle_domain_irq+0xbc/0x130
gic_handle_irq+0x78/0x460
el1_irq+0xb8/0x140
__audit_inode_child+0x240/0x7bc
tracefs_create_file+0x1b8/0x2a0
trace_create_file+0x18/0x50
event_create_dir+0x204/0x30c
__trace_add_new_event+0xac/0x100
event_trace_add_tracer+0xa0/0x130
trace_array_create_dir+0x60/0x140
trace_array_create+0x1e0/0x370
instance_mkdir+0x90/0xd0
tracefs_syscall_mkdir+0x68/0xa0
vfs_mkdir+0x21c/0x34c
do_mkdirat+0x1b4/0x1d4
__arm64_sys_mkdirat+0x4c/0x60
el0_svc_common.constprop.0+0xa8/0x240
do_el0_svc+0x8c/0xc0
el0_svc+0x20/0x30
el0_sync_handler+0xb0/0xb4
el0_sync+0x160/0x180
Therefore, we add cond_resched() to __audit_inode_child() to fix it.
Fixes: 5195d8e217 ("audit: dynamically allocate audit_names when not enough space is in the names array")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Use a simple logical shift and increment to calculate the number of slots
taken by the DMA segment boundary.
At least GCC-13 is not able to optimize the expression, producing this
horrible assembly code on x86:
cmpq $-1, %rcx
je .L364
addq $2048, %rcx
shrq $11, %rcx
movq %rcx, %r13
.L331:
// rest of the function here...
// after function epilogue and return:
.L364:
movabsq $9007199254740992, %r13
jmp .L331
After the optimization, the code looks more reasonable:
shrq $11, %r11
leaq 1(%r11), %rbx
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Move the comment down in front of the loop that actually sets the list
member of struct io_tlb_slot to zero.
Fixes: 26a7e09478 ("swiotlb: refactor swiotlb_tbl_map_single")
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
While workqueue.default_affinity_scope is writable, it only affects
workqueues which are created afterwards and isn't very useful. Instead,
let's introduce explicit "default" scope and update the effective scope
dynamically when workqueue.default_affinity_scope is changed.
Signed-off-by: Tejun Heo <tj@kernel.org>
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define the
CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools
each serving one L3 cache domains.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now an a lot more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. ie. work items start executing within its affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
workqueue_attrs has two uses:
* to specify the required unouned workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contains CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumaks
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Checking need_more_worker() and calling wake_up_worker() is a repeated
pattern. Let's add kick_pool(), which checks need_more_worker() and
open-code wake_up_worker(), and replace wake_up_worker() uses. The following
conversions aren't one-to-one:
* __queue_work() was using __need_more_work() because it knows that
pool->worklist isn't empty. Switching to kick_pool() adds an extra
list_empty() test.
* create_worker() always needs to wake up the newly minted worker whether
there's more work to do or not to avoid triggering hung task check on the
new task. Keep the current wake_up_process() and still add kick_pool().
This may lead to an extra wakeup which isn't harmful.
* pwq_adjust_max_active() was explicitly checking whether it needs to wake
up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
a worker when necessary, this explicit check is no longer necessary and
dropped.
* unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
a need_more_worker() test. This avoids spurious wakeups and shouldn't
break anything.
wake_up_worker() is dropped as kick_pool() replaces all its users. After
this patch, all paths that wakes up a non-rescuer worker to initiate work
item execution use kick_pool(). This will enable future changes to improve
locality.
Signed-off-by: Tejun Heo <tj@kernel.org>
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_schedule_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.
This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to excute the work once asssigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.
This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collision
handling is handled earlier, process_one_work() no longer needs to worry
about them.
After the this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.
This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.
Also add module parameter "workqueue.default_affinity_scope" to override the
default scope and "affinity_scope" sysfs file to configure it per workqueue.
wq_dump.py and documentations are updated accordingly.
This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.
Many modern machines have multiple L3 caches often while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unncessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.
While dependent on the specifics of workload, there's usually a noticeable
penalty in crossing L3 boundaries, so let's default to CACHE. This issue
will be further addressed and documented with examples in future patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
While wq_pod_type[] can now group CPUs in any aribitrary way, WQ_AFFN_NUM
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type() which takes a callback to
determine whether two CPUs should share a pod as an argument.
init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa() which tests whether the CPU belongs to the same NUMA node.
This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
are tracked separately in pod_type->pod_node[]. This makes adding new
affinty types pretty easy.
Signed-off-by: Tejun Heo <tj@kernel.org>
While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:
* workqueue_attrs->affn_scope is added. Each enum represents the type of
boundaries that define the pods. There are currently two scopes -
WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
- one pod per NUMA node. The latter defines one global pod across the
whole system.
* struct wq_pod_type is added which describes how pods are configured for
each affnity scope. For each pod, it lists the member CPUs and the
preferred NUMA node for memory allocations. The reverse mapping from CPU
to pod is also available.
* wq_pod_enabled is dropped. Pod is now always enabled. The previously
disabled behavior is now implemented through WQ_AFFN_SYSTEM.
* get_unbound_pool() wants to determine the NUMA node to allocate memory
from for the new pool. The variables are renamed from node to pod but the
logic still assumes they're one and the same. Clearly distinguish them -
walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
NUMA node.
* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
node. Take @cpu instead and determine the cpumask to use from the pod_type
matching @attrs.
* apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of
NULL so that it can indicate -EINVAL on invalid affinity scopes.
This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.
v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
WQ_AFFN_NR_TYPES which indicates that the function is called with a
worker_pool's attrs instead of a workqueue's.
Signed-off-by: Tejun Heo <tj@kernel.org>
workqueue_attrs can be used for both workqueues and worker_pools. However,
some fields, currently only ->ordered, only apply to workqueues and should
be cleared to the default / invalid values.
Currently, an unbound workqueue explicitly clears attrs->ordered in
get_unbound_pool() after copying the source workqueue attrs, while per-cpu
workqueues rely on the fact that zeroing on allocation gives us the desired
default value for pool->attrs->ordered.
This is fragile. Let's add wqattrs_clear_for_pool() which clears
attrs->ordered and is called from both init_worker_pool() and
get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
In get_unbound_pool(), pool->node initialization is moved upwards for
readability. This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
For an unbound pool, multiple cpumasks are involved.
U: The user-specified cpumask (may be filtered with cpu_possible_mask).
A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
leaves no CPU, wq_unbound_cpumask is used.
P: Per-pod subsets of #A.
wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.
This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.
This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool_attrs.
This shouldn't cause any behavior changes in the current code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.
Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().
As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.
This patch changes the initialization sequence but the end result should be
the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_pod_init() is called from workqueue_init() and responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().
The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.
Just a relocation. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.
While "pod" isn't a very common term, it short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.
* wq_numa_possible_cpumask -> wq_pod_cpus
* wq_numa_enabled -> wq_pod_enabled
* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
* workqueue_select_cpu_near -> select_numa_node_cpu
This rename is different from others. The function is only used by
queue_work_node() and specifically tries to find a CPU in the specified
NUMA node. As workqueue affinity will become more flexible and untied from
NUMA, this function's name should specifically describe that it's for
NUMA.
* wq_calc_node_cpumask -> wq_calc_pod_cpumask
* wq_update_unbound_numa -> wq_update_pod
* wq_numa_init -> wq_pod_init
* node -> pod in local variables
Only renames. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.
Just a rename. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be assocaited with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
When a CPU went online or offline, wq_update_unbound_numa() was called only
on the CPU which was going up or down. This works fine because all CPUs on
the same NUMA node share the same pool_workqueue slot - one CPU updating it
updates it for everyone in the node.
However, future changes will make each CPU use a separate pool_workqueue
even when they're sharing the same worker_pool, which requires updating
pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
on hotplug.
To accommodate the planned changes, this patch updates
workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
all CPUs sharing the same NUMA node as the CPU going up or down. In the
current code, the second+ calls would be noops and there shouldn't be any
behavior changes.
* As wq_update_unbound_numa() is now called on multiple CPUs per each
hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
is added. The former indicates the CPU being hot[un]plugged and the latter
the CPU whose pool_workqueue is being updated.
* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
with the new @hotplug_cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.
Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitiable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.
In preparation, this patch makes per-cpu pwq's to be allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.
pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu work items.
Signed-off-by: Tejun Heo <tj@kernel.org>
pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus has risk of causing a A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.
While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too making the current approach
unusable.
Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
wq->cpu_pwqs is a percpu variable carraying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or queueing a flush work item,
there's no reason to try to wake up a worker.
This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().
While at it:
* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
pool.
* Every caller of insert_work() calls debug_work_activate(). Consolidate the
invocations into insert_work().
* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
to better accommodate future changes.
This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.
v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
* Drop the trivial optimization in worker_thread() where it bypasses calling
process_scheduled_works() if the first work item isn't linked. This is a
mostly pointless micro optimization and gets in the way of improving the
work processing path.
* Consolidate pool->watchdog_ts updates in the two callers into
process_scheduled_works().
Signed-off-by: Tejun Heo <tj@kernel.org>
worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.
Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().
This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assinging work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.
The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueue execution locality improvement patchset is about to
applied which will cause merge conflicts with changes in for-6.5-fixes.
Let's avoid future merge conflict by pulling in for-6.5-fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Adding support for bpf_get_func_ip helper for uprobe program to return
probed address for both uprobe and return uprobe.
We discussed this in [1] and agreed that uprobe can have special use
of bpf_get_func_ip helper that differs from kprobe.
The kprobe bpf_get_func_ip returns:
- address of the function if probe is attach on function entry
for both kprobe and return kprobe
- 0 if the probe is not attach on function entry
The uprobe bpf_get_func_ip returns:
- address of the probe for both uprobe and return uprobe
The reason for this semantic change is that kernel can't really tell
if the probe user space address is function entry.
The uprobe program is actually kprobe type program attached as uprobe.
One of the consequences of this design is that uprobes do not have its
own set of helpers, but share them with kprobes.
As we need different functionality for bpf_get_func_ip helper for uprobe,
I'm adding the bool value to the bpf_trace_run_ctx, so the helper can
detect that it's executed in uprobe context and call specific code.
The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which
is currently used only for executing bpf programs in uprobe.
Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe
to address that it's only used for uprobes and that it sets the
run_ctx.is_uprobe as suggested by Yafang Shao.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
[1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
syzbot reports a verifier bug which triggers a runtime panic.
The test bpf program is:
0: (62) *(u32 *)(r10 -8) = 553656332
1: (bf) r1 = (s16)r10
2: (07) r1 += -8
3: (b7) r2 = 3
4: (bd) if r2 <= r1 goto pc+0
5: (85) call bpf_trace_printk#-138320
6: (b7) r0 = 0
7: (95) exit
At insn 1, the current implementation keeps 'r1' as a frame pointer,
which caused later bpf_trace_printk helper call crash since frame
pointer address is not valid any more. Note that at insn 4,
the 'pointer vs. scalar' comparison is allowed for privileged
prog run.
To fix the problem with above insn 1, the fix in the patch adopts
similar pattern to existing 'R1 = (u32) R2' handling. For unprivileged
prog run, verification will fail with 'R<num> sign-extension part of pointer'.
For privileged prog run, the dst_reg 'r1' will be marked as
an unknown scalar, so later 'bpf_trace_pointk' helper will complain
since it expected certain pointers.
Reported-by: syzbot+d61b595e9205573133b3@syzkaller.appspotmail.com
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230807175721.671696-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Two commits:
* The recently added cpu_intensive auto detection and warning mechanism was
spuriously triggered on slow CPUs. While not causing serious issues, it's
still a nuisance and can cause unintended concurrency management
behaviors. Relax the threshold on machines with lower BogoMIPS. While
BogoMIPS is not an accurate measure of performance by most measures, we
don't have to be accurate and it has rough but strong enough correlation.
* A correction in Kconfig help text.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZNFMTQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGb+4AQCniWx3rwWWmLgviPR0AfYWbcQ8/P/qGh++fmsR
tEF3sQD/bLdeWcVa1pSzXjhGtRVGsTis6oOhk81A0zIZlx0v2Qg=
=sThu
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- The recently added cpu_intensive auto detection and warning mechanism
was spuriously triggered on slow CPUs.
While not causing serious issues, it's still a nuisance and can cause
unintended concurrency management behaviors.
Relax the threshold on machines with lower BogoMIPS. While BogoMIPS
is not an accurate measure of performance by most measures, we don't
have to be accurate and it has rough but strong enough correlation.
- A correction in Kconfig help text
* tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
workqueue: Fix cpu_intensive_thresh_us name in help text
The member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.
Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.
So we add the subtree_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF/drgn, etc.
And we are still trying to determine how to expose it in the
cgroupfs interface.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use LIST_HEAD() to initialize cull_list instead of open-coding it.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no need to use '<=' when knowing 'l->list[mid] != pid' already.
No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kexec_mutex is replaced by an atomic variable
in 05c6257433 (panic, kexec: make __crash_kexec() NMI safe).
But there are still two comments that referenced kexec_mutex,
replace them by kexec_lock.
Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
On a laptop with hibernation set up but not actively used, and with
secure boot and lockdown enabled kernel, 6.5-rc1 gets stuck on boot with
the following repeated messages:
A start job is running for Resume from hibernation using device /dev/system/swap (24s / no limit)
lockdown_is_locked_down: 25311154 callbacks suppressed
Lockdown: systemd-hiberna: hibernation is restricted; see man kernel_lockdown.7
...
Checking the resume code leads to commit cc89c63e2f ("PM: hibernate:
move finding the resume device out of software_resume") which
inadvertently changed the return value from resume_store() to 0 when
!hibernation_available(). This apparently translates to userspace
write() returning 0 as in number of bytes written, and userspace looping
indefinitely in the attempt to write the intended value.
Fix this by returning the full number of bytes that were to be written,
as that's what was done before the commit.
Fixes: cc89c63e2f ("PM: hibernate: move finding the resume device out of software_resume")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Verify if the pointer obtained from bpf_xdp_pointer() is either an error or
NULL before returning it.
The function bpf_dynptr_slice() mistakenly returned an ERR_PTR. Instead of
solely checking for NULL, it should also verify if the pointer returned by
bpf_xdp_pointer() is an error or NULL.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/d1360219-85c3-4a03-9449-253ea905f9d1@moroto.mountain/
Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803231206.1060485-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
syzbot reported an UBSAN array-index-out-of-bounds access in bpf_mprog_read()
upon bpf_mprog_detach(). While it did not have a reproducer, I was able to
manually reproduce through an empty mprog entry which just has miniq present.
The latter is important given otherwise we get an ENOENT error as tcx detaches
the whole mprog entry. The index 4294967295 was triggered via NULL dtuple.prog
which then attempts to detach from the back. bpf_mprog_fetch() in this case
did hit the idx == total and therefore tried to grab the entry at idx -1.
Fix it by adding an explicit bpf_mprog_total() check in bpf_mprog_detach() and
bail out early with ENOENT.
Fixes: 053c8e1f23 ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+0c06ba0f831fe07a8f27@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230804131112.11012-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
err and tlinks is assigned first, so it does not need to initialize the
assignment.
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230804175929.2867-1-kunyu@nfschina.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Since commit e76ecaeef6 ("cgroup: use cgroup_kn_lock_live() in other
cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs
methods. Update corresponding comment.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
=/kVE
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-03
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).
The main changes are:
1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
Daniel Borkmann
2) Support new insns from cpu v4 from Yonghong Song
3) Non-atomically allocate freelist during prefill from YiFei Zhu
4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu
5) Add tracepoint to xdp attaching failure from Leon Hwang
6) struct netdev_rx_queue and xdp.h reshuffling to reduce
rebuild time from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
net: invert the netdevice.h vs xdp.h dependency
net: move struct netdev_rx_queue out of netdevice.h
eth: add missing xdp.h includes in drivers
selftests/bpf: Add testcase for xdp attaching failure tracepoint
bpf, xdp: Add tracepoint to xdp attaching failure
selftests/bpf: fix static assert compilation issue for test_cls_*.c
bpf: fix bpf_probe_read_kernel prototype mismatch
riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
libbpf: fix typos in Makefile
tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
netfilter: bpf: Only define get_proto_defrag_hook() if necessary
bpf: Fix an array-index-out-of-bounds issue in disasm.c
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
docs/bpf: Fix malformed documentation
bpf: selftests: Add defrag selftests
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Support not connecting client socket
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
...
====================
Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nothing scary here. Feels like the first wave of regressions
from v6.5 is addressed - one outstanding fix still to come
in TLS for the sendpage rework.
Current release - regressions:
- udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
- dsa: fix older DSA drivers using phylink
Previous releases - regressions:
- gro: fix misuse of CB in udp socket lookup
- mlx5: unregister devlink params in case interface is down
- Revert "wifi: ath11k: Enable threaded NAPI"
Previous releases - always broken:
- sched: cls_u32: fix match key mis-addressing
- sched: bind logic fixes for cls_fw, cls_u32 and cls_route
- add bound checks to a number of places which hand-parse netlink
- bpf: disable preemption in perf_event_output helpers code
- qed: fix scheduling in a tasklet while getting stats
- avoid using APIs which are not hardirq-safe in couple of drivers,
when we may be in a hard IRQ (netconsole)
- wifi: cfg80211: fix return value in scan logic, avoid page
allocator warning
- wifi: mt76: mt7615: do not advertise 5 GHz on first PHY
of MT7615D (DBDC)
Misc:
- drop handful of inactive maintainers, put some new in place
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S
Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje
8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu
gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH
ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6
ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH
TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+
cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK
QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ
apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC
5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip
ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y=
=HYkl
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and wireless.
Nothing scary here. Feels like the first wave of regressions from v6.5
is addressed - one outstanding fix still to come in TLS for the
sendpage rework.
Current release - regressions:
- udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
- dsa: fix older DSA drivers using phylink
Previous releases - regressions:
- gro: fix misuse of CB in udp socket lookup
- mlx5: unregister devlink params in case interface is down
- Revert "wifi: ath11k: Enable threaded NAPI"
Previous releases - always broken:
- sched: cls_u32: fix match key mis-addressing
- sched: bind logic fixes for cls_fw, cls_u32 and cls_route
- add bound checks to a number of places which hand-parse netlink
- bpf: disable preemption in perf_event_output helpers code
- qed: fix scheduling in a tasklet while getting stats
- avoid using APIs which are not hardirq-safe in couple of drivers,
when we may be in a hard IRQ (netconsole)
- wifi: cfg80211: fix return value in scan logic, avoid page
allocator warning
- wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
(DBDC)
Misc:
- drop handful of inactive maintainers, put some new in place"
* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
MAINTAINERS: update TUN/TAP maintainers
test/vsock: remove vsock_perf executable on `make clean`
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
tcp_metrics: annotate data-races around tm->tcpm_net
tcp_metrics: annotate data-races around tm->tcpm_vals[]
tcp_metrics: annotate data-races around tm->tcpm_lock
tcp_metrics: annotate data-races around tm->tcpm_stamp
tcp_metrics: fix addr_same() helper
prestera: fix fallback to previous version on same major version
udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
net/mlx5e: Set proper IPsec source port in L4 selector
net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
net/mlx5: fs_core: Make find_closest_ft more generic
wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
vxlan: Fix nexthop hash size
ip6mr: Fix skb_under_panic in ip6mr_cache_report()
s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
net: tap_open(): set sk_uid from current_fsuid()
net: tun_chr_open(): set sk_uid from current_fsuid()
net: dcb: choose correct policy to parse DCB_ATTR_BCN
...
module_init_layout_section() choses whether the core module loader
considers a section as init or not. This affects the placement of the
exit section when module unloading is disabled. This code will never run,
so it can be free()d once the module has been initialised.
arm and arm64 need to count the number of PLTs they need before applying
relocations based on the section name. The init PLTs are stored separately
so they can be free()d. arm and arm64 both use within_module_init() to
decide which list of PLTs to use when applying the relocation.
Because within_module_init()'s behaviour changes when module unloading
is disabled, both architecture would need to take this into account when
counting the PLTs.
Today neither architecture does this, meaning when module unloading is
disabled there are insufficient PLTs in the init section to load some
modules, resulting in warnings:
| WARNING: CPU: 2 PID: 51 at arch/arm64/kernel/module-plts.c:99 module_emit_plt_entry+0x184/0x1cc
| Modules linked in: crct10dif_common
| CPU: 2 PID: 51 Comm: modprobe Not tainted 6.5.0-rc4-yocto-standard-dirty #15208
| Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
| pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : module_emit_plt_entry+0x184/0x1cc
| lr : module_emit_plt_entry+0x94/0x1cc
| sp : ffffffc0803bba60
[...]
| Call trace:
| module_emit_plt_entry+0x184/0x1cc
| apply_relocate_add+0x2bc/0x8e4
| load_module+0xe34/0x1bd4
| init_module_from_file+0x84/0xc0
| __arm64_sys_finit_module+0x1b8/0x27c
| invoke_syscall.constprop.0+0x5c/0x104
| do_el0_svc+0x58/0x160
| el0_svc+0x38/0x110
| el0t_64_sync_handler+0xc0/0xc4
| el0t_64_sync+0x190/0x194
Instead of duplicating module_init_layout_section()s logic, expose it.
Reported-by: Adam Johnston <adam.johnston@arm.com>
Fixes: 055f23b74b ("module: check for exit sections in layout_sections() instead of module_init_section()")
Cc: stable@vger.kernel.org
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW
48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf
61d6hoiXj/sIibgMQT/ihODgeJ4wfQE=
=u7qn
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-03
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).
The main changes are:
1) Disable preemption in perf_event_output helpers code,
from Jiri Olsa
2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
from Lin Ma
3) Multiple warning splat fixes in cpumap from Hou Tao
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, cpumap: Handle skb as well when clean up ptr_ring
bpf, cpumap: Make sure kthread is running before map update returns
bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
bpf: Disable preemption in bpf_event_output
bpf: Disable preemption in bpf_perf_event_output
====================
Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig builds size when
xdp.h is touched from 5947 to 662 objects.
Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.
The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since its embedded in struct netdevice.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
__pv_queued_spin_unlock_slowpath() is defined in a header file as
a global function, and designed to be called from inline asm, but
there is no prototype visible in the definition:
kernel/locking/qspinlock_paravirt.h:493:1: error: no previous \
prototype for '__pv_queued_spin_unlock_slowpath' [-Werror=missing-prototypes]
Add this to the x86 header that contains the inline asm calling it,
and ensure this gets included before the definition, rather than
after it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230803082619.1369127-8-arnd@kernel.org
When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.
Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.
Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.
The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.
First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
restore tokens being written into the middle of pre-used shadow stacks.
It is ideal to prevent restore tokens being added at arbitrary
locations, so the check was to make sure the shadow stack had never been
written to.
3. It stood out from the rest of the madvise flags, as more of direct
action than a hint at future desired behavior.
So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
setup its own shadow stacks using the WRSS instruction. Towards this
provide a flag so that stacks can be optionally setup securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.
The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.
Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:
kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]
On 32-bit architectures, the local prototype
u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
passes arguments in other registers as the one in bpf_trace.c
BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
const void *, unsafe_ptr)
which uses 64-bit arguments in pairs of registers.
As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since commit 8f36aaec9c ("cgroup: Use rcu_work instead of explicit rcu
and work item"), css_free_work_fn has been renamed to css_free_rwork_fn.
Update corresponding comment.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add kernel-doc of param @rotor to fix warnings:
kernel/cgroup/cpuset.c:4162: warning: Function parameter or member
'rotor' not described in 'cpuset_spread_node'
kernel/cgroup/cpuset.c:3771: warning: Function parameter or member
'work' not described in 'cpuset_hotplug_workfn'
Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Convert the only printk() to use pr_*() helper. No functional change.
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It has recently come to my attention that nvidia is circumventing the
protection added in 262e6ae708 ("modules: inherit
TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
modules into an allegedly GPL licensed module and then rexporting them.
Given that symbol_get was only ever intended for tightly cooperating
modules using very internal symbols it is logical to restrict it to
being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
Circumvention of Access Controls law suites.
All symbols except for four used through symbol_get were already exported
as EXPORT_SYMBOL_GPL, and the remaining four ones were switched over in
the preparation patches.
Fixes: 262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.
Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.
Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.
Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.
In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.
Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
bpf tracepoint program uses struct trace_event_raw_sys_enter as
argument where trace_entry is the first field. Use the same instead
of unsigned long long since if it's amended (for example by RT
patch) it accesses data with wrong offset.
Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801075222.7717-1-ykaliuta@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Skip searching the software IO TLB if a device has never used it, making
sure these devices are not affected by the introduction of multiple IO TLB
memory pools.
Additional memory barrier is required to ensure that the new value of the
flag is visible to other CPUs after mapping a new bounce buffer. For
efficiency, the flag check should be inlined, and then the memory barrier
must be moved to is_swiotlb_buffer(). However, it can replace the existing
barrier in swiotlb_find_pool(), because all callers use is_swiotlb_buffer()
first to verify that the buffer address belongs to the software IO TLB.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When swiotlb_find_slots() cannot find suitable slots, schedule the
allocation of a new memory pool. It is not possible to allocate the pool
immediately, because this code may run in interrupt context, which is not
suitable for large memory allocations. This means that the memory pool will
be available too late for the currently requested mapping, but the stress
on the software IO TLB allocator is likely to continue, and subsequent
allocations will benefit from the additional pool eventually.
Keep all memory pools for an allocator in an RCU list to avoid locking on
the read side. For modifications, add a new spinlock to struct io_tlb_mem.
The spinlock also protects updates to the total number of slabs (nslabs in
struct io_tlb_mem), but not reads of the value. Readers may therefore
encounter a stale value, but this is not an issue:
- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
value. This is ensured by the existence of the default memory pool,
allocated at boot.
- The exact value is used only for non-critical purposes (debugfs, kernel
messages).
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The value returned by default_swiotlb_limit() should be constant, because
it is used to decide whether DMA can be used. To allow allocating memory
pools on the fly, use the maximum possible physical address rather than the
highest address used by the default pool.
For swiotlb_init_remap(), this is either an arch-specific limit used by
memblock_alloc_low(), or the highest directly mapped physical address if
the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the
highest address is determined by the GFP flags.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Try to allocate a transient memory pool if no suitable slots can be found
and the respective SWIOTLB is allowed to grow. The transient pool is just
enough big for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.
Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.
Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after a RCU quiescent state.
Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_alloc_pool() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.
The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.
Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a config option (CONFIG_SWIOTLB_DYNAMIC) to enable or disable dynamic
allocation of additional bounce buffers.
If this option is set, mark the default SWIOTLB as able to grow and
restricted DMA pools as unable.
However, if the address of the default memory pool is explicitly queried,
make the default SWIOTLB also unable to grow. This is currently used to set
up PCI BAR movable regions on some Octeon MIPS boards which may not be able
to use a SWIOTLB pool elsewhere in physical memory. See octeon_pci_setup()
for more details.
If a remap function is specified, it must be also called on any dynamically
allocated pools, but there are some issues:
- The remap function may block, so it should not be called from an atomic
context.
- There is no corresponding unremap() function if the memory pool is
freed.
- The only in-tree implementation (xen_swiotlb_fixup) requires that the
number of slots in the memory pool is a multiple of SWIOTLB_SEGSIZE.
Keep it simple for now and disable growing the SWIOTLB if a remap function
was specified.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Carve out memory pool specific fields from struct io_tlb_mem. The original
struct now contains shared data for the whole allocator, while the new
struct io_tlb_pool contains data that is specific to one memory pool of
(potentially) many.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add some kernel-doc comments and move the existing documentation of struct
io_tlb_slot to its correct location. The latter was forgotten in commit
942a8186eb ("swiotlb: move struct io_tlb_slot to swiotlb.c").
Use the opportunity to give swiotlb_do_find_slots() a more descriptive name
and make it clear how it differs from swiotlb_find_slots().
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
SWIOTLB implementation details should not be exposed to the rest of the
kernel. This will allow to make changes to the implementation without
modifying non-swiotlb code.
To avoid breaking existing users, provide helper functions for the few
required fields.
As a bonus, using a helper function to initialize struct device allows to
get rid of an #ifdef in driver core.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If swiotlb is allocated, immediately return 0, so callers do not have to
check io_tlb_default_mem.nslabs explicitly.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(),
so just remove dtab from bpf_dtab_netdev.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Since commit cdfafe98ca ("xdp: Make cpumap flush_list common for all
map instances"), cmap is no longer used, so just remove it.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
syzbot reported an array-index-out-of-bounds when printing out bpf
insns. Further investigation shows the insn is illegal but
is printed out due to log level 1 or 2 before actual insn verification
in do_check().
This particular illegal insn is a MOVSX insn with offset value 2.
The legal offset value for MOVSX should be 8, 16 and 32.
The disasm sign-extension-size array index is calculated as
(insn->off / 8) - 1
and offset value 2 gives an out-of-bound index -1.
Tighten the checking for MOVSX insn in disasm.c to avoid
array-index-out-of-bounds issue.
Reported-by: syzbot+3758842a6c01012aa73b@syzkaller.appspotmail.com
Fixes: f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230731204534.1975311-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, cblist_init_generic() holds a raw spinlock when invoking
INIT_WORK(). This fails in kernels built with CONFIG_DEBUG_OBJECTS=y
due to memory allocation being forbidden while holding a raw spinlock.
But the only reason for holding the raw spinlock is to synchronize
with early boot calls to call_rcu_tasks(), call_rcu_tasks_rude, and,
last but not least, call_rcu_tasks_trace(). These calls also invoke
cblist_init_generic() in order to support early boot queueing of
callbacks.
Except that there are no early boot calls to either of these three
functions, and the BPF guys confirm that they have no plans to add any
such calls.
This commit therefore removes the synchronization and adds a
WARN_ON_ONCE() to catch the case of now-prohibited early boot RCU Tasks
callback queueing.
If early boot queueing is needed, an "initialized" flag may be added to
the rcu_tasks structure. Then queueing a callback before this flag is set
would initialize the callback list (if needed) and queue the callback.
The decision as to where to queue the callback given the possibility of
non-zero boot CPUs is left as an exercise for the reader.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The following warning was reported when running xdp_redirect_cpu with
both skb-mode and stress-mode enabled:
------------[ cut here ]------------
Incorrect XDP memory type (-2128176192) usage
WARNING: CPU: 7 PID: 1442 at net/core/xdp.c:405
Modules linked in:
CPU: 7 PID: 1442 Comm: kworker/7:0 Tainted: G 6.5.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Workqueue: events __cpu_map_entry_free
RIP: 0010:__xdp_return+0x1e4/0x4a0
......
Call Trace:
<TASK>
? show_regs+0x65/0x70
? __warn+0xa5/0x240
? __xdp_return+0x1e4/0x4a0
......
xdp_return_frame+0x4d/0x150
__cpu_map_entry_free+0xf9/0x230
process_one_work+0x6b0/0xb80
worker_thread+0x96/0x720
kthread+0x1a5/0x1f0
ret_from_fork+0x3a/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
The reason for the warning is twofold. One is due to the kthread
cpu_map_kthread_run() is stopped prematurely. Another one is
__cpu_map_ring_cleanup() doesn't handle skb mode and treats skbs in
ptr_ring as XDP frames.
Prematurely-stopped kthread will be fixed by the preceding patch and
ptr_ring will be empty when __cpu_map_ring_cleanup() is called. But
as the comments in __cpu_map_ring_cleanup() said, handling and freeing
skbs in ptr_ring as well to "catch any broken behaviour gracefully".
Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The following warning was reported when running stress-mode enabled
xdp_redirect_cpu with some RT threads:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 65 at kernel/bpf/cpumap.c:135
CPU: 4 PID: 65 Comm: kworker/4:1 Not tainted 6.5.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Workqueue: events cpu_map_kthread_stop
RIP: 0010:put_cpu_map_entry+0xda/0x220
......
Call Trace:
<TASK>
? show_regs+0x65/0x70
? __warn+0xa5/0x240
......
? put_cpu_map_entry+0xda/0x220
cpu_map_kthread_stop+0x41/0x60
process_one_work+0x6b0/0xb80
worker_thread+0x96/0x720
kthread+0x1a5/0x1f0
ret_from_fork+0x3a/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
The root cause is the same as commit 4369016497 ("bpf: cpumap: Fix memory
leak in cpu_map_update_elem"). The kthread is stopped prematurely by
kthread_stop() in cpu_map_kthread_stop(), and kthread() doesn't call
cpu_map_kthread_run() at all but XDP program has already queued some
frames or skbs into ptr_ring. So when __cpu_map_ring_cleanup() checks
the ptr_ring, it will find it was not emptied and report a warning.
An alternative fix is to use __cpu_map_ring_cleanup() to drop these
pending frames or skbs when kthread_stop() returns -EINTR, but it may
confuse the user, because these frames or skbs have been handled
correctly by XDP program. So instead of dropping these frames or skbs,
just make sure the per-cpu kthread is running before
__cpu_map_entry_alloc() returns.
After apply the fix, the error handle for kthread_stop() will be
unnecessary because it will always return 0, so just remove it.
Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
During unregister_netdevice_many_notify(), the ordering of our concerned
function calls is like this:
unregister_netdevice_many_notify
dev_shutdown
qdisc_put
clsact_destroy
tcx_uninstall
The syzbot reproducer triggered a case that the qdisc refcnt is not
zero during dev_shutdown().
tcx_uninstall() will then WARN_ON_ONCE(tcx_entry(entry)->miniq_active)
because the miniq is still active and the entry should not be freed.
The latter assumed that qdisc destruction happens before tcx teardown.
This fix is to avoid tcx_uninstall() doing tcx_entry_free() when the
miniq is still alive and let the clsact_destroy() do the free later, so
that we do not assume any specific ordering for either of them.
If still active, tcx_uninstall() does clear the entry when flushing out
the prog/link. clsact_destroy() will then notice the "!tcx_entry_is_active()"
and then does the tcx_entry_free() eventually.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/222255fe07cb58f15ee662e7ee78328af5b438e4.1690549248.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Up until now, /sys/kernel/tracing/events was no different than any other
part of tracefs. The files and directories within the events directory was
created when the tracefs was mounted, and also created for the instances in
/sys/kernel/tracing/instances/<instance>/events. Most of these files and
directories will never be referenced. Since there are thousands of these
files and directories they spend their time wasting precious memory
resources.
Move the "events" directory to the new eventfs. The eventfs will take the
meta data of the events that they represent and store that. When the files
in the events directory are referenced, the dentry and inodes to represent
them are then created. When the files are no longer referenced, they are
freed. This saves the precious memory resources that were wasted on these
seldom referenced dentries and inodes.
Running the following:
~# cat /proc/meminfo /proc/slabinfo > before.out
~# mkdir /sys/kernel/tracing/instances/foo
~# cat /proc/meminfo /proc/slabinfo > after.out
to test the changes produces the following deltas:
Before this change:
Before after deltas for meminfo:
MemFree: -32260
MemAvailable: -21496
KReclaimable: 21528
Slab: 22440
SReclaimable: 21528
SUnreclaim: 912
VmallocUsed: 16
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
tracefs_inode_cache: 14472 [* 1184 = 17134848]
buffer_head: 24 [* 168 = 4032]
hmem_inode_cache: 28 [* 1480 = 41440]
dentry: 14450 [* 312 = 4508400]
lsm_inode_cache: 14453 [* 32 = 462496]
vma_lock: 11 [* 152 = 1672]
vm_area_struct: 2 [* 184 = 368]
trace_event_file: 1748 [* 88 = 153824]
kmalloc-256: 1072 [* 256 = 274432]
kmalloc-64: 2842 [* 64 = 181888]
Total slab additions in size: 22,763,400 bytes
With this change:
Before after deltas for meminfo:
MemFree: -12600
MemAvailable: -12580
Cached: 24
Active: 12
Inactive: 68
Inactive(anon): 48
Active(file): 12
Inactive(file): 20
Dirty: -4
AnonPages: 68
KReclaimable: 12
Slab: 1856
SReclaimable: 12
SUnreclaim: 1844
KernelStack: 16
PageTables: 36
VmallocUsed: 16
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
tracefs_inode_cache: 108 [* 1184 = 127872]
buffer_head: 24 [* 168 = 4032]
hmem_inode_cache: 18 [* 1480 = 26640]
dentry: 127 [* 312 = 39624]
lsm_inode_cache: 152 [* 32 = 4864]
vma_lock: 67 [* 152 = 10184]
vm_area_struct: -12 [* 184 = -2208]
trace_event_file: 1764 [* 96 = 169344]
kmalloc-96: 14322 [* 96 = 1374912]
kmalloc-64: 2814 [* 64 = 180096]
kmalloc-32: 1103 [* 32 = 35296]
kmalloc-16: 2308 [* 16 = 36928]
kmalloc-8: 12800 [* 8 = 102400]
Total slab additions in size: 2,109,984 bytes
Which is a savings of 20,653,416 bytes (20 MB) per tracing instance.
Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In the process of parsing the DTS, check whether the memory region
specified by the DTS CMA node area overlaps with the kernel text
memory space reserved by memblock before calling
early_init_fdt_scan_reserved_mem.
Signed-off-by: Binglei Wang <l3b2w1@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The kernel parameter 'cma_pernuma=' only supports reserving the same
size of CMA area for each node. We need to reserve different sizes of
CMA area for specified nodes if these devices belong to different nodes.
Adding another kernel parameter 'numa_cma=' to reserve CMA area for
the specified node. If we want to use one of these parameters, we need to
enable DMA_NUMA_CMA.
At the same time, print the node id in cma_declare_contiguous_nid() if
CONFIG_NUMA is enabled.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In the commit b7176c261c ("dma-contiguous: provide the ability to
reserve per-numa CMA"), Barry adds DMA_PERNUMA_CMA for ARM64.
But this feature is architecture independent, so support per-numa CMA
for all architectures, and enable it by default if NUMA.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This function has a __weak definition and an override that is only used on
freescale powerpc chips. The powerpc definition however does not see the
declaration that is in a .c file:
arch/powerpc/kernel/dma-mask.c:7:6: error: no previous prototype for 'arch_dma_set_mask' [-Werror=missing-prototypes]
Move it into the linux/dma-map-ops.h header where the other arch_dma_* functions
are declared.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drivers have no business looking at dma-mapping or swiotlb internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Commit e1572f1d08 ("cpu/SMT: create and export cpu_smt_possible()")
introduces cpu_smt_possible() to represent if SMT is theoretically
possible. It returns true when SMT is supported and not forcefully
disabled ('nosmt=force'). But the comment of it says "Returns true if
SMT is not supported of forcefully (irreversibly) disabled", which is
wrong. Fix that comment accordingly.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230728155313.44170-1-rui.zhang@intel.com
There is a possibility of deadlock if synchronize_hardirq() is called
when the nested threaded interrupt is active. The following scenario
was observed on a uniprocessor PREEMPT_NONE system:
Thread 1 Thread 2
handle_nested_thread()
Set INPROGRESS
Call ->thread_fn()
thread_fn goes to sleep
free_irq()
__synchronize_hardirq()
Busy-loop forever waiting for INPROGRESS
to be cleared
The INPROGRESS flag is only supposed to be used for hard interrupt
handlers. Remove the incorrect usage in the nested threaded interrupt
case and instead re-use the threads_active / wait_for_threads mechanism
to wait for nested threaded interrupts to complete.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613-genirq-nested-v3-1-ae58221143eb@axis.com
Revert commit 7aa55f2a59 ("sched/fair: Move unused stub functions to
header"), for while it has the right Changelog, the actual patch
content a revert of the previous 4 patches:
f7df852ad6 ("sched: Make task_vruntime_update() prototype visible")
c0bdfd72fb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
378be384e0 ("sched: Add schedule_user() declaration")
d55ebae3f3 ("sched: Hide unused sched_update_scaling()")
So in effect this is a revert of a revert and re-applies those
patches.
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The creation of the trace event directory requires that a TRACE_SYSTEM is
defined that the trace event directory is added within the system it was
defined in.
The code handled the case where a TRACE_SYSTEM was not added, and would
then add the event at the events directory. But nothing should be doing
this. This code also prevents the implementation of creating dynamic
dentrys for the eventfs system.
As this path has never been hit on correct code, remove it. If it does get
hit, issues a WARN_ON_ONCE() and return ENODEV.
Link: https://lkml.kernel.org/r/1690568452-46553-2-git-send-email-akaher@vmware.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently we can resize trace ringbuffer by writing a value into file
'buffer_size_kb', then by reading the file, we get the value that is
usually what we wrote. However, this value may be not actual size of
trace ring buffer because of the round up when doing resize in kernel,
and the actual size would be more useful.
Link: https://lore.kernel.org/linux-trace-kernel/20230705002705.576633-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As the trace iterator is created and used by various interfaces, the clean
up of it needs to be consistent. Create a free_trace_iter_content() helper
function that frees the content of the iterator and use that to clean it
up in all places that it is used.
Link: https://lkml.kernel.org/r/20230715141348.341887497@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The iterator allocated a descriptor to copy the current_trace. This was done
with the assumption that the function pointers might change. But this was a
false assuption, as it does not change. There's no reason to make a copy of the
current_trace and just use the pointer it points to. This removes needing to
manage freeing the descriptor. Worse yet, there's locations that the iterator
is used but does make a copy and just uses the pointer. This could cause the
actual pointer to the trace descriptor to be freed and not the allocated copy.
This is more of a clean up than a fix.
Link: https://lkml.kernel.org/r/20230715141348.135792275@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).
No functional change intended.
Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For backward compatibility, older tooling expects to see the kernel_stack
event with a "caller" field that is a fixed size array of 8 addresses. The
code now supports more than 8 with an added "size" field that states the
real number of entries. But the "caller" field still just looks like a
fixed size to user space.
Since the tracing macros that create the user space format files also
creates the structures that those files represent, the kernel_stack event
structure had its "caller" field a fixed size of 8, but in reality, when
it is allocated on the ring buffer, it can hold more if the stack trace is
bigger that 8 functions. The copying of these entries was simply done with
a memcpy():
size = nr_entries * sizeof(unsigned long);
memcpy(entry->caller, fstack->calls, size);
The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was
larger than 8, that the memcpy() was writing more than what the structure
stated it can hold and it complained about it. This is because the
FORTIFY_SOURCE code is unaware that the amount allocated is actually
enough to hold the size. It does not expect that a fixed size field will
hold more than the fixed size.
This was originally solved by hiding the caller assignment with some
pointer arithmetic.
ptr = ring_buffer_data();
entry = ptr;
ptr += offsetof(typeof(*entry), caller);
memcpy(ptr, fstack->calls, size);
But it is considered bad form to hide from kernel hardening. Instead, make
it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro
that is specific for this one special use case. The macro will take 4
arguments: type, item, len, field (whereas the __array() macro takes just
the first three). This macro will act just like the __array() macro when
creating the code to deal with the format file that is exposed to user
space. But for the kernel, it will turn the caller field into:
type item[] __counted_by(field);
or for this instance:
unsigned long caller[] __counted_by(size);
Now the kernel code can expose the assignment of the caller to the
FORTIFY_SOURCE and everyone is happy!
Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
- probe-events: Fix to add NULL check for some BTF API calls which can
return error code and NULL.
- ftrace selftests: Fix to check fprobe and kprobe event correctly. This
fixes a miss condition of the test command.
- kprobes: Prohibit probing on the function which starts from "__cfi_"
and "__pfx_" since those are auto generated for kernel CFI and not
executed.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTGdH4bHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bmMAH/0qTHII0KYQDvrNJ40tT
SDM8+4zOJEtnjVYq87+4EWBhpVEL3VbLRJaprjXh40lZJrCP3MglCF152p4bOhgb
ZrjWuTAgE0N+rBhdeUJlzy3iLzl0G9dzfA+sn1XMcW+/HSPstJcjAG6wD7ROeZzL
XCxzE+NY6Y6mYbB52DaS8Hv7g7WccaTV+KeRjokhMPt+u7/KItJ4hQb/RXtAL31S
n4thCeVllaPBuc7m2CmKwJ9jzOg7/0qpAIUGx1Z+Khy/3YfRhG1nT93GxP8hLmad
SH9kGps09WXF5f8FbjYglOmq7ioDbIUz3oXPQRZYPymV8A0EU+b+/8IsRog1ySd1
BVk=
=qKWS
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probe fixes from Masami Hiramatsu:
- probe-events: add NULL check for some BTF API calls which can return
error code and NULL.
- ftrace selftests: check fprobe and kprobe event correctly. This fixes
a miss condition of the test command.
- kprobes: do not allow probing functions that start with "__cfi_" or
"__pfx_" since those are auto generated for kernel CFI and not
executed.
* tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobes: Prohibit probing on CFI preamble symbol
selftests/ftrace: Fix to check fprobe event eneblement
tracing/probes: Fix to add NULL check for BTF APIs
between the lock waiters and the PI chain tree (->pi_waiters) of
a task by giving each tree their own sort key
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTGMGAACgkQEsHwGGHe
VUrqNw/9F7p+/eDXP9TURup3Ms87lOlqCJg2CClTRiPF/FZcVzVbmltWewLDOWC6
GBEa7l++lpKwIGCuzg99tDScutdlCw8mQe4knrFvrJc04zzPV3O4N0Tiw1eD8wIw
/i6mzRaEuB5R8nxd9YK24gEsijerxhgVwKFvsCjbTV71LrY1geCJRqaqnooVnNm7
kvy+UIuckFJL8GDsPBJNsZdTyBrs7lC+m4su5QHr7FkQmF+4iEbDkwlYqPzafhVL
e4/FRjWHC5HRKd69j8fqyYZP8ryUlrmHFxkeAE08CxiqwdOELNHrp+i/x2GN9FTl
wL8psVl9oVt2KT79Py0e40ZOoUWHDOugx2IGzWl13C0XFKY+QH2HR6C+gJdedPMQ
tBtNyf8Ivukp+SSjyUXu8q9dGLJfdxtIKmbNHOuZDU121Fqg5+b04vQkzdgkMwJh
OoYj0fDUMBa6Zfv1PQkkSRU2wMQlgGTjMiCEvUjAs3vGBFQcu+flAm2UOr8NDLaF
0r3wNRVzk53k85R0L49oS8aU2FOcaHlpYXOevh0NYAt68EbmZctns85SJRfrwNGE
QuykN8yDID/zjGsjSSF6L1IaQSjM1ieozPVgqP4RLIxqaM7n/BJfZqiLTn0gBrbc
tEaoSIay6/zybKPCDeVgMsTSlX8EOoRKD/1Daog+zfYkUt19U/w=
=97eo
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Fix a rtmutex race condition resulting from sharing of the sort key
between the lock waiters and the PI chain tree (->pi_waiters) of a
task by giving each tree their own sort key
* tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Fix task->pi_waiters integrity
- Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.
If a resize shrinks the buffer it clears the read count to notify
readers that they need to reset. But the read count is also used for
accounting and this causes the numbers to be off. Instead, create a
separate variable to use to notify readers to reset.
- Fix the ref counts of the "soft disable" mode. The wrong value was
used for testing if soft disable mode should be enabled or disable,
but instead, just change the logic to do the enable and disable
in place when the SOFT_MODE is set or cleared.
- Several kernel-doc fixes
- Removal of unused external declarations
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZMVd3xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgzwAQCBHA3gf30GChf0EZUdIVueA31/1n2Z
2ZW1VeKUHQufpAEAuzbkeTdaj6bbpsT5T1Pf3zUIvpHs7kOYJWQq+75GBAI=
=9o02
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.
If a resize shrinks the buffer it clears the read count to notify
readers that they need to reset. But the read count is also used for
accounting and this causes the numbers to be off. Instead, create a
separate variable to use to notify readers to reset.
- Fix the ref counts of the "soft disable" mode. The wrong value was
used for testing if soft disable mode should be enabled or disable,
but instead, just change the logic to do the enable and disable in
place when the SOFT_MODE is set or cleared.
- Several kernel-doc fixes
- Removal of unused external declarations
* tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix warning in trace_buffered_event_disable()
ftrace: Remove unused extern declarations
tracing: Fix kernel-doc warnings in trace_seq.c
tracing: Fix kernel-doc warnings in trace_events_trigger.c
tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c
ring-buffer: Fix kernel-doc warnings in ring_buffer.c
ring-buffer: Fix wrong stat of cpu_buffer->read
Do not allow to probe on "__cfi_" or "__pfx_" started symbol, because those
are used for CFI and not executed. Probing it will break the CFI.
Link: https://lore.kernel.org/all/168904024679.116016.18089228029322008512.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Warning happened in trace_buffered_event_disable() at
WARN_ON_ONCE(!trace_buffered_event_ref)
Call Trace:
? __warn+0xa5/0x1b0
? trace_buffered_event_disable+0x189/0x1b0
__ftrace_event_enable_disable+0x19e/0x3e0
free_probe_data+0x3b/0xa0
unregister_ftrace_function_probe_func+0x6b8/0x800
event_enable_func+0x2f0/0x3d0
ftrace_process_regex.isra.0+0x12d/0x1b0
ftrace_filter_write+0xe6/0x140
vfs_write+0x1c9/0x6f0
[...]
The cause of the warning is in __ftrace_event_enable_disable(),
trace_buffered_event_enable() was called once while
trace_buffered_event_disable() was called twice.
Reproduction script show as below, for analysis, see the comments:
```
#!/bin/bash
cd /sys/kernel/tracing/
# 1. Register a 'disable_event' command, then:
# 1) SOFT_DISABLED_BIT was set;
# 2) trace_buffered_event_enable() was called first time;
echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \
set_ftrace_filter
# 2. Enable the event registered, then:
# 1) SOFT_DISABLED_BIT was cleared;
# 2) trace_buffered_event_disable() was called first time;
echo 1 > events/initcall/initcall_finish/enable
# 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was
# set again!!!
cat /proc/cmdline
# 4. Unregister the 'disable_event' command, then:
# 1) SOFT_DISABLED_BIT was cleared again;
# 2) trace_buffered_event_disable() was called second time!!!
echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \
set_ftrace_filter
```
To fix it, IIUC, we can change to call trace_buffered_event_enable() at
fist time soft-mode enabled, and call trace_buffered_event_disable() at
last time soft-mode disabled.
Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fix kernel-doc warning:
kernel/trace/trace_seq.c:142: warning: Function parameter or member
'args' not described in 'trace_seq_vprintf'
Link: https://lkml.kernel.org/r/20230724140827.1023266-5-cuigaosheng1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fix kernel-doc warnings:
kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'buffer' not described in 'event_triggers_call'
kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'event' not described in 'event_triggers_call'
Link: https://lkml.kernel.org/r/20230724140827.1023266-4-cuigaosheng1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fix kernel-doc warning:
kernel/trace/trace_events_synth.c:1257: warning: Function parameter
or member 'mod' not described in 'synth_event_gen_cmd_array_start'
Link: https://lkml.kernel.org/r/20230724140827.1023266-3-cuigaosheng1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fix kernel-doc warnings:
kernel/trace/ring_buffer.c:954: warning: Function parameter or
member 'cpu' not described in 'ring_buffer_wake_waiters'
kernel/trace/ring_buffer.c:3383: warning: Excess function parameter
'event' description in 'ring_buffer_unlock_commit'
kernel/trace/ring_buffer.c:5359: warning: Excess function parameter
'cpu' description in 'ring_buffer_reset_online_cpus'
Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set
to 0 in order to make sure any read iterators reset themselves. However,
this will mess 'entries' stating, see following steps:
# cd /sys/kernel/tracing/
# 1. Enlarge ring buffer prepare for later reducing:
# echo 20 > per_cpu/cpu0/buffer_size_kb
# 2. Write a log into ring buffer of cpu0:
# taskset -c 0 echo "hello1" > trace_marker
# 3. Read the log:
# cat per_cpu/cpu0/trace_pipe
<...>-332 [000] ..... 62.406844: tracing_mark_write: hello1
# 4. Stop reading and see the stats, now 0 entries, and 1 event readed:
# cat per_cpu/cpu0/stats
entries: 0
[...]
read events: 1
# 5. Reduce the ring buffer
# echo 7 > per_cpu/cpu0/buffer_size_kb
# 6. Now entries became unexpected 1 because actually no entries!!!
# cat per_cpu/cpu0/stats
entries: 1
[...]
read events: 0
To fix it, introduce 'page_removed' field to count total removed pages
since last reset, then use it to let read iterators reset themselves
instead of changing the 'read' pointer.
Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In internal testing of test_maps, we sometimes observed failures like:
test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.
where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:
[ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
[ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7
[ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
[ 91.304721] Call Trace:
[ 91.304724] <TASK>
[ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
[ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
[ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
[ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
[ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
[ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
[ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
[ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0
[ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
[ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
[ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
[ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
This makes sense, because in atomic context, percpu allocation would
not create new chunks; it would only create in non-atomic contexts.
And if during prefill all precpu chunks are full, -ENOMEM would
happen immediately upon next unit_alloc.
Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM.
GFP_NOWAIT has to be used in unit_alloc when bpf program runs
in atomic context. Even if bpf program runs in non-atomic context,
in most cases, rcu read lock is enabled for the program so
GFP_NOWAIT is still needed. This is often also the case for
BPF_MAP_UPDATE_ELEM syscalls.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:
.../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
.../bpf-next/kernel/bpf/verifier.c:5901:14:
error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
if (s64_max >= 0 == s64_min >= 0) {
~~~~~~~~^~~~
.../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
.../bpf-next/kernel/bpf/verifier.c:5965:14:
error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
if (s32_min >= 0 == s32_max >= 0) {
~~~~~~~~^~~~
To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.
I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support to the /sys/devices/system/cpu/smt/control interface for
enabling a specified number of SMT threads per core, including partial
SMT states where not all threads are brought online.
The current interface accepts "on" and "off", to enable either 1 or all
SMT threads per core.
This commit allows writing an integer, between 1 and the number of SMT
threads supported by the machine. Writing 1 is a synonym for "off", 2 or
more enables SMT with the specified number of threads.
When reading the file, if all threads are online "on" is returned, to
avoid changing behaviour for existing users. If some other number of
threads is online then the integer value is returned.
Architectures like x86 only supporting 1 thread or all threads, should not
define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architecture supporting partial SMT
states, like PowerPC, should define it.
[ ldufour: Slightly reword the commit's description ]
[ ldufour: Remove switch() in __store_smt_control() ]
[ ldufour: Rix build issue in control_show() ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-8-ldufour@linux.ibm.com
Some architectures allows partial SMT states, i.e. when not all SMT threads
are brought online.
To support that, add an architecture helper which checks whether a given
CPU is allowed to be brought online depending on how many SMT threads are
currently enabled. Since this is only applicable to architecture supporting
partial SMT, only these architectures should select the new configuration
variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, not
supporting the partial SMT states, there is no need to define
topology_cpu_smt_allowed(), the generic code assumed that all the threads
are allowed or only the primary ones.
Call the helper from cpu_smt_enable(), and cpu_smt_allowed() when SMT is
enabled, to check if the particular thread should be onlined. Notably,
also call it from cpu_smt_disable() if CPU_SMT_ENABLED, to allow
offlining some threads to move from a higher to lower number of threads
online.
[ ldufour: Slightly reword the commit's description ]
[ ldufour: Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-7-ldufour@linux.ibm.com
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(),
checking that value is enough to know whether SMT is supported.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com
Some architectures allow partial SMT states at boot time, ie. when not all
SMT threads are brought online.
To support that the SMT code needs to know the maximum number of SMT
threads, and also the currently configured number.
The architecture code knows the max number of threads, so have the
architecture code pass that value to cpu_smt_set_num_threads(). Note that
although topology_max_smt_threads() exists, it is not configured early
enough to be used here. As architecture, like PowerPC, allows the threads
number to be set through the kernel command line, also pass that value.
[ ldufour: Slightly reword the commit message ]
[ ldufour: Rename cpu_smt_check_topology and add a num_threads argument ]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-5-ldufour@linux.ibm.com
Move the simple exit cases, i.e. those which don't depend on the value
written, earlier in the function. That makes it clearer that regardless of
the input those states cannot be transitioned out of.
That does have a user-visible effect, in that the error returned will
now always be EPERM/ENODEV for those states, regardless of the value
written. Previously writing an invalid value would return EINVAL even
when in those states.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-4-ldufour@linux.ibm.com
In order to export the cpuhp_smt_control enum as part of the interface
between generic and architecture code, the architecture code needs to
include asm/topology.h.
But that leads to circular header dependencies. So split the enum and
related declarations into a separate header.
[ ldufour: Reworded the commit's description ]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-3-ldufour@linux.ibm.com
The commit 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to
CPUHP_BP_KICK_AP_STATE") introduce a dependancy against a global variable
cpu_primary_thread_mask exported by the X86 code. This variable is only
used when CONFIG_HOTPLUG_PARALLEL is set.
Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used
when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set.
No functional change.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-2-ldufour@linux.ibm.com
Add asm support for new instructions so kernel verifier and bpftool
xlated insn dumps can have proper asm syntax for new instructions.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The existing 'be' and 'le' insns will do conditional bswap
depends on host endianness. This patch implements
unconditional bswap insns.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, if user accesses a ctx member with signed types,
the compiler will generate an unsigned load followed by
necessary left and right shifts.
With the introduction of sign-extension load, compiler may
just emit a ldsx insn instead. Let us do a final movsx sign
extension to the final unsigned ctx load result to
satisfy original sign extension requirement.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mark the time KUnit test, time64_to_tm_test_date_range, as slow using test
attributes.
This test ran relatively much slower than most other KUnit tests.
By marking this test as slow, the test can now be filtered using the KUnit
test attribute filtering feature. Example: --filter "speed>slow". This will
run only the tests that have speeds faster than slow. The slow attribute
will also be outputted in KTAP.
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Rae Moar <rmoar@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Commit eda0047296 ("mm: make the page fault mmap locking killable")
intentionally made it much easier to trigger the "page fault fails
because a fatal signal is pending" situation, by having the mmap locking
fail early in that case.
We have long aborted page faults in other fatal cases when the actual IO
for a page is interrupted by SIGKILL - which is particularly useful for
the traditional case of NFS hanging due to network issues, but local
filesystems could cause it too if you happened to get the SIGKILL while
waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()).
So aborting the page fault wasn't a new condition - but it now triggers
earlier, before we even get to 'handle_mm_fault()'. And as a result the
error doesn't go through our 'fault_signal_pending()' logic, and doesn't
get filtered away there.
Normally you'd never even notice, because if a fatal signal is pending,
the new SIGSEGV we send ends up being ignored anyway.
But it turns out that there is one very noticeable exception: if you
enable 'show_unhandled_signals', the aborted page fault will be logged
in the kernel messages, and you'll get a scary line looking something
like this in your logs:
pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)
which is rather misleading. It's not really a segfault at all, it's
just "the thread was killed before the page fault completed, so we
aborted the page fault".
Fix this by just making it clear that a pending fatal signal means that
any new signal coming in after that is implicitly handled. This will
avoid the misleading logging, since now the signal isn't 'unhandled' any
more.
Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: eda0047296 ("mm: make the page fault mmap locking killable")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.
The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6b ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]
Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.
cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.
In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util
Fixes: 7d0583cf9e ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
Since find_btf_func_param() abd btf_type_by_id() can return NULL,
the caller must check the return value correctly.
Link: https://lore.kernel.org/all/169024903951.395371.11361556840733470934.stgit@devnote2/
Fixes: b576e09701 ("tracing/probes: Support function parameters if BTF is available")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Splitting these out into separate helper functions means that we
actually pass an uninitialized variable into another function call
if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT
is disabled:
kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
200 | dec_active(c, flags);
Avoid this by passing the flags by reference, so they either get
initialized and dereferenced through a pointer, or the pointer never
gets accessed at all.
Fixes: 18e027b1c7 ("bpf: Factor out inc/dec of active flag into helpers.")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We received report [1] of kernel crash, which is caused by
using nesting protection without disabled preemption.
The bpf_event_output can be called by programs executed by
bpf_prog_run_array_cg function that disabled migration but
keeps preemption enabled.
This can cause task to be preempted by another one inside the
nesting protection and lead eventually to two tasks using same
perf_sample_data buffer and cause crashes like:
BUG: kernel NULL pointer dereference, address: 0000000000000001
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
...
? perf_output_sample+0x12a/0x9a0
? finish_task_switch.isra.0+0x81/0x280
? perf_event_output+0x66/0xa0
? bpf_event_output+0x13a/0x190
? bpf_event_output_data+0x22/0x40
? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
? xa_load+0x87/0xe0
? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
? release_sock+0x3e/0x90
? sk_setsockopt+0x1a1/0x12f0
? udp_pre_connect+0x36/0x50
? inet_dgram_connect+0x93/0xa0
? __sys_connect+0xb4/0xe0
? udp_setsockopt+0x27/0x40
? __pfx_udp_push_pending_frames+0x10/0x10
? __sys_setsockopt+0xdf/0x1a0
? __x64_sys_connect+0xf/0x20
? do_syscall_64+0x3a/0x90
? entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fixing this by disabling preemption in bpf_event_output.
[1] https://github.com/cilium/cilium/issues/26756
Cc: stable@vger.kernel.org
Reported-by: Oleg "livelace" Popov <o.popov@livelace.ru>
Closes: https://github.com/cilium/cilium/issues/26756
Fixes: 2a916f2f54 ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The nesting protection in bpf_perf_event_output relies on disabled
preemption, which is guaranteed for kprobes and tracepoints.
However bpf_perf_event_output can be also called from uprobes context
through bpf_prog_run_array_sleepable function which disables migration,
but keeps preemption enabled.
This can cause task to be preempted by another one inside the nesting
protection and lead eventually to two tasks using same perf_sample_data
buffer and cause crashes like:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle page fault for address: ffffffff82be3eea
...
Call Trace:
? __die+0x1f/0x70
? page_fault_oops+0x176/0x4d0
? exc_page_fault+0x132/0x230
? asm_exc_page_fault+0x22/0x30
? perf_output_sample+0x12b/0x910
? perf_event_output+0xd0/0x1d0
? bpf_perf_event_output+0x162/0x1d0
? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
? __uprobe_perf_func+0x12b/0x540
? uprobe_dispatcher+0x2c4/0x430
? uprobe_notify_resume+0x2da/0xce0
? atomic_notifier_call_chain+0x7b/0x110
? exit_to_user_mode_prepare+0x13e/0x290
? irqentry_exit_to_user_mode+0x5/0x30
? asm_exc_int3+0x35/0x40
Fixing this by disabling preemption in bpf_perf_event_output.
Cc: stable@vger.kernel.org
Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.
The default threshold is 10ms which is long enough to do fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguable spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
apart by multiple levels of magnitude doesn't make whole lot of sense.
This patch scales up wq_cpu_intensive_thresh_us upto 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are sufficient number of
fast machines reporting, we don't lose much.
Some (or is it all?) ARM CPUs systemtically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
The passed parameter to sysrq handlers is a key (a character). So change
the type from 'int' to 'u8'. Let it specifically be 'u8' for two
reasons:
* unsigned: unsigned values come from the upper layers (devices) and the
tty layer assumes unsigned on most places, and
* 8-bit: as that what's supposed to be one day in all the layers built
on the top of tty. (Currently, we use mostly 'unsigned char' and
somewhere still only 'char'. (But that also translates to the former
thanks to -funsigned-char.))
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de> # DRM
Acked-by: WANG Xuerui <git@xen0n.name> # loongarch
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20230712081811.29004-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Trying to restrict the '$'-prefix change to RISC-V caused some fallout,
so let's just treat all those symbols as special.
Fixes: c05780ef3c ("module: Ignore RISC-V mapping symbols too")
Link: https://lore.kernel.org/all/20230712015747.77263-1-wangkefeng.wang@huawei.com/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-84-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
On ChromeOS we've observed a considerable number of in-use pages filled with
zeros. Today with hibernate it's entirely possible that saveable pages are just
zero filled. Since we're already copying pages word-by-word in do_copy_page it
becomes almost free to determine if a page was completely filled with zeros.
This change introduces a new bitmap which will track these zero pages. If a page
is zero it will not be included in the saved image, instead to track these zero
pages in the image file we will introduce a new flag which we will set on the
packed PFN list. When reading back in the image file we will detect these zero
page PFNs and rebuild the zero page bitmap.
When the image is being loaded through calls to write_next_page if we encounter
a zero page we will silently memset it to 0 and then continue on to the next
page. Given the implementation in snapshot_read_next/snapshot_write_next this
change will be transparent to non-compressed/compressed and swsusp modes of
operation.
To provide some concrete numbers from simple ad-hoc testing, on a device which
was lightly in use we saw that:
PM: hibernation: Image created (964408 pages copied, 548304 zero pages)
Of the approximately 6.2GB of saveable pages 2.2GB (36%) were just zero filled
and could be tracked entirely within the packed PFN list. The savings would
obviously be much lower for lzo compressed images, but even in the case of
compression not copying pages across to the compression threads will still
speed things up. It's also possible that we would see better overall compression
ratios as larger regions of "real data" would improve the compressibility.
Finally, such an approach could dramatically improve swsusp performance
as each one of those zero pages requires a write syscall to reload, by
handling it as part of the packed PFN list we're able to fully avoid
that.
Signed-off-by: Brian Geffon <bgeffon@google.com>
[ rjw: Whitespace adjustments, removal of redundant parentheses ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Swapping the ring buffer for snapshotting (for things like irqsoff)
can crash if the ring buffer is being resized. Disable swapping
when this happens. The missed swap will be reported to the tracer.
- Report error if the histogram fails to be created due to an error in
adding a histogram variable, in event_hist_trigger_parse().
- Remove unused declaration of tracing_map_set_field_descr().
Chen Lin (1):
ring-buffer: Do not swap cpu_buffer during resize process
Mohamed Khalfella (1):
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
YueHaibing (1):
tracing: Remove unused extern declaration tracing_map_set_field_descr()
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZL2IixQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsHAAQCS/VLpMOA5AS9JWvwuEnGAVymyJcGS
jmnWkuMmf5fPpQD/di/xY1clLNhz6P7PAZvR3N6qw3AsNjPW/ZapDkrRWQA=
=RoHL
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
can crash if the ring buffer is being resized. Disable swapping when
this happens. The missed swap will be reported to the tracer
- Report error if the histogram fails to be created due to an error in
adding a histogram variable, in event_hist_trigger_parse()
- Remove unused declaration of tracing_map_set_field_descr()
* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
ring-buffer: Do not swap cpu_buffer during resize process
tracing: Remove unused extern declaration tracing_map_set_field_descr()
Commit 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since commit 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID"),
cgrp is associated with its kernfs_node. Update corresponding comment.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change 'new_usage' type to u64 so it can be compared with unsigned 'max'
and 'capacity' properly even if the value crosses the signed boundary.
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After changes in commit 0590b9335a ("fixing audit rule ordering mess,
part 1"), audit_filter_inodes() returns void, so if CONFIG_AUDITSYSCALL
not defined, it should be do {} while(0).
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Current release - regressions:
- eth: r8169: multiple fixes for PCIe ASPM-related problems
- vrf: fix RCU lockdep splat in output path
Previous releases - regressions:
- gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
- dsa: mv88e6xxx: do a final check before timing out when polling
- nf_tables: fix sleep in atomic in nft_chain_validate
Previous releases - always broken:
- sched: fix undoing tcf_bind_filter() in multiple classifiers
- bpf, arm64: fix BTI type used for freplace attached functions
- can: gs_usb: fix time stamp counter initialization
- nft_set_pipapo: fix improper element removal (leading to UAF)
Misc:
- net: support STP on bridge in non-root netns, STP prevents
packet loops so not supporting it results in freezing systems
of unsuspecting users, and in turn very upset noises being made
- fix kdoc warnings
- annotate various bits of TCP state to prevent data races
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmS5pp0ACgkQMUZtbf5S
IrtudA/9Ep+URprI3tpv+VHOQMWtMd7lzz+wwEUDQSo2T6xdMcYbd1E4ZWWOPw/y
jTIIVF3qde4nuI/MZtzGhvCD8v4bzhw10uRm4f4vhC2i+CzXr/UdOQSMqeZmJZgN
vndixvRjHJKYxogOa+DjXgOiuQTQfuSfSnaai0kvw3zZzi4tev/Bdj6KZmFW+UK+
Q7uQZ5n8tdE4UvUdj8Jek23SZ4kL+HtQOIdAAqyduQnYnax5L5sbep0TjuCjjkpK
26rvmwYFJmEab4mC2T3Y7VDaXYM9M2f/EuFBMBVEohE3KPTTdT12WzLfJv7TTKTl
hymfXgfmCXiZElzoQTJ69bFGbhqFaCJwhCUHFwYqkqj0bW9cXYJD2achpi3nVgnn
CV8vfqJtkzdgh2bV2faG+1wmAm1wzHSURmT5NlnFaX6a6BYypaN7CERn7BnIdLM/
YA2wud39bL0EJsic5e3gtlyJdfhtx7iqCMzE7S5FiUZvgOmUhBZ4IWkMs6Aq5PpL
FLLgBSHGEIAdLVQGvXLjfQ/LeSrW8JsiSy6deztzR+ZflvvaBIP5y8sC3+KdxAvN
3ybMsMEE5OK3i808aV3l6/8DLeAJ+DWuMc96Ix7Yyt2LXFnnV79DX49zJAEUWrc7
54FnNzkgAO/Q9aEFmmQoFt5qZmoFHuNwcHBOmXARAatQqNCwDqk=
=Xifr
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from BPF, netfilter, bluetooth and CAN.
Current release - regressions:
- eth: r8169: multiple fixes for PCIe ASPM-related problems
- vrf: fix RCU lockdep splat in output path
Previous releases - regressions:
- gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
- dsa: mv88e6xxx: do a final check before timing out when polling
- nf_tables: fix sleep in atomic in nft_chain_validate
Previous releases - always broken:
- sched: fix undoing tcf_bind_filter() in multiple classifiers
- bpf, arm64: fix BTI type used for freplace attached functions
- can: gs_usb: fix time stamp counter initialization
- nft_set_pipapo: fix improper element removal (leading to UAF)
Misc:
- net: support STP on bridge in non-root netns, STP prevents packet
loops so not supporting it results in freezing systems of
unsuspecting users, and in turn very upset noises being made
- fix kdoc warnings
- annotate various bits of TCP state to prevent data races"
* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
net: phy: prevent stale pointer dereference in phy_init()
tcp: annotate data-races around fastopenq.max_qlen
tcp: annotate data-races around icsk->icsk_user_timeout
tcp: annotate data-races around tp->notsent_lowat
tcp: annotate data-races around rskq_defer_accept
tcp: annotate data-races around tp->linger2
tcp: annotate data-races around icsk->icsk_syn_retries
tcp: annotate data-races around tp->keepalive_probes
tcp: annotate data-races around tp->keepalive_intvl
tcp: annotate data-races around tp->keepalive_time
tcp: annotate data-races around tp->tsoffset
tcp: annotate data-races around tp->tcp_tx_delay
Bluetooth: MGMT: Use correct address for memcpy()
Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
Bluetooth: SCO: fix sco_conn related locking and validity issues
Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
Bluetooth: coredump: fix building with coredump disabled
Bluetooth: ISO: fix iso_conn related locking and validity issues
Bluetooth: hci_event: call disconnect callback before deleting conn
...
The ifdef-else logic is already in the header file, so include it
unconditionally, no functional changes here.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
[PM: fixed misspelling in the subject]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Currently abandon_console_lock_in_panic() is only used to determine if
the current CPU should immediately release the console lock because
another CPU is in panic. However, later this function will be used by
the CPU to immediately release other resources in this situation.
Rename the function to other_cpu_in_panic(), which is a better
description and does not assume it is related to the console lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-8-john.ogness@linutronix.de
Currently the global @console_suspended is used to determine if
consoles are in a suspended state. Its primary purpose is to allow
usage of the console_lock when suspended without causing console
printing. It is synchronized by the console_lock.
Rather than relying on the console_lock to determine suspended
state, make it an official per-console state that is set within
console->flags. This allows the state to be queried via SRCU.
Remove @console_suspended. Console printing will still be avoided
when suspended because console_is_usable() returns false when
the new suspended flag is set for that console.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-7-john.ogness@linutronix.de
Printing to consoles can be deferred for several reasons:
- explicitly with printk_deferred()
- printk() in NMI context
- recursive printk() calls
The current implementation is not consistent. For printk_deferred(),
irq work is scheduled twice. For NMI und recursive, panic CPU
suppression and caller delays are not properly enforced.
Correct these inconsistencies by consolidating the deferred printing
code so that vprintk_deferred() is the top-level function for
deferred printing and vprintk_emit() will perform whichever irq_work
queueing is appropriate.
Also add kerneldoc for wake_up_klogd() and defer_console_output() to
clarify their differences and appropriate usage.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-6-john.ogness@linutronix.de
Currently console_flush_on_panic() will attempt to acquire the
console lock when flushing the buffer on panic. If it fails to
acquire the lock, it continues anyway because this is the last
chance to get any pending records printed.
The reason why the console lock was attempted at all was to
prevent any other CPUs from acquiring the console lock for
printing while the panic CPU was printing. But as of the
previous commit, non-panic CPUs will no longer attempt to
acquire the console lock in a panic situation. Therefore it is
no longer strictly necessary for a panic CPU to acquire the
console lock.
Avoiding taking the console lock when flushing in panic has
the additional benefit of avoiding possible deadlocks due to
semaphore usage in NMI context (semaphores are not NMI-safe)
and avoiding possible deadlocks if another CPU accesses the
semaphore and is stopped while holding one of the semaphore's
internal spinlocks.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-5-john.ogness@linutronix.de
When in a panic situation, non-panic CPUs should avoid holding the
console lock so as not to contend with the panic CPU. This is already
implemented with abandon_console_lock_in_panic(), which is checked
after each printed line. However, non-panic CPUs should also avoid
trying to acquire the console lock during a panic.
Modify console_trylock() to fail and console_lock() to block() when
called from a non-panic CPU during a panic.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-4-john.ogness@linutronix.de
A semaphore is not NMI-safe, even when using down_trylock(). Both
down_trylock() and up() are using internal spinlocks and up()
might even call wake_up_process().
In the panic() code path it gets even worse because the internal
spinlocks of the semaphore may have been taken by a CPU that has
been stopped.
To reduce the risk of deadlocks caused by the console semaphore in
the panic path, make the following changes:
- First check if any consoles have implemented the unblank()
callback. If not, then there is no reason to take the console
semaphore anyway. (This check is also useful for the non-panic
path since the locking/unlocking of the console lock can be
quite expensive due to console printing.)
- If the panic path is in NMI context, bail out without attempting
to take the console semaphore or calling any unblank() callbacks.
Bailing out is acceptable because console_unblank() would already
bail out if the console semaphore is contended. The alternative of
ignoring the console semaphore and calling the unblank() callbacks
anyway is a bad idea because these callbacks are also not NMI-safe.
If consoles with unblank() callbacks exist and console_unblank() is
called from a non-NMI panic context, it will still attempt a
down_trylock(). This could still result in a deadlock if one of the
stopped CPUs is holding the semaphore internal spinlock. But this
is a risk that the kernel has been (and continues to be) willing
to take.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-3-john.ogness@linutronix.de
It is allowed for consoles to not provide a write() callback. For
example ttynull does this.
Check if a write() callback is available before using it.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-2-john.ogness@linutronix.de
Make it clear that this function always returns either true or false
without other planned failure modes.
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.
The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:
I cannot help but feel that priority logic copy-paste from old tc, netfilter
and friends is done because "that's how things were done in the past". [...]
Priority gets exposed everywhere in uapi all the way to bpftool when it's
right there for users to understand. And that's the main problem with it.
The user don't want to and don't need to be aware of it, but uapi forces them
to pick the priority. [...] Your cover letter [0] example proves that in
real life different service pick the same priority. They simply don't know
any better. Priority is an unnecessary magic that apps _have_ to pick, so
they just copy-paste and everyone ends up using the same.
The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.
Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:
- Support prog-based attach/detach and link API
- Dependency directives (can also be combined):
- BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
- BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
space application does not need CAP_SYS_ADMIN to retrieve foreign fds
via bpf_*_get_fd_by_id()
- BPF_F_LINK flag as {prog,link} toggle
- If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
BPF_F_AFTER will just append for attaching
- Enforced only at attach time
- BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
own infra for replacing their internal prog
- If no flags are set, then it's default append behavior for attaching
- Internal revision counter and optionally being able to pass expected_revision
- User space application can query current state with revision, and pass it
along for attachment to assert current state before doing updates
- Query also gets extension for link_ids array and link_attach_flags:
- prog_ids are always filled with program IDs
- link_ids are filled with link IDs when link was used, otherwise 0
- {prog,link}_attach_flags for holding {prog,link}-specific flags
- Must be easy to integrate/reuse for in-kernel users
The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.
The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.
The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.
An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.
Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.
[0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
[1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
[2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Register the bpf_map_sum_elem_count func for all programs, and update the
map_ptr subtest of the test_progs test to test the new functionality.
The usage is allowed as long as the pointer to the map is trusted (when
using tracing programs) or is a const pointer to map, as in the following
example:
struct {
__uint(type, BPF_MAP_TYPE_HASH);
...
} hash SEC(".maps");
...
static inline int some_bpf_prog(void)
{
struct bpf_map *map = (struct bpf_map *)&hash;
__s64 count;
count = bpf_map_sum_elem_count(map);
...
}
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We use the map pointer only to read the counter values, no locking
involved, so mark the argument as const.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the
values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs.
This, in turn, allows users to execute trusted kfuncs which accept `struct
bpf_map *` arguments from non-tracing programs.
While exporting the btf_bpf_map_id variable, save some bytes by defining
it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST
(which is u32[64]).
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The reg2btf_ids array contains a list of types for which we can (and need)
to find a corresponding static BTF id. All the types in the list can be
considered as trusted for purposes of kfuncs.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
Using lag is both more correct and simpler when moving between
runqueues.
Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.
Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.
One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute
the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.
Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.
Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.
Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.
Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
As described by Kumar in [0], in shared ownership scenarios it is
necessary to do runtime tracking of {rb,list} node ownership - and
synchronize updates using this ownership information - in order to
prevent races. This patch adds an 'owner' field to struct bpf_list_node
and bpf_rb_node to implement such runtime tracking.
The owner field is a void * that describes the ownership state of a
node. It can have the following values:
NULL - the node is not owned by any data structure
BPF_PTR_POISON - the node is in the process of being added to a data
structure
ptr_to_root - the pointee is a data structure 'root'
(bpf_rb_root / bpf_list_head) which owns this node
The field is initially NULL (set by bpf_obj_init_field default behavior)
and transitions states in the following sequence:
Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root
Removal: ptr_to_root -> NULL
Before a node has been successfully inserted, it is not protected by any
root's lock, and therefore two programs can attempt to add the same node
to different roots simultaneously. For this reason the intermediate
BPF_PTR_POISON state is necessary. For removal, the node is protected
by some root's lock so this intermediate hop isn't necessary.
Note that bpf_list_pop_{front,back} helpers don't need to check owner
before removing as the node-to-be-removed is not passed in as input and
is instead taken directly from the list. Do the check anyways and
WARN_ON_ONCE in this unexpected scenario.
Selftest changes in this patch are entirely mechanical: some BTF
tests have hardcoded struct sizes for structs that contain
bpf_{list,rb}_node fields, those were adjusted to account for the new
sizes. Selftest additions to validate the owner field are added in a
further patch in the series.
[0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Structs bpf_rb_node and bpf_list_node are opaquely defined in
uapi/linux/bpf.h, as BPF program writers are not expected to touch their
fields - nor does the verifier allow them to do so.
Currently these structs are simple wrappers around structs rb_node and
list_head and linked_list / rbtree implementation just casts and passes
to library functions for those data structures. Later patches in this
series, though, will add an "owner" field to bpf_{rb,list}_node, such
that they're not just wrapping an underlying node type. Moreover, the
bpf linked_list and rbtree implementations will deal with these owner
pointers directly in a few different places.
To avoid having to do
void *owner = (void*)bpf_list_node + sizeof(struct list_head)
with opaque UAPI node types, add bpf_{list,rb}_node_kern struct
definitions to internal headers and modify linked_list and rbtree to use
the internal types where appropriate.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While the check_max_stack_depth function explores call chains emanating
from the main prog, which is typically enough to cover all possible call
chains, it doesn't explore those rooted at async callbacks unless the
async callback will have been directly called, since unlike non-async
callbacks it skips their instruction exploration as they don't
contribute to stack depth.
It could be the case that the async callback leads to a callchain which
exceeds the stack depth, but this is never reachable while only
exploring the entry point from main subprog. Hence, repeat the check for
the main subprog *and* all async callbacks marked by the symbolic
execution pass of the verifier, as execution of the program may begin at
any of them.
Consider functions with following stack depths:
main: 256
async: 256
foo: 256
main:
rX = async
bpf_timer_set_callback(...)
async:
foo()
Here, async is not descended as it does not contribute to stack depth of
main (since it is referenced using bpf_pseudo_func and not
bpf_pseudo_call). However, when async is invoked asynchronously, it will
end up breaching the MAX_BPF_STACK limit by calling foo.
Hence, in addition to main, we also need to explore call chains
beginning at all async callback subprogs in a program.
Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The assignment to idx in check_max_stack_depth happens once we see a
bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of
the code performs a few checks and then pushes the frame to the frame
stack, except the case of async callbacks. If the async callback case
causes the loop iteration to be skipped, the idx assignment will be
incorrect on the next iteration of the loop. The value stored in the
frame stack (as the subprogno of the current subprog) will be incorrect.
This leads to incorrect checks and incorrect tail_call_reachable
marking. Save the target subprog in a new variable and only assign to
idx once we are done with the is_async_cb check which may skip pushing
of frame to the frame stack and subsequent stack depth checks and tail
call markings.
Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
So the variables can account for resources of huge quantities even on
32-bit machines.
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
post-6.5 issue.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZLboHQAKCRDdBJ7gKXxA
jqwtAP4m3MQNcYzQk8qbV+EQat/csTnrefytyD0ogFRoxcMAFAD/XT784sZzn4SU
s/mL1HLk1BsubT/yQmY3lISXHDPuPAo=
=5W3V
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Seven hotfixes, six of which are cc:stable and one of which addresses
a post-6.5 issue"
* tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
maple_tree: fix node allocation testing on 32 bit
maple_tree: fix 32 bit mas_next testing
selftests/mm: mkdirty: fix incorrect position of #endif
maple_tree: set the node limit when creating a new root node
mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
prctl: move PR_GET_AUXV out of PR_MCE_KILL
selftests/mm: give scripts execute permission
seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.
In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.
There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.
Right now, each interaction takes 12µs. With this patch, it takes about
3µs.
This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.
These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.
One example of such workloads is seccomp user notifies. This mechanism allows
the supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.
On-CPU context switches are much faster than regular ones.
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.
In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:
* if we use two different ways, we always need to call them both. This
patch fixes seccomp_notify_recv where we forgot to call wake_up_poll
in the error path.
* If we use one primitive, we can control how many waiters are woken up
for each request. Our goal is to wake up just one that will handle a
request. Right now, wake_up_poll can wake up one waiter and
up(&match->notif->request) can wake up one more.
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-2-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was
applied [1].
Thus move it out of the switch, to the place the patch added it.
In the recently released v6.4 kernel some user could, in principle, be
already using this feature by mapping the right page and passing the
PR_GET_AUXV constant as a pointer:
prctl(PR_MCE_KILL, PR_GET_AUXV, ...)
So this does change the behavior for users. We could keep the bug since
the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET)
do not overlap.
However, v6.4 may be recent enough (2 weeks old) that moving the lines
(rather than just adding a new case) does not break anybody? Moreover,
the documentation in man-pages was just committed today [2].
Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org
Fixes: ddc65971bb ("prctl: add PR_GET_AUXV to copy auxv to userspace")
Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1]
Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroup_create() creates cgrp and assigns the kernfs_node to cgrp->kn,
then cgroup_mkdir() populates base and csses cft file by calling
css_populate_dir() and cgroup_apply_control_enable() with a valid
cgrp->kn. Check for NULL cgrp->kn, will always be false, remove it.
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_taskset_migrate() has been renamed to cgroup_migrate_execute() since
commit e595cd7069 ("cgroup: track migration context in cgroup_mgctx").
Update the corresponding comment.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use local variable parent to initialize iter tcgrp in for loop so the size
of cgroup.o can be reduced by 64 bytes. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Henry reported that rt_mutex_adjust_prio_check() has an ordering
problem and puts the lie to the comment in [7]. Sharing the sort key
between lock->waiters and owner->pi_waiters *does* create problems,
since unlike what the comment claims, holding [L] is insufficient.
Notably, consider:
A
/ \
M1 M2
| |
B C
That is, task A owns both M1 and M2, B and C block on them. In this
case a concurrent chain walk (B & C) will modify their resp. sort keys
in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L]
is meaningless, they're different Ls.
This then gives rise to a race condition between [7] and [11], where
the requeue of pi_waiters will observe an inconsistent tree order.
B C
(holds M1->wait_lock, (holds M2->wait_lock,
holds B->pi_lock) holds A->pi_lock)
[7]
waiter_update_prio();
...
[8]
raw_spin_unlock(B->pi_lock);
...
[10]
raw_spin_lock(A->pi_lock);
[11]
rt_mutex_enqueue_pi();
// observes inconsistent A->pi_waiters
// tree order
Fixing this means either extending the range of the owner lock from
[10-13] to [6-13], with the immediate problem that this means [6-8]
hold both blocked and owner locks, or duplicating the sort key.
Since the locking in chain walk is horrible enough without having to
consider pi_lock nesting rules, duplicate the sort key instead.
By giving each tree their own sort key, the above race becomes
harmless, if C sees B at the old location, then B will correct things
(if they need correcting) when it walks up the chain and reaches A.
Fixes: fb00aca474 ("rtmutex: Turn the plist into an rb-tree")
Reported-by: Henry Wu <triangletrap12@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Henry Wu <triangletrap12@gmail.com>
Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
- Fix the idle sibling selection
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmS0N8AACgkQEsHwGGHe
VUq4qhAAhLlRz7V0txbvj65xrQAuF7RiqWQlIm6NYX+eWb5wpqasy8mTpVr37Lhg
JJIdSlI0sw3vIgnjgmuLU6e6MKO2pDWCqppv7CB4YjTT6/ifyIWvFotHTtfOBDjy
eTV/wEpfDKOKHJDdWHRdUjiDgktZs1gJVV8e65/ViuWoCEV3ARy7tS4liORiwNpr
RTuG6C3lDAfw6RKWCvBxDV15XhMDNYYQzN1bwedTAryP8jAFFUKce2de6HK6qyUF
pNqzX0eGkN+6TG13uKLJIsAfdydWHWvkXWqFzVZsS7Upu4lyAyjslHK5UEF/MgCy
4c4/xaqcDTKn8dtQycQ3FMwujbamHdnct9or4aDsEa9PIIAcab/gKxojNPlBSkCt
0FpBjJUP2X/DBGpMaNX02fwQhhXwj0dN/4OWI5kXrL2sctqzueMSfTUKpjsLZMAx
VTngoXVcVlLxqB2LqIzjhsWG+awlw/IKEKotjhX4vyra9L50CRbMLJo7hJdHD16J
dmGQmsJRlGyQYEOgoyI7BU/Bqs/+iE1r5fAcVgjWWUPotl7XKDgcwTpELTPj5xzO
b7ALkUXRcHf2NdfaLDZeyJzT/Z1mq1znPTsDqi94f1ASbmAd0iXLFNHtyJxXJabZ
/aTLUj/GmmaCw2R02S4l67uC/BHChToloNVCGsq2KfBYcUunod0=
=bSNb
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: use kernfs polling functions for PSI trigger polling
sched/fair: Use recent_used_cpu to test p->cpus_ptr
- fprobe: Add a comment why fprobe will be skipped if another kprobe is
running in fprobe_kprobe_handler().
- probe-events: Fix some issues related to fetch-argument
. Fix double counting of the string length for user-string and symstr.
This will require longer buffer in the array case.
. Fix not to count error code (minus value) for the total used length
in array argument. This makes the total used length shorter.
. Fix to update dynamic used data size counter only if fetcharg uses
the dynamic size data. This may mis-count the used dynamic data
size and corrupt data.
. Revert "tracing: Add "(fault)" name injection to kernel probes"
because that did not work correctly with a bug, and we agreed the
current '(fault)' output (instead of '"(fault)"' like a string)
explains what happened more clearly.
. Fix to record 0-length (means fault access) data_loc data in fetch
function itself, instead of store_trace_args(). If we record an
array of string, this will fix to save fault access data on each
entry of the array correctly.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmSxSlYACgkQ2/sHvwUr
PxupyAgApFDi9YGsmrVbXmIN5y+yGMyio2H6xR7XkX+L02nvDY6uVqL/jgT8pHfI
AeGZEA+EqwxIfWpYBfztsFej+Gl3Elfvu14OSxwaafUlW3mgZFQqw1ZR0HvzXoKJ
8Iw6WOXjhLe3/QLy43UY8JQGOKI07i3gh71wa0W0huOyiwwHuuVwPSY9QJJ2ulSg
OWFSuMFO8IxYimp0BpFu/vrfa8CdgWLc24tgJ5EpZtzu6L0A2I/FMZjnBukxnP9s
rjAXv0uRuSFvvF7/RGCqrLza12525qyHx7d5IWUq5shd3bCnaUOnAieF//MoJaR3
q8McDJK//EPbUvCWgESuuyPS05smyQ==
=iumA
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probe fixes from Masami Hiramatsu:
- fprobe: Add a comment why fprobe will be skipped if another kprobe is
running in fprobe_kprobe_handler().
- probe-events: Fix some issues related to fetch-arguments:
- Fix double counting of the string length for user-string and
symstr. This will require longer buffer in the array case.
- Fix not to count error code (minus value) for the total used
length in array argument. This makes the total used length
shorter.
- Fix to update dynamic used data size counter only if fetcharg uses
the dynamic size data. This may mis-count the used dynamic data
size and corrupt data.
- Revert "tracing: Add "(fault)" name injection to kernel probes"
because that did not work correctly with a bug, and we agreed the
current '(fault)' output (instead of '"(fault)"' like a string)
explains what happened more clearly.
- Fix to record 0-length (means fault access) data_loc data in fetch
function itself, instead of store_trace_args(). If we record an
array of string, this will fix to save fault access data on each
entry of the array correctly.
* tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Revert "tracing: Add "(fault)" name injection to kernel probes"
tracing/probes: Fix to update dynamic data counter if fetcharg uses it
tracing/probes: Fix not to count error code to total length
tracing/probes: Fix to avoid double count of the string length on the array
fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
The nanosecond-to-millisecond skew computation uses unsigned arithmetic,
which produces user-unfriendly large positive numbers for negative skews.
Therefore, use signed arithmetic for this computation in order to preserve
the negativity.
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Fixes: dd02926994 ("clocksource: Improve "skew is too large" messages")
Reviewed-by: Feng Tang <feng.tang@intel.com>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently shuffling sets the same cpu affinities for all tasks,
which makes us less likely to hit paths involving migrating
blocked tasks onto a cpu where they can't run.
This patch adds an element of randomness to allow affinities of
different writer tasks to diverge.
This has helped uncover issues in testing with Proxy Execution
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rtort_pipe_count WARN() indicates that grace periods were unable
to invoke all callbacks during a stutter_wait() interval. But it is
sometimes helpful to have a bit more information as to why. This commit
therefore invokes show_rcu_gp_kthreads() immediately before that WARN()
in order to dump out some relevant information.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The scftorture test can quickly execute a large number of calls to no-wait
smp_call_function(), each of which holds a block of memory until the
corresponding handler is invoked. Especially when the longwait module
parameter is specified, this can chew up an arbitrarily large amount
of memory. This commit therefore blocks after each memory-allocation
failure, with the duration a function of longwait.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kernels built with CONFIG_KASAN=y quarantine newly freed memory in order
to better detect use-after-free errors. However, this can exhaust memory
more quickly in allocator-heavy tests, which can result in spurious
scftorture failure. This commit therefore forgives memory-allocation
failure in kernels built with CONFIG_KASAN=y, but continues counting
the errors for use in detailed test-result analyses.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Both the CONFIG_TASKS_RCU and CONFIG_TASKS_RUDE_RCU options
are broken when RCU_TINY is enabled as well, as some functions
are missing a declaration.
In file included from kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1271:21: error: no previous prototype for 'get_rcu_tasks_rude_gp_kthread' [-Werror=missing-prototypes]
1271 | struct task_struct *get_rcu_tasks_rude_gp_kthread(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/rcu/rcuscale.c:330:27: error: 'get_rcu_tasks_rude_gp_kthread' undeclared here (not in a function); did you mean 'get_rcu_tasks_trace_gp_kthread'?
330 | .rso_gp_kthread = get_rcu_tasks_rude_gp_kthread,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| get_rcu_tasks_trace_gp_kthread
In file included from /home/arnd/arm-soc/kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1113:21: error: no previous prototype for 'get_rcu_tasks_gp_kthread' [-Werror=missing-prototypes]
1113 | struct task_struct *get_rcu_tasks_gp_kthread(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~
Also, building with CONFIG_TASKS_RUDE_RCU but not CONFIG_TASKS_RCU is
broken because of some missing stub functions:
kernel/rcu/rcuscale.c:322:27: error: 'tasks_scale_read_lock' undeclared here (not in a function); did you mean 'srcu_scale_read_lock'?
322 | .readlock = tasks_scale_read_lock,
| ^~~~~~~~~~~~~~~~~~~~~
| srcu_scale_read_lock
kernel/rcu/rcuscale.c:323:27: error: 'tasks_scale_read_unlock' undeclared here (not in a function); did you mean 'srcu_scale_read_unlock'?
323 | .readunlock = tasks_scale_read_unlock,
| ^~~~~~~~~~~~~~~~~~~~~~~
| srcu_scale_read_unlock
Move the declarations outside of the RCU_TINY #ifdef and duplicate the
shared stub functions to address all of the above.
Fixes: 88d7ff818f0ce ("rcuscale: Add RCU Tasks Rude testing")
Fixes: 755f1c5eb416b ("rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit causes RCU Tasks Trace to output the CPU time consumed by
its grace-period kthread. The CPU time is whatever is in the designated
task's current->stime field, and thus is controlled by whatever CPU-time
accounting scheme is in effect.
This output appears in microseconds as follows on the console:
rcu_scale: Grace-period kthread CPU time: 42367.037
[ paulmck: Apply Willy Tarreau feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the ability to output the CPU time consumed by the
grace-period kthread for the RCU variant under test. The CPU time is
whatever is in the designated task's current->stime field, and thus is
controlled by whatever CPU-time accounting scheme is in effect.
This output appears in microseconds as follows on the console:
rcu_scale: Grace-period kthread CPU time: 42367.037
[ paulmck: Apply feedback from Stephen Rothwell and kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
By default, rcuscale collects only 100 points of data per writer, but
arranging for all kthreads to be actively collecting (if not recording)
data during the time that any kthread might be recording. This works
well, but does not allow much time to bring external performance tools
to bear. This commit therefore adds a minruntime module parameter
that specifies a minimum data-collection interval in seconds.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Some workloads do isolated RCU work, and this can affect operation
latencies. This commit therefore adds a writer_holdoff_jiffies module
parameter that causes writers to block for the specified number of
jiffies between each pair of consecutive write-side operations.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a "jiffies" test to refscale, allowing use of jiffies
to be compared to ktime_get_real_fast_ns(). On my x86 laptop, jiffies
is more than 20x faster. (Though for many uses, the tens-of-nanoseconds
overhead of ktime_get_real_fast_ns() will be just fine.)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Running the refscale test occasionally crashes the kernel with the
following error:
[ 8569.952896] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 8569.952900] #PF: supervisor read access in kernel mode
[ 8569.952902] #PF: error_code(0x0000) - not-present page
[ 8569.952904] PGD c4b048067 P4D c4b049067 PUD c4b04b067 PMD 0
[ 8569.952910] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 8569.952916] Hardware name: Dell Inc. PowerEdge R750/0WMWCR, BIOS 1.2.4 05/28/2021
[ 8569.952917] RIP: 0010:prepare_to_wait_event+0x101/0x190
:
[ 8569.952940] Call Trace:
[ 8569.952941] <TASK>
[ 8569.952944] ref_scale_reader+0x380/0x4a0 [refscale]
[ 8569.952959] kthread+0x10e/0x130
[ 8569.952966] ret_from_fork+0x1f/0x30
[ 8569.952973] </TASK>
The likely cause is that init_waitqueue_head() is called after the call to
the torture_create_kthread() function that creates the ref_scale_reader
kthread. Although this init_waitqueue_head() call will very likely
complete before this kthread is created and starts running, it is
possible that the calling kthread will be delayed between the calls to
torture_create_kthread() and init_waitqueue_head(). In this case, the
new kthread will use the waitqueue head before it is properly initialized,
which is not good for the kernel's health and well-being.
The above crash happened here:
static inline void __add_wait_queue(...)
{
:
if (!(wq->flags & WQ_FLAG_PRIORITY)) <=== Crash here
The offset of flags from list_head entry in wait_queue_entry is
-0x18. If reader_tasks[i].wq.head.next is NULL as allocated reader_task
structure is zero initialized, the instruction will try to access address
0xffffffffffffffe8, which is exactly the fault address listed above.
This commit therefore invokes init_waitqueue_head() before creating
the kthread.
Fixes: 653ed64b01 ("refperf: Add a test to measure performance of read-side synchronization")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The various RCU Tasks flavors now do lazy grace periods when there are
only asynchronous grace period requests. By default, the system will let
250 milliseconds elapse after the first call_rcu_tasks*() callbacki is
queued before starting a grace period. In contrast, synchronous grace
period requests such as synchronize_rcu_tasks*() will start a grace
period immediately.
However, invoking one of the call_rcu_tasks*() functions in a too-tight
loop can result in a callback flood, which in turn can exhaust memory
if grace periods are delayed for too long.
This commit therefore sets a limit so that the grace-period kthread
will be awakened when any CPU's callback list expands to contain
rcupdate.rcu_task_lazy_lim callbacks elements (defaulting to 32, set to -1
to disable), the grace-period kthread will be awakened, thus cancelling
any ongoing laziness and getting out in front of the potential callback
flood.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds kernel boot parameters for callback laziness, allowing
the RCU Tasks flavors to be individually adjusted.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The performance requirements on RCU Tasks, and in particular on RCU
Tasks Trace, have evolved over time as the workloads have evolved.
The current implementation is designed to provide low grace-period
latencies, and also to accommodate short-duration floods of callbacks.
However, current workloads can also provide a constant background
callback-queuing rate of a few hundred call_rcu_tasks_trace() invocations
per second. This results in continuous back-to-back RCU Tasks Trace
grace periods, which in turn can consume the better part of 10% of a CPU.
One could take the attitude that there are several tens of other CPUs on
the systems running such workloads, but energy efficiency is a thing.
On these systems, although asynchronous grace-period requests happen
every few milliseconds, synchronous grace-period requests are quite rare.
This commit therefore arrranges for grace periods to be initiated
immediately in response to calls to synchronize_rcu_tasks*() and
also to calls to synchronize_rcu_mult() that are passed one of the
call_rcu_tasks*() functions. These are recognized by the tell-tale
wakeme_after_rcu callback function.
In other cases, callbacks are gathered up for up to about 250 milliseconds
before a grace period is initiated. This results in more than an order of
magnitude reduction in RCU Tasks Trace grace periods, with corresponding
reduction in consumption of CPU time.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add kernel-doc for all APIs that do not already have it.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Acked-by: John Stultz <jstultz@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230704052405.5089-3-rdunlap@infradead.org
- Unbreak the /sys/power/resume interface after recent changes (Azat
Khuzhin).
- Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
Yang).
- Remove __init from cpufreq callbacks in the sparc driver, because
they may be called after initialization too (Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmSxhIcSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxfdQP/1+MxGLh2XVVKvB9QTI0/ofYPxPfoTuf
5Sf2lOa7rNeauc2xnQFM6EMOeme4ckTUbrO7AkZVACYVqbKJ92IUBJfo3R2Ar+1Z
9TogwG+YOX3OjR8QtiGHLwA5fvOgbt89JaDn2ZCWcS+gHARZ3VMgdbDt+C3MUldV
UVr/5kAkWefv39PIYHCwAJU7Xhn97B5nW58KgpkHuxOcHDKe0VfdxLcKBsyoc6QE
IGMVV2WtnoyEdM1aNfZ37+3NksiIdZMA6OvM5C/7HOs6IqJaFxVUxm4333sM5AW9
y5LPYSBbedVxICdLkUUq8W5MDRNCTPXgC3gexEu0XtWdAV9AG+9aNeZytT7KGrLO
xe4vbl6s1LnxC6YBw2bB+U/DbLtxQrAP1nYZj6yxhbHVsnTTZg7Qvevz7nAdPlOL
3FsutIT+9OQprWXxYBRv3AumF+hpG1bm8Zutyaqb2vwRwMbbXWTpzRry4ydp6bTj
VB2YWeQOxCKl+dC5jXM1wfPjbsqWQvvGZVh1VIzlcZgzqALWf+F5fTMUKuY963kd
V6fR1YmKg+Xxb+BU9mEjaUMfH5Yr8Mv9Gpf7D87MTsNTluFjAmRWk5a0ahVAXcwe
n1jFxBNUsuJuvz2KwWrVZExT2xqJ8kVfnOxNcevtaXK+uk8+jE94lwjJn0P9v8w8
e+0QABkgUYY2
=b/om
-----END PGP SIGNATURE-----
Merge tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix hibernation (after recent changes), frequency QoS and the
sparc cpufreq driver.
Specifics:
- Unbreak the /sys/power/resume interface after recent changes (Azat
Khuzhin).
- Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
Yang).
- Remove __init from cpufreq callbacks in the sparc driver, because
they may be called after initialization too (Viresh Kumar)"
* tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: sparc: Don't mark cpufreq callbacks with __init
PM: QoS: Restore support for default value on frequency QoS
PM: hibernate: Fix writing maj:min to /sys/power/resume
Merge a PM QoS fix and a hibernation fix for 6.5-rc2.
- Unbreak the /sys/power/resume interface after recent changes (Azat
Khuzhin).
- Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
Yang).
* pm-sleep:
PM: hibernate: Fix writing maj:min to /sys/power/resume
* pm-qos:
PM: QoS: Restore support for default value on frequency QoS
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails
to get the string data.
Currently those expect that the data_loc is updated by store_trace_args() if
it returns the error code. However, that does not work correctly if the
argument is an array of strings. In that case, store_trace_args() only clears
the first entry of the array (which may have no error) and leaves other
entries. So it should be cleared by fetch_store_string*() itself.
Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated
only if it is used (ret > 0 and argument is a dynamic data.)
Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/
Fixes: 40b53b7718 ("tracing: probeevent: Add array type support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSwqwoACgkQ6rmadz2v
bTqOHRAAn+fzTLqUqsveFQcxOkie5MPHxKoOTjG4+yFR7rzPkU6Mn5RX3w5yFzSn
RqutwykF9OgipAzC3QXv4pRJuq6Gia5nvwUSDP4CX273ljyeF54DK7HfopE1+YrK
HXyBWZvVvMZP6q7qQyQ3qtbHZSjs5XP/M6YBlJ5zo/BTLFCyvbSDP14YKEqcBkWG
ld72ElXFxlnr/zEfRjzBCfMlbmgeHLO0SiHS/9827zEmNP1AAH5/ETA7/rJ7yCJs
QNQUIoJWob8xm5FMJ6CU/+sOqXR1CY053meGJFFBX5pvVD/CLRhrwHn0IMCyQqmh
wKR5waeXhpl/CKNeFuxXVMNFiXbqBb/0LYJaJtrMysjMLTsQ9X7NkrDBa/+kYGyZ
+ghGlaMQvPqUGg0rLH2nl9JNB8Ne/8prLMsAKUWnPuOo+Q03j054gnqhGeNtDd5b
gpSk+7x93PlhGcegBV1Wk8dkiGC5V9nTVNxg40XQUCs4k9L/8Vjc35Tjqx7nBTNH
DiFD24DDKUZacw9L6nEqvLF/N2fiRjtUZnVPC0yn/annyBcfX1s+ZH2Tu1F6Qk38
QMfBCnt12exmsiDoxdzzGJtjHnS/k5fsaKjlR21mOyMrIH7ipltr5UHHrdr1hBP6
24uSeTImvQQKDi+9IuXN127jZDOupKqVS6csrA0ZXrlKWh2HR+U=
=GVUB
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-13
We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).
The main changes are:
1) Fix bpftool build in presence of stale vmlinux.h,
from Alexander Lobakin.
2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
from Alexei Starovoitov.
3) Teach verifier actual bounds of bpf_get_smp_processor_id()
and fix perf+libbpf issue related to custom section handling,
from Andrii Nakryiko.
4) Introduce bpf map element count, from Anton Protopopov.
5) Check skb ownership against full socket, from Kui-Feng Lee.
6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.
7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.
8) Fix BTF walking of unions, from Yafang Shao.
9) Extend link_info for kprobe_multi and perf_event links,
from Yafang Shao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
selftests/bpf: Add selftest for PTR_UNTRUSTED
bpf: Fix an error in verifying a field in a union
selftests/bpf: Add selftests for nested_trust
bpf: Fix an error around PTR_UNTRUSTED
selftests/bpf: add testcase for TRACING with 6+ arguments
bpf, x86: allow function arguments up to 12 for TRACING
bpf, x86: save/restore regs with BPF_DW size
bpftool: Use "fallthrough;" keyword instead of comments
bpf: Add object leak check.
bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
selftests/bpf: Improve test coverage of bpf_mem_alloc.
rcu: Export rcu_request_urgent_qs_task()
bpf: Allow reuse from waiting_for_gp_ttrace list.
bpf: Add a hint to allocated objects.
bpf: Change bpf_mem_cache draining process.
bpf: Further refactor alloc_bulk().
bpf: Factor out inc/dec of active flag into helpers.
bpf: Refactor alloc_bulk().
bpf: Let free_all() return the number of freed elements.
...
====================
Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We are utilizing BPF LSM to monitor BPF operations within our container
environment. When we add support for raw_tracepoint, it hits below
error.
; (const void *)attr->raw_tracepoint.name);
27: (79) r3 = *(u64 *)(r2 +0)
access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8
It can be reproduced with below BPF prog.
SEC("lsm/bpf")
int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
{
switch (cmd) {
case BPF_RAW_TRACEPOINT_OPEN:
bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
break;
default:
break;
}
return 0;
}
The reason is that when accessing a field in a union, such as bpf_attr,
if the field is located within a nested struct that is not the first
member of the union, it can result in incorrect field verification.
union bpf_attr {
struct {
__u32 map_type; <<<< Actually it will find that field.
__u32 key_size;
__u32 value_size;
...
};
...
struct {
__u64 name; <<<< We want to verify this field.
__u32 prog_fd;
} raw_tracepoint;
};
Considering the potential deep nesting levels, finding a perfect
solution to address this issue has proven challenging. Therefore, I
propose a solution where we simply skip the verification process if the
field in question is located within a union.
Fixes: 7e3617a72d ("bpf: Add array support to btf_struct_access")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been
cleared when we start to walk a new struct, because the struct in
question may be a struct nested in a union. We should also check and set
this flag before we walk its each member, in case itself is a union.
We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL.
Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix some missing-prototype warnings
- Fix user events struct args (did not include size of struct)
When creating a user event, the "struct" keyword is to denote
that the size of the field will be passed in. But the parsing
failed to handle this case.
- Add selftest to struct sizes for user events
- Fix sample code for direct trampolines.
The sample code for direct trampolines attached to handle_mm_fault().
But the prototype changed and the direct trampoline sample code
was not updated. Direct trampolines needs to have the arguments correct
otherwise it can fail or crash the system.
- Remove unused ftrace_regs_caller_ret() prototype.
- Quiet false positive of FORTIFY_SOURCE
Due to backward compatibility, the structure used to save stack traces
in the kernel had a fixed size of 8. This structure is exported to
user space via the tracing format file. A change was made to allow
more than 8 functions to be recorded, and user space now uses the
size field to know how many functions are actually in the stack.
But the structure still has size of 8 (even though it points into
the ring buffer that has the required amount allocated to hold a
full stack. This was fine until the fortifier noticed that the
memcpy(&entry->caller, stack, size) was greater than the 8 functions
and would complain at runtime about it. Hide this by using a pointer
to the stack location on the ring buffer instead of using the address
of the entry structure caller field.
- Fix a deadloop in reading trace_pipe that was caused by a mismatch
between ring_buffer_empty() returning false which then asked to
read the data, but the read code uses rb_num_of_entries() that
returned zero, and causing a infinite "retry".
- Fix a warning caused by not using all pages allocated to store
ftrace functions, where this can happen if the linker inserts a bunch of
"NULL" entries, causing the accounting of how many pages needed
to be off.
- Fix histogram synthetic event crashing when the start event is
removed and the end event is still using a variable from it.
- Fix memory leak in freeing iter->temp in tracing_release_pipe()
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZLBF6hQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qkswAP4mhdoFFfNosM7+Sh/R4t31IxKZApm9
M2Hf9jgvJ7b65AD/VV1XfO6skw2+5Yn9S4UyNE2MQaYxPwWpONcNFUzZ3Q8=
=Nb+7
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix some missing-prototype warnings
- Fix user events struct args (did not include size of struct)
When creating a user event, the "struct" keyword is to denote that
the size of the field will be passed in. But the parsing failed to
handle this case.
- Add selftest to struct sizes for user events
- Fix sample code for direct trampolines.
The sample code for direct trampolines attached to handle_mm_fault().
But the prototype changed and the direct trampoline sample code was
not updated. Direct trampolines needs to have the arguments correct
otherwise it can fail or crash the system.
- Remove unused ftrace_regs_caller_ret() prototype.
- Quiet false positive of FORTIFY_SOURCE
Due to backward compatibility, the structure used to save stack
traces in the kernel had a fixed size of 8. This structure is
exported to user space via the tracing format file. A change was made
to allow more than 8 functions to be recorded, and user space now
uses the size field to know how many functions are actually in the
stack.
But the structure still has size of 8 (even though it points into the
ring buffer that has the required amount allocated to hold a full
stack.
This was fine until the fortifier noticed that the
memcpy(&entry->caller, stack, size) was greater than the 8 functions
and would complain at runtime about it.
Hide this by using a pointer to the stack location on the ring buffer
instead of using the address of the entry structure caller field.
- Fix a deadloop in reading trace_pipe that was caused by a mismatch
between ring_buffer_empty() returning false which then asked to read
the data, but the read code uses rb_num_of_entries() that returned
zero, and causing a infinite "retry".
- Fix a warning caused by not using all pages allocated to store ftrace
functions, where this can happen if the linker inserts a bunch of
"NULL" entries, causing the accounting of how many pages needed to be
off.
- Fix histogram synthetic event crashing when the start event is
removed and the end event is still using a variable from it
- Fix memory leak in freeing iter->temp in tracing_release_pipe()
* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix memory leak of iter->temp when reading trace_pipe
tracing/histograms: Add histograms to hist_vars if they have referenced variables
tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
ring-buffer: Fix deadloop issue on reading trace_pipe
tracing: arm64: Avoid missing-prototype warnings
selftests/user_events: Test struct size match cases
tracing/user_events: Fix struct arg size match check
x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
arm64: ftrace: Add direct call trampoline samples support
samples: ftrace: Save required argument registers in sample trampolines
This reverts commit 2e9906f84f.
It was turned out that commit 2e9906f84f ("tracing: Add "(fault)"
name injection to kernel probes") did not work correctly and probe
events still show just '(fault)' (instead of '"(fault)"'). Also,
current '(fault)' is more explicit that it faulted.
This also moves FAULT_STRING macro to trace.h so that synthetic
event can keep using it, and uses it in trace_probe.c too.
Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/
Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
select_idle_capacity() not only looks for an idle cpu that fits for the
waking task but also for cpu with highest bandwidth when no cpu fits.
Start the loop with target cpu so it will be selected 1st when no cpu fits
but several cpus shared the same bandwidth. Starting with target cpu
prevents the task to migrate between cpus with same bandwidth at every
wakeup when no cpu fits.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230711081359.868862-1-vincent.guittot@linaro.org
There have been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain were not set and propagated properly when a degenerate
domain is removed.
Add dump of domain sched group flags of a CPU to make debug easier
in the future.
Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
cpus) starting from lower numbered CPUs looking for the first idle CPU.
In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
non-SMT cores:
[0, 1] [2, 3] [4, 5] 6 7 8 9
b i b i b i b i i i
In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.
We should let a CPU on a fully idle core get the first chance to idle
load balance as it has more CPU capacity than a CPU on an idle SMT
CPU with busy sibling. So for the figure above, if we are running
should_we_balance() to CPU 1, we should return false to let CPU 7 on
idle core to have a chance first to idle load balance.
A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered, SMT cores should not
be considered when finding the first idle CPU.
However, in should_we_balance(), when we encounter idle SMT first in partially
busy core, we prematurely break the search for the first idle CPU.
Higher-numbered, non-SMT cores is not given the chance to have
idle balance done on their behalf. Those CPUs will only be considered
for idle balancing by chance via CPU_NEWLY_IDLE.
Instead, consider the idle state of the whole SMT core.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com
In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the tasks to be moved is simply the difference in
number of tasks between the two groups (i.e. imbalance) divided by two.
However, we may have different number of cores between the cluster groups,
say when we take CPU offline or we have hybrid groups. In that case,
we should balance between the two groups such that #tasks/#cores ratio
is the same between the same between both groups. Hence the imbalance
computed will need to reflect this.
Adjust the sibling imbalance computation to take into account of the
above considerations.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com
When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the
number of cores in each domain. In preparation of implementing such a
policy, record the number of cores in a scheduling group.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
On hybrid CPUs with scheduling cluster enabled, we will need to
consider balancing between SMT CPU cluster, and Atom core cluster.
Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
Each scheduling cluster span a L2 cache.
--L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
[0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
Big Big Big Big Atom Atom
core core core core Module Module
If the busiest group is a big core with both SMT CPUs busy, we should
active load balance if destination group has idle CPU cores. Such
condition is considered by asym_active_balance() in load balancing but not
considered when looking for busiest group and computing load imbalance.
Add this consideration in find_busiest_group() and calculate_imbalance().
In addition, update the logic determining the busier group when one group
is SMT and the other group is non SMT but both groups are partially busy
with idle CPU. The busier group should be the group with idle cores rather
than the group with one busy SMT CPU. We do not want to make the SMT group
the busiest one to pull the only task off SMT CPU and causing the whole core to
go empty.
Otherwise suppose in the search for the busiest group, we first encounter
an SMT group with 1 task and set it as the busiest. The destination
group is an atom cluster with 1 task and we next encounter an atom
cluster group with 3 tasks, we will not pick this atom cluster over the
SMT group, even though we should. As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local atom cluster
(with 1 task). And it doesn't make sense to pick the 1 task SMT group
as the busier group as we also should not pull task off the SMT towards
the 1 task atom cluster and make the SMT core completely empty.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com
The static key psi_cgroups_enabled is only used inside file psi.c.
Make it static.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20230525103428.49712-1-linmiaohe@huawei.com
As core scheduling introduced, a new state of idle is defined as
force idle, running idle task but nr_running greater than zero.
If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judge the RCU boost kthread is starving.
But this will cause error in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, results
that idle_cpu() returns 0. In function tick_irq_exit(), if idle_cpu()
is 0, tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1, which became 0 in tick_nohz_irq_enter().
ts->idle_sleeptime won't update in function update_ts_time_stats(), if
ts->idle_active is 0, which should be 1. And this bug will result that
ts->idle_sleeptime is less than the actual value, and finally will
result that the idle time in /proc/stat is less than the actual value.
To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audit all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().
v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com