Commit Graph

43026 Commits

Author SHA1 Message Date
Mikhail Kobuk 2cf0dee989 tracing: Remove extra space at the end of hwlat_detector/mode
Space is printed after each mode value including the last one:
$ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
"none [round-robin] per-cpu "

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:00:00 -04:00
Linus Torvalds 34232fcfe9 Tracing updates for 6.6:
User visible changes:
 
   - Added a way to easier filter with cpumasks:
      # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter
 
   - Show actual size of ring buffer after modifying the ring buffer size via
     buffer_size_kb. Currently it just returns what was written, but the actual
     size rounds up to the sub buffer size. Show that real size instead.
 
  Major changes:
 
   - Added "eventfs". This is the code that handles the inodes and dentries of
     tracefs/events directory. As there are thousands of events, and each event
     has several inodes and dentries that currently exist even when tracing is
     never used, they take up precious memory. Instead, eventfs will allocate
     the inodes and dentries in a JIT way (similar to what procfs does). There
     is now metadata that handles the events and subdirectories, and will create
     the inodes and dentries when they are used.
 
     Note, I also have patches that remove the subdirectory meta data, but will
     wait till the next merge window before applying them. It's a little more
     complex, and I want to make sure the dynamic code works properly before
     adding more complexity, making it easier to revert if need be.
 
  Minor changes:
 
   - Optimization to user event list traversal.
 
   - Remove intermediate permission of tracefs files (note the intermediate
     permission removes all access to the files so it is not a security concern,
     but just a clean up.)
 
   - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic.
 
   - Other minor clean ups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEXtmkj8VMCiLR0IBM68Js21pW3nMFAmTwtAsUHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQ68Js21pW3nNOXRAAsslQT6alY4OeplC4x47+V6+6NiIA
 oDtOmWAqf7TsH9bukzRFD36rUly42O20RJDx9z0Q3iRc3vGxEawId8z6P0HmBwRb
 VSl5BryWvL5Wc5w94xS8EeCuC1MRfhVDyfbtVFmWigzfvd/f+hp71ViMPHUvrRJX
 KhzzNSBc4ir5E1lzfwa7meYTXzDwrQlZbYfdf5aH94IWAkqDj85PUZDJ7UmLZhXG
 CIglSpNFXZ0j19Wo/U6KZlHR1XfunBKungCzJ5Dbznc9YLWZTQXOIZF4YPKfPIJL
 ulRG9chwXY0nQWhG3xM1UHZLsAMSWw5i13a4ZN4d8FCNOgv8ttcJnfDk7ZYUS0Oz
 RmY1dGcSRKAZTUTjm8ZBtmyiUCc9kZAIk0fyEfIHtoDYXmhnvni3wuTnbRSdXaSi
 q4YkxPaLfX8Fn3QloCqqddt8iONu7BnbpZOhUCl2AtBib52gnTTF7+rQ6/0D3rjo
 SSuvEHhnjJhzk+3jM2odxjmTAztNT+yu6FbKXZUKPt1Kj9YHv1J9cEQw9/Etw+GV
 8jQBe979D8hFJmDOJOT/O/TdPqE9mQoMNBt6Y8QnE4nbJWM+i/MBrThFpUSQhRCr
 0Ya/HgR2QyRH7RmZW5o2H9mNtN+V9c7RxZW8erYzRbUs0YofK2OpGi9SrPzxWCke
 w6j0VVZHaxdPguM=
 =/s+e
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:
 "User visible changes:

   - Added a way to easier filter with cpumasks:

       # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter

   - Show actual size of ring buffer after modifying the ring buffer
     size via buffer_size_kb.

     Currently it just returns what was written, but the actual size
     rounds up to the sub buffer size. Show that real size instead.

  Major changes:

   - Added "eventfs". This is the code that handles the inodes and
     dentries of tracefs/events directory. As there are thousands of
     events, and each event has several inodes and dentries that
     currently exist even when tracing is never used, they take up
     precious memory. Instead, eventfs will allocate the inodes and
     dentries in a JIT way (similar to what procfs does). There is now
     metadata that handles the events and subdirectories, and will
     create the inodes and dentries when they are used.

     Note, I also have patches that remove the subdirectory meta data,
     but will wait till the next merge window before applying them. It's
     a little more complex, and I want to make sure the dynamic code
     works properly before adding more complexity, making it easier to
     revert if need be.

  Minor changes:

   - Optimization to user event list traversal

   - Remove intermediate permission of tracefs files (note the
     intermediate permission removes all access to the files so it is
     not a security concern, but just a clean up)

   - Add the complex fix to FORTIFY_SOURCE to the kernel stack event
     logic

   - Other minor cleanups"

* tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
  tracefs: Remove kerneldoc from struct eventfs_file
  tracefs: Avoid changing i_mode to a temp value
  tracing/user_events: Optimize safe list traversals
  ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon()
  tracing: Remove unused function declarations
  tracing/filters: Document cpumask filtering
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Dynamically allocate filter_pred.regex
  test: ftrace: Fix kprobe test for eventfs
  eventfs: Move tracing/events to eventfs
  eventfs: Implement removal of meta data from eventfs
  eventfs: Implement functions to create files and dirs when accessed
  eventfs: Implement eventfs lookup, read, open functions
  eventfs: Implement eventfs file add functions
  ...
2023-09-01 16:34:25 -07:00
Linus Torvalds bd30fe6a7d workqueue: Changes for v6.6
* Unbound workqueues now support more flexible affinity scopes. The default
   behavior is to soft-affine according to last level cache boundaries. A
   work item queued from a given LLC is executed by a worker running on the
   same LLC but the worker may be moved across cache boundaries as the
   scheduler sees fit. On machines which multiple L3 caches, which are
   becoming more popular along with chiplet designs, this improves cache
   locality while not harming work conservation too much.
 
   Unbound workqueues are now also a lot more flexible in terms of execution
   affinity. Differeing levels of affinity scopes are supported and both the
   default and per-workqueue affinity settings can be modified dynamically.
   This should help working around amny of sub-optimal behaviors observed
   recently with asymmetric ARM CPUs.
 
   This involved signficant restructuring of workqueue code. Nothing was
   reported yet but there's some risk of subtle regressions. Should keep an
   eye out.
 
 * Rescuer workers now has more identifiable comms.
 
 * workqueue.unbound_cpus added so that CPUs which can be used by workqueue
   can be constrained early during boot.
 
 * Now that all the in-tree users have been flushed out, trigger warning if
   system-wide workqueues are flushed.
 
 * One pull commit from for-6.5-fixes to avoid cascading conflicts in the
   affinity scope patchset.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZPERlQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGVqQAPwIOy9tWY5jFAmMuIyH6wV50hbmfxCc2n5xhQNr
 5HoyGgEA8lw1W7afDCIPiQVA7AYsu8dhwuNSOcRCJxhrrn4XsA0=
 =g/Uu
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - Unbound workqueues now support more flexible affinity scopes.

   The default behavior is to soft-affine according to last level cache
   boundaries. A work item queued from a given LLC is executed by a
   worker running on the same LLC but the worker may be moved across
   cache boundaries as the scheduler sees fit. On machines which
   multiple L3 caches, which are becoming more popular along with
   chiplet designs, this improves cache locality while not harming work
   conservation too much.

   Unbound workqueues are now also a lot more flexible in terms of
   execution affinity. Differeing levels of affinity scopes are
   supported and both the default and per-workqueue affinity settings
   can be modified dynamically. This should help working around amny of
   sub-optimal behaviors observed recently with asymmetric ARM CPUs.

   This involved signficant restructuring of workqueue code. Nothing was
   reported yet but there's some risk of subtle regressions. Should keep
   an eye out.

 - Rescuer workers now has more identifiable comms.

 - workqueue.unbound_cpus added so that CPUs which can be used by
   workqueue can be constrained early during boot.

 - Now that all the in-tree users have been flushed out, trigger warning
   if system-wide workqueues are flushed.

* tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (31 commits)
  workqueue: fix data race with the pwq->stats[] increment
  workqueue: Rename rescuer kworker
  workqueue: Make default affinity_scope dynamically updatable
  workqueue: Add "Affinity Scopes and Performance" section to documentation
  workqueue: Implement non-strict affinity scope for unbound workqueues
  workqueue: Add workqueue_attrs->__pod_cpumask
  workqueue: Factor out need_more_worker() check and worker wake-up
  workqueue: Factor out work to worker assignment and collision handling
  workqueue: Add multiple affinity scopes and interface to select them
  workqueue: Modularize wq_pod_type initialization
  workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
  workqueue: Generalize unbound CPU pods
  workqueue: Factor out clearing of workqueue-only attrs fields
  workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
  workqueue: Initialize unbound CPU pods later in the boot
  workqueue: Move wq_pod_init() below workqueue_init()
  workqueue: Rename NUMA related names to use pod instead
  workqueue: Rename workqueue_attrs->no_numa to ->ordered
  workqueue: Make unbound workqueues to use per-cpu pool_workqueues
  workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
  ...
2023-09-01 16:06:32 -07:00
Linus Torvalds 7716f383a5 cgroup: Changes for v6.6
* Per-cpu cpu usage stats are now tracked. This currently isn't printed out
   in the cgroupfs interface and can only be accessed through e.g. BPF.
   Should decide on a not-too-ugly way to show per-cpu stats in cgroupfs.
 
 * cpuset received some cleanups and prepatory patches for the pending
   cpus.exclusive patchset which will allow cpuset partitions to be created
   below non-partition parents, which should ease the management of partition
   cpusets.
 
 * A lot of code and documentation cleanup patches.
 
 * tools/testing/selftests/cgroup/test_cpuset.c is added. This causes trivial
   conflicts in .gitignore and Makefile under the directory against
   fe3b1bf19b ("selftests: cgroup: add test_zswap program"). They can be
   resolved by keeping lines from both branches.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZPENTg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGcyBAP44cHwpSFxXe3cehxAzb1l/2BZXtzU5l48OqUQd
 MwHyrwEAm7+MTVAR2xOF4f+oVM9KWmKj7oV7Clpixl1S7hHyjwE=
 =FCc9
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Per-cpu cpu usage stats are now tracked

   This currently isn't printed out in the cgroupfs interface and can
   only be accessed through e.g. BPF. Should decide on a not-too-ugly
   way to show per-cpu stats in cgroupfs

 - cpuset received some cleanups and prepatory patches for the pending
   cpus.exclusive patchset which will allow cpuset partitions to be
   created below non-partition parents, which should ease the management
   of partition cpusets

 - A lot of code and documentation cleanup patches

 - tools/testing/selftests/cgroup/test_cpuset.c added

* tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits)
  cgroup: Avoid -Wstringop-overflow warnings
  cgroup:namespace: Remove unused cgroup_namespaces_init()
  cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
  cgroup: clean up if condition in cgroup_pidlist_start()
  cgroup: fix obsolete function name in cgroup_destroy_locked()
  Documentation: cgroup-v2.rst: Correct number of stats entries
  cgroup: fix obsolete function name above css_free_rwork_fn()
  cgroup/cpuset: fix kernel-doc
  cgroup: clean up printk()
  cgroup: fix obsolete comment above cgroup_create()
  docs: cgroup-v1: fix typo
  docs: cgroup-v1: correct the term of Page Cache organization in inode
  cgroup/misc: Store atomic64_t reads to u64
  cgroup/misc: Change counters to be explicit 64bit types
  cgroup/misc: update struct members descriptions
  cgroup: remove cgrp->kn check in css_populate_dir()
  cgroup: fix obsolete function name
  cgroup: use cached local variable parent in for loop
  cgroup: remove obsolete comment above struct cgroupstats
  cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED
  ...
2023-09-01 15:58:21 -07:00
Linus Torvalds e987af4546 percpu: changes for v6.6
percpu
 * A couple cleanups by Baoquan He and Bibo Mao. The only behavior change
   is to start printing messages if we're under the warn limit for failed
   atomic allocations.
 
 percpu_counter
 * Shakeel introduced percpu counters into mm_struct which caused percpu
   allocations be on the hot path [1]. Originally I spent some time
   trying to improve the percpu allocator, but instead preferred what
   Mateusz Guzik proposed grouping at the allocation site,
   percpu_counter_init_many(). This allows a single percpu allocation to
   be shared by the counters. I like this approach because it creates a
   shared lifetime by the allocations. Additionally, I believe many inits
   have higher level synchronization requirements, like percpu_counter
   does against HOTPLUG_CPU. Therefore we can group these optimizations
   together.
 
 [1] https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3hZPHJdcVwe+yTTtiDc0yuoFPR0FAmTv2IUACgkQiDc0yuoF
 PR0+gg//U430Y9jRSKQtbh3dEPaAeWGcTfSTnVHbQGfBj3A4ePJyWl/Tgzri31AC
 rzr8SRs0yX8b82TbECWsV67i/GrntLJyz4yQ52S/RRqVwnQqSn/wicEdCY00lJBt
 Tye8zApOnYBouaYqIOxm/M7ofvKzJ3gWOVeF/zBwM6hwvNaXXtY5r86fSDxoEbhY
 HOFnCDmg5Spf0U50j1G7nV5KfAb7BNA3/HFyzfzH+w+OWi4IGbThsfrg1qvjyFot
 KlEK/kF8Af2xj2A2se4XFsLc2D/Tj+29juYVQqIPBJzVPrZ2uerKSszK5Zcr+Use
 kMiG7tRWKE+2vkOM1RQ5Y5NCVEBhlXlienz1gf/C7247SEGs6OIyqvyDAgPTRx6p
 oR2/vx9hMtaSMf4aHWd+fYS5gNZ05iMvOIbRZnI1wZkQglQVkJvXhzuLaJ+dIGSP
 ypv6XOepik7vDjZ3p3xJXd0TAn4NSkn3jWRetrymdtMFanF99qw1VqjmkLecSil0
 Gr0UhRL1oiMde6niVJrOpdOGLwt/M4N99Y5rksw6NCnktRJ99coFGj7LglZGMsu+
 YkOyjD8MVJXTkBtBNGeqHTKe6nyVkHFq9ad5EmWjPkefP5JziH8i18k7JlF1dLA5
 c8peq3ES659D5f0mU2jilD9PsCsBfSn6Of4ruMZa2Zr1XDD8snI=
 =vcA1
 -----END PGP SIGNATURE-----

Merge tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

Pull percpu updates from Dennis Zhou:
 "One bigger change to percpu_counter's api allowing for init and
  destroy of multiple counters via percpu_counter_init_many() and
  percpu_counter_destroy_many(). This is used to help begin remediating
  a performance regression with percpu rss stats.

  Additionally, it seems larger core count machines are feeling the
  burden of the single threaded allocation of percpu. Mateusz is
  thinking about it and I will spend some time on it too.

  percpu:

   - A couple cleanups by Baoquan He and Bibo Mao. The only behavior
     change is to start printing messages if we're under the warn limit
     for failed atomic allocations.

  percpu_counter:

   - Shakeel introduced percpu counters into mm_struct which caused
     percpu allocations be on the hot path [1]. Originally I spent some
     time trying to improve the percpu allocator, but instead preferred
     what Mateusz Guzik proposed grouping at the allocation site,
     percpu_counter_init_many(). This allows a single percpu allocation
     to be shared by the counters. I like this approach because it
     creates a shared lifetime by the allocations. Additionally, I
     believe many inits have higher level synchronization requirements,
     like percpu_counter does against HOTPLUG_CPU. Therefore we can
     group these optimizations together"

Link: https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/ [1]

* tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  kernel/fork: group allocation/free of per-cpu counters for mm struct
  pcpcntr: add group allocation/free
  mm/percpu.c: print error message too if atomic alloc failed
  mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bit
  mm/percpu.c: remove redundant check
  mm/percpu: Remove some local variables in pcpu_populate_pte
2023-09-01 15:44:45 -07:00
Linus Torvalds 8e1e49550d TTY/Serial driver changes for 6.6-rc1
Here is the big set of tty and serial driver changes for 6.6-rc1.
 
 Lots of cleanups in here this cycle, and some driver updates.  Short
 summary is:
   - Jiri's continued work to make the tty code and apis be a bit more
     sane with regards to modern kernel coding style and types
   - cpm_uart driver updates
   - n_gsm updates and fixes
   - meson driver updates
   - sc16is7xx driver updates
   - 8250 driver updates for different hardware types
   - qcom-geni driver fixes
   - tegra serial driver change
   - stm32 driver updates
   - synclink_gt driver cleanups
   - tty structure size reduction
 
 All of these have been in linux-next this week with no reported issues.
 The last bit of cleanups from Jiri and the tty structure size reduction
 came in last week, a bit late but as they were just style changes and
 size reductions, I figured they should get into this merge cycle so that
 others can work on top of them with no merge conflicts.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH+jA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykKyACgldt6QeenTN+6dXIHS/eQHtTKZwMAn3arSeXI
 QrUUnLFjOWyoX87tbMBQ
 =LVw0
 -----END PGP SIGNATURE-----

Merge tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver updates from Greg KH:
 "Here is the big set of tty and serial driver changes for 6.6-rc1.

  Lots of cleanups in here this cycle, and some driver updates. Short
  summary is:

   - Jiri's continued work to make the tty code and apis be a bit more
     sane with regards to modern kernel coding style and types

   - cpm_uart driver updates

   - n_gsm updates and fixes

   - meson driver updates

   - sc16is7xx driver updates

   - 8250 driver updates for different hardware types

   - qcom-geni driver fixes

   - tegra serial driver change

   - stm32 driver updates

   - synclink_gt driver cleanups

   - tty structure size reduction

  All of these have been in linux-next this week with no reported
  issues. The last bit of cleanups from Jiri and the tty structure size
  reduction came in last week, a bit late but as they were just style
  changes and size reductions, I figured they should get into this merge
  cycle so that others can work on top of them with no merge conflicts"

* tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits)
  tty: shrink the size of struct tty_struct by 40 bytes
  tty: n_tty: deduplicate copy code in n_tty_receive_buf_real_raw()
  tty: n_tty: extract ECHO_OP processing to a separate function
  tty: n_tty: unify counts to size_t
  tty: n_tty: use u8 for chars and flags
  tty: n_tty: simplify chars_in_buffer()
  tty: n_tty: remove unsigned char casts from character constants
  tty: n_tty: move newline handling to a separate function
  tty: n_tty: move canon handling to a separate function
  tty: n_tty: use MASK() for masking out size bits
  tty: n_tty: make n_tty_data::num_overrun unsigned
  tty: n_tty: use time_is_before_jiffies() in n_tty_receive_overrun()
  tty: n_tty: use 'num' for writes' counts
  tty: n_tty: use output character directly
  tty: n_tty: make flow of n_tty_receive_buf_common() a bool
  Revert "tty: serial: meson: Add a earlycon for the T7 SoC"
  Documentation: devices.txt: Fix minors for ttyCPM*
  Documentation: devices.txt: Remove ttySIOC*
  Documentation: devices.txt: Remove ttyIOC*
  serial: 8250_bcm7271: improve bcm7271 8250 port
  ...
2023-09-01 09:38:00 -07:00
Linus Torvalds 4ad0a4c234 powerpc updates for 6.6
- Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
    configured SMT state when hotplugging CPUs into the system.
 
  - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the Radix
    MMU to avoid a broadcast TLBIE flush on exit.
 
  - Drop the exclusion between ptrace/perf watchpoints, and drop the now unused
    associated arch hooks.
 
  - Add support for the "nohlt" command line option to disable CPU idle.
 
  - Add support for -fpatchable-function-entry for ftrace, with GCC >= 13.1.
 
  - Rework memory block size determination, and support 256MB size on systems
    with GPUs that have hotpluggable memory.
 
  - Various other small features and fixes.
 
 Thanks to: Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev,
 Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam Menghani, Geoff Levand,
 Hari Bathini, Immad Mir, Jialin Zhang, Joel Stanley, Jordan Niethe, Justin
 Stitt, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He,
 Linus Walleij, Mahesh Salgaonkar, Masahiro Yamada, Michal Suchanek, Nageswara
 R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick
 Desaulniers, Omar Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell
 Currey, Sourabh Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav
 Jain, Xiongfeng Wang, Yuan Tan, Zhang Rui, Zheng Zengkai.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmTwgbwTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFmpD/432vipeoqvkAYsyK0xi/Y3GcY0wcyd
 WJApLXXadEbtKQrgXQ6sowWqalg5thYnQCRarg/tXKK/po3KfgwkPjGDpOL+cIdr
 12QVN2XJm9VmJ1wYJxzk+yXx4F43AdmMdr94qWAGufbTHezwb4UpzVR1NxtFrOE/
 X5TNsC2+2mdZY/ZaNHS5vsTIFv3EhQfqgjZPlIAdLn6CGc8xWT514Q/uHA8+ytM/
 HL7Hqs33DoPSvgTa5TT/2E0d0k5nO3P5KObzAjpYlireTPaBi51mpKGewcrtm0o2
 v3cBlbfx3C7pe9ZhKBK9BH8cjynfiqsVZ9/lCw/7eBNdm9tHuzG0jeS7Db9tCZXS
 fM7G2R7SoIusPTqxlBmkU5DpYslwrHiVgCyy3ijxkoA/fakVwh/GgTcMsRt73IY6
 n6DsUvWwuYHCIeIiHmHQJqCqCRtV+aMzU3AbbBHOjtdIanhlW16M686dEsgCirh7
 akRVRD5VqKaqXs34PpkRL89Xv3wZRjl6XZ3hZFfCjSYXfpXDXhgSToIskpHYhKL8
 gpY7WtG9YQP05Xz5HRCx6EluaZVeKe0lZi6fezX7Mi9AygJQO8FfXqP1mHBlEq40
 ThWtvL9D89RV6lADqqFN20XepgvKNOyAXcE4szvsnIZYUSPmZQZSPxx+DHtROaLP
 jX3ifxtxJp92pQ==
 =5g7K
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
   configured SMT state when hotplugging CPUs into the system

 - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the
   Radix MMU to avoid a broadcast TLBIE flush on exit

 - Drop the exclusion between ptrace/perf watchpoints, and drop the now
   unused associated arch hooks

 - Add support for the "nohlt" command line option to disable CPU idle

 - Add support for -fpatchable-function-entry for ftrace, with GCC >=
   13.1

 - Rework memory block size determination, and support 256MB size on
   systems with GPUs that have hotpluggable memory

 - Various other small features and fixes

Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam
Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel
Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof
Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar,
Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor,
Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar
Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh
Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain,
Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai.

* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits)
  macintosh/ams: linux/platform_device.h is needed
  powerpc/xmon: Reapply "Relax frame size for clang"
  powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached
  powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled
  powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
  powerpc/mpc5xxx: Add missing fwnode_handle_put()
  powerpc/config: Disable SLAB_DEBUG_ON in skiroot
  powerpc/pseries: Remove unused hcall tracing instruction
  powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n
  powerpc: dts: add missing space before {
  powerpc/eeh: Use pci_dev_id() to simplify the code
  powerpc/64s: Move CPU -mtune options into Kconfig
  powerpc/powermac: Fix unused function warning
  powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT
  powerpc: Don't include lppaca.h in paca.h
  powerpc/pseries: Move hcall_vphn() prototype into vphn.h
  powerpc/pseries: Move VPHN constants into vphn.h
  cxl: Drop unused detach_spa()
  powerpc: Drop zalloc_maybe_bootmem()
  powerpc/powernv: Use struct opal_prd_msg in more places
  ...
2023-08-31 12:43:10 -07:00
Linus Torvalds df57721f9a Add x86 shadow stack support
Convert IBT selftest to asm to fix objtool warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ
 krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC
 Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/
 gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO
 Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR
 ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx
 rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9
 MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/
 Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s
 VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh
 8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp
 AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA=
 =3UUm
 -----END PGP SIGNATURE-----

Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stack support from Dave Hansen:
 "This is the long awaited x86 shadow stack support, part of Intel's
  Control-flow Enforcement Technology (CET).

  CET consists of two related security features: shadow stacks and
  indirect branch tracking. This series implements just the shadow stack
  part of this feature, and just for userspace.

  The main use case for shadow stack is providing protection against
  return oriented programming attacks. It works by maintaining a
  secondary (shadow) stack using a special memory type that has
  protections against modification. When executing a CALL instruction,
  the processor pushes the return address to both the normal stack and
  to the special permission shadow stack. Upon RET, the processor pops
  the shadow stack copy and compares it to the normal stack copy.

  For more information, refer to the links below for the earlier
  versions of this patch set"

Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  x86/shstk: Change order of __user in type
  x86/ibt: Convert IBT selftest to asm
  x86/shstk: Don't retry vm_munmap() on -EINTR
  x86/kbuild: Fix Documentation/ reference
  x86/shstk: Move arch detail comment out of core mm
  x86/shstk: Add ARCH_SHSTK_STATUS
  x86/shstk: Add ARCH_SHSTK_UNLOCK
  x86: Add PTRACE interface for shadow stack
  selftests/x86: Add shadow stack test
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/shstk: Wire in shadow stack interface
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Support WRSS for userspace
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Check that signal frame is shadow stack mem
  x86/shstk: Check that SSP is aligned on sigreturn
  x86/shstk: Handle signals for shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle thread shadow stack
  x86/shstk: Add user-mode shadow stack support
  ...
2023-08-31 12:20:12 -07:00
Christoph Hellwig 765aa6b3a4 dma-pool: remove a __maybe_unused label in atomic_pool_expand
Move the #endif a line so that free_page label is only seen by the
compile pass when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-31 14:12:37 +02:00
Linus Torvalds cd99b9eb4b Documentation work keeps chugging along; stuff for 6.6 includes:
- Work from Carlos Bilbao to integrate rustdoc output into the generated
   HTML documentation.  This took some work to figure out how to do it
   without slowing the docs build and without creating people who don't have
   Rust installed, but Carlos got there.
 
 - Move the loongarch and mips architecture documentation under
   Documentation/arch/.
 
 - Some more maintainer documentation from Jakub
 
 ...plus the usual assortment of updates, translations, and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmTvqNkPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YgIgH/3drfLtlFtzLqDOzrzDXS8yGnE3pPdxw796b
 /ZFzAK16wYKaKevYoIz8bVGGKaE1sEUW0mhlq4KGdfZuxLG8YnWS8URyCW4FDU2E
 6qNL+8oJ8LZfID46f9Q8ZgfEz7yF/mhCqPk7MEswYtwbscs2ZTGCTGYB/5BHlBuT
 LR+M89uLmHgr8S1o24v30OgiX+VvQFyu0xoxIhbiqUZvBd/XdfX2pgYd9BGzMj5q
 C2ZP+V14g36c5pV0EO9TwhCXOF/WVrp7DbjbfWAsqBSLxvpXPydH2q1DUzGeQtP1
 exujrBD1O8q3pPdaNA5R+h6cWlHmUZug9mE4BRLp9ErGrozwJsQ=
 =C3Uv
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.6' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Documentation work keeps chugging along; this includes:

   - Work from Carlos Bilbao to integrate rustdoc output into the
     generated HTML documentation. This took some work to figure out how
     to do it without slowing the docs build and without creating people
     who don't have Rust installed, but Carlos got there

   - Move the loongarch and mips architecture documentation under
     Documentation/arch/

   - Some more maintainer documentation from Jakub

  ... plus the usual assortment of updates, translations, and fixes"

* tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits)
  Docu: genericirq.rst: fix irq-example
  input: docs: pxrc: remove reference to phoenix-sim
  Documentation: serial-console: Fix literal block marker
  docs/mm: remove references to hmm_mirror ops and clean typos
  docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add()
  Documentation: Fix typos
  Documentation/ABI: Fix typos
  scripts: kernel-doc: fix macro handling in enums
  scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN]
  Documentation: riscv: Update boot image header since EFI stub is supported
  Documentation: riscv: Add early boot document
  Documentation: arm: Add bootargs to the table of added DT parameters
  docs: kernel-parameters: Refer to the correct bitmap function
  doc: update params of memhp_default_state=
  docs: Add book to process/kernel-docs.rst
  docs: sparse: fix invalid link addresses
  docs: vfs: clean up after the iterate() removal
  docs: Add a section on surveys to the researcher guidelines
  docs: move mips under arch
  docs: move loongarch under arch
  ...
2023-08-30 20:05:42 -07:00
Phil Sutter ea078ae910 netfilter: nf_tables: Audit log rule reset
Resetting rules' stateful data happens outside of the transaction logic,
so 'get' and 'dump' handlers have to emit audit log entries themselves.

Fixes: 8daa8fde3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-31 01:29:28 +02:00
Phil Sutter 7e9be1124d netfilter: nf_tables: Audit log setelem reset
Since set element reset is not integrated into nf_tables' transaction
logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET
handling.

For the sake of simplicity, catchall element reset will always generate
a dedicated log entry. This relieves nf_tables_dump_set() from having to
adjust the logged element count depending on whether a catchall element
was found or not.

Fixes: 079cd63321 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-31 01:29:27 +02:00
Linus Torvalds 1a35914f73 integrity-v6.6
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCZO0WoxQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5alsAP0UZQIKI2zEjFdtucgClcSouflIOC5i
 Hvtgv3qVFXPZQwEA2H/SGjigtH5NruVXECDZdrIfaGGvBhyeY72lbswXfQ0=
 =Gu8i
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity subsystem updates from Mimi Zohar:

 - With commit 099f26f22f ("integrity: machine keyring CA
   configuration") certificates may be loaded onto the IMA keyring,
   directly or indirectly signed by keys on either the "builtin" or the
   "machine" keyrings.

   With the ability for the system/machine owner to sign the IMA policy
   itself without needing to recompile the kernel, update the IMA
   architecture specific policy rules to require the IMA policy itself
   be signed.

   [ As commit 099f26f22f was upstreamed in linux-6.4, updating the
     IMA architecture specific policy now to require signed IMA policies
     may break userspace expectations. ]

 - IMA only checked the file data hash was not on the system blacklist
   keyring for files with an appended signature (e.g. kernel modules,
   Power kernel image).

   Check all file data hashes regardless of how it was signed

 - Code cleanup, and a kernel-doc update

* tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments
  ima: require signed IMA policy when UEFI secure boot is enabled
  integrity: Always reference the blacklist keyring with appraisal
  ima: Remove deprecated IMA_TRUSTED_KEYRING Kconfig
2023-08-30 09:16:56 -07:00
Linus Torvalds 1086eeac9c lsm/stable-6.6 PR 20230829
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmTuKLcUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXM/Eg//cwaOu/ASS08Cz/tfXeKpzg9UpzbW
 uHqGtgdE9ZEvS71z+3dorOJVPEwPr+/yviq3FXYjYHFqvVhLZCvYM9rw+eNo/k4T
 I95UTchGUsMWwkw61YBDLythfXm2UL5nabjckO81i9UPtxUYOwF6xQMQXYyMcLL8
 6fm1vnCvK5FBEXi2HSUWy3Eb3wdviGdHrL6h19Aeew+q8u33asWSxn9vmBSSFEzZ
 492//Pgy0t3FA6paWXQRvoR+GvLgBXNOvHB68cAx9vS8Lq6mAwJJSCRrQtKGh2Gd
 YInr49f+TXOosD5Tm6ueWO4sr8RzQZ7nPyM+BLue4Yn2ZzdYgjwfHdkHWS1KeH5X
 qVqa9s6/QONvkSCzqHs/ne2qio1Q0/0uGgwOkx6N7oVWQWjE7iTYlADwM0CDJnd2
 UD7AHTOgpc88x1T1eW599MZttSCznBTSFXv4waaS5/5NT9n8Db1TpTtCTedOc1x2
 n+c+F5BHLy69vhSGCanvum/8i2gNoKVyYaHyaMsQxr5LRcLnvN6oOjWIv7jMKxe7
 GavUAxU7M5rxPUH44vrrrI+XztKJOdpCz4S0xp+7pSSSGAK5KkmVVLXjzrlGO1WS
 55ixxQWYTGK0KlWHp4Ofi6brE9a4ATKcd1XscPN+AtBYX2ufNHLskCZulu/lyrMx
 lAy9RRDe1hHWTvg=
 =dnm4
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull LSM updates from Paul Moore:

 - Add proper multi-LSM support for xattrs in the
   security_inode_init_security() hook

   Historically the LSM layer has only allowed a single LSM to add an
   xattr to an inode, with IMA/EVM measuring that and adding its own as
   well. As we work towards promoting IMA/EVM to a "proper LSM" instead
   of the special case that it is now, we need to better support the
   case of multiple LSMs each adding xattrs to an inode and after
   several attempts we now appear to have something that is working
   well. It is worth noting that in the process of making this change we
   uncovered a problem with Smack's SMACK64TRANSMUTE xattr which is also
   fixed in this pull request.

 - Additional LSM hook constification

   Two patches to constify parameters to security_capget() and
   security_binder_transfer_file(). While I generally don't make a
   special note of who submitted these patches, these were the work of
   an Outreachy intern, Khadija Kamran, and that makes me happy;
   hopefully it does the same for all of you reading this.

 - LSM hook comment header fixes

   One patch to add a missing hook comment header, one to fix a minor
   typo.

 - Remove an old, unused credential function declaration

   It wasn't clear to me who should pick this up, but it was trivial,
   obviously correct, and arguably the LSM layer has a vested interest
   in credentials so I merged it. Sadly I'm now noticing that despite my
   subject line cleanup I didn't cleanup the "unsued" misspelling, sigh

* tag 'lsm-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: constify the 'file' parameter in security_binder_transfer_file()
  lsm: constify the 'target' parameter in security_capget()
  lsm: add comment block for security_sk_classify_flow LSM hook
  security: Fix ret values doc for security_inode_init_security()
  cred: remove unsued extern declaration change_create_files_as()
  evm: Support multiple LSMs providing an xattr
  evm: Align evm_inode_init_security() definition with LSM infrastructure
  smack: Set the SMACK64TRANSMUTE xattr in smack_inode_init_security()
  security: Allow all LSMs to provide xattrs for inode_init_security hook
  lsm: fix typo in security_file_lock() comment header
2023-08-30 09:07:09 -07:00
Linus Torvalds 3ea67c4f46 audit/stable-6.6 PR 20230829
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmTuKIQUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMSahAA4o+mfGxcadExo8wsEFfizsQd0JS1
 6KpV8Gl9/uwPTCUmvjquFnTb5tbNFZ1X7jnj2g0+/ZHYPp9yJQqTKu7NX1Q9w+dE
 11tiipc4CyrcJpWrjBinNH27txjulLSCN1imMnRYLZOpk1AbXTwjuLjFBy2iTDtm
 8TAPj4vcKbi5MlcUodp/DGO6ysL75gTsLn5UUsHJhWbofz4ECay0heQoPeZ/MaW3
 gBPMRgt/REg8ikdR/ntFMOD6ywBZZ0Vsf/S+hNWGwHUgGxQ5H7rJBEFI65HL4Ur1
 c36UFRsypT1sFaIDbS/PrvpT3M48XwmqdmWNx5Z1dtJCCwNhuhsmEkXB+GEud2qM
 SOQQfMgfjKvnaLMPUmDePuAiSflSJj2AHo1HXlYxKFtybI1plJGiRoDX5jlsklCp
 JbwUJ2y7YlxNPIaZSBHYIUuniUDqET83cR2D3YJiU+2I9myg8Z5Amto8d4MFgf21
 f4qfm0SDBMvXYHUuhUry0/kuk2A0R89H4HUNcrGky+cSsaelpm06uaxj43B/M9Dp
 v1nSwDQpDtYKSt+16GUDfqq5BywjwMe4J7wlE9+YdTDrvuc2qUxZMky5GzZ55Wnl
 mbe6BVEBc19FhDeC3muhgV0jWCUGKuq6q+W+CRmxafyOMzX9NIDFaZf1KxkaesxD
 S9I7AYmT7fCghFQ=
 =tZaJ
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Six audit patches, the highlights are:

   - Add an explicit cond_resched() call when generating PATH records

     Certain tracefs/debugfs operations can generate a *lot* of audit
     PATH entries and if one has an aggressive system configuration (not
     the default) this can cause a soft lockup in the audit code as it
     works to process all of these new entries.

     This is in sharp contrast to the common case where only one or two
     PATH entries are logged. In order to fix this corner case without
     excessively impacting the common case we're adding a single
     cond_rescued() call between two of the most intensive loops in the
     __audit_inode_child() function.

   - Various minor cleanups

     We removed a conditional header file as the included header already
     had the necessary logic in place, fixed a dummy function's return
     value, and the usual collection of checkpatch.pl noise (whitespace,
     brace, and trailing statement tweaks)"

* tag 'audit-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: move trailing statements to next line
  audit: cleanup function braces and assignment-in-if-condition
  audit: add space before parenthesis and around '=', "==", and '<'
  audit: fix possible soft lockup in __audit_inode_child()
  audit: correct audit_filter_inodes() definition
  audit: include security.h unconditionally
2023-08-30 08:17:35 -07:00
Christoph Hellwig 2dcdf8c18d dma-contiguous: fix the Kconfig entry for CONFIG_DMA_NUMA_CMA
It makes no sense to expose CONFIG_DMA_NUMA_CMA if CONFIG_NUMA is not
enabled, and random config options shouldn't be default unless there
is a good reason.  Replace the default NUMA with a depends on to fix both
issues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-30 13:52:53 +02:00
Thomas Gleixner 2b8272ff4a cpu/hotplug: Prevent self deadlock on CPU hot-unplug
Xiongfeng reported and debugged a self deadlock of the task which initiates
and controls a CPU hot-unplug operation vs. the CFS bandwidth timer.

    CPU1      			                 	 CPU2

T1 sets cfs_quota
   starts hrtimer cfs_bandwidth 'period_timer'
T1 is migrated to CPU2				
						T1 initiates offlining of CPU1
Hotplug operation starts
  ...
'period_timer' expires and is re-enqueued on CPU1
  ...
take_cpu_down()
  CPU1 shuts down and does not handle timers
  anymore. They have to be migrated in the
  post dead hotplug steps by the control task.

						T1 runs the post dead offline operation
					      	T1 is scheduled out
						T1 waits for 'period_timer' to expire

T1 waits there forever if it is scheduled out before it can execute the hrtimer
offline callback hrtimers_dead_cpu().

Cure this by delegating the hotplug control operation to a worker thread on
an online CPU. This takes the initiating user space task, which might be
affected by the bandwidth timer, completely out of the picture.

Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com
Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
2023-08-30 12:24:22 +02:00
Paul Gortmaker 96c1fa04f0 tick/rcu: Fix false positive "softirq work is pending" messages
In commit 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the
new function report_idle_softirq() was created by breaking code out of the
existing can_stop_idle_tick() for kernels v5.18 and newer.

In doing so, the code essentially went from a one conditional:

	if (a && b && c)
		warn();

to a three conditional:

	if (!a)
		return;
	if (!b)
		return;
	if (!c)
		return;
	warn();

But that conversion got the condition for the RT specific
local_bh_blocked() wrong. The original condition was:

   	!local_bh_blocked()

but the conversion failed to negate it so it ended up as:

        if (!local_bh_blocked())
		return false;

This issue lay dormant until another fixup for the same commit was added
in commit a7e282c777 ("tick/rcu: Fix bogus ratelimit condition").
This commit realized the ratelimit was essentially set to zero instead
of ten, and hence *no* softirq pending messages would ever be issued.

Once this commit was backported via linux-stable, both the v6.1 and v6.4
preempt-rt kernels started printing out 10 instances of this at boot:

  NOHZ tick-stop error: local softirq work is pending, handler #80!!!

Remove the negation and return when local_bh_blocked() evaluates to true to
bring the correct behaviour back.

Fixes: 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Reviewed-by: Wen Yang <wenyang.linux@foxmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230818200757.1808398-1-paul.gortmaker@windriver.com
2023-08-30 12:20:28 +02:00
Sergey Senozhatsky fb5a431559 dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lock
__dma_entry_alloc_check_leak() calls into printk -> serial console
output (qcom geni) and grabs port->lock under free_entries_lock
spin lock, which is a reverse locking dependency chain as qcom_geni
IRQ handler can call into dma-debug code and grab free_entries_lock
under port->lock.

Move __dma_entry_alloc_check_leak() call out of free_entries_lock
scope so that we don't acquire serial console's port->lock under it.

Trimmed-down lockdep splat:

 The existing dependency chain (in reverse order) is:

               -> #2 (free_entries_lock){-.-.}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x80
        dma_entry_alloc+0x38/0x110
        debug_dma_map_page+0x60/0xf8
        dma_map_page_attrs+0x1e0/0x230
        dma_map_single_attrs.constprop.0+0x6c/0xc8
        geni_se_rx_dma_prep+0x40/0xcc
        qcom_geni_serial_isr+0x310/0x510
        __handle_irq_event_percpu+0x110/0x244
        handle_irq_event_percpu+0x20/0x54
        handle_irq_event+0x50/0x88
        handle_fasteoi_irq+0xa4/0xcc
        handle_irq_desc+0x28/0x40
        generic_handle_domain_irq+0x24/0x30
        gic_handle_irq+0xc4/0x148
        do_interrupt_handler+0xa4/0xb0
        el1_interrupt+0x34/0x64
        el1h_64_irq_handler+0x18/0x24
        el1h_64_irq+0x64/0x68
        arch_local_irq_enable+0x4/0x8
        ____do_softirq+0x18/0x24
        ...

               -> #1 (&port_lock_key){-.-.}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x80
        qcom_geni_serial_console_write+0x184/0x1dc
        console_flush_all+0x344/0x454
        console_unlock+0x94/0xf0
        vprintk_emit+0x238/0x24c
        vprintk_default+0x3c/0x48
        vprintk+0xb4/0xbc
        _printk+0x68/0x90
        register_console+0x230/0x38c
        uart_add_one_port+0x338/0x494
        qcom_geni_serial_probe+0x390/0x424
        platform_probe+0x70/0xc0
        really_probe+0x148/0x280
        __driver_probe_device+0xfc/0x114
        driver_probe_device+0x44/0x100
        __device_attach_driver+0x64/0xdc
        bus_for_each_drv+0xb0/0xd8
        __device_attach+0xe4/0x140
        device_initial_probe+0x1c/0x28
        bus_probe_device+0x44/0xb0
        device_add+0x538/0x668
        of_device_add+0x44/0x50
        of_platform_device_create_pdata+0x94/0xc8
        of_platform_bus_create+0x270/0x304
        of_platform_populate+0xac/0xc4
        devm_of_platform_populate+0x60/0xac
        geni_se_probe+0x154/0x160
        platform_probe+0x70/0xc0
        ...

               -> #0 (console_owner){-...}-{0:0}:
        __lock_acquire+0xdf8/0x109c
        lock_acquire+0x234/0x284
        console_flush_all+0x330/0x454
        console_unlock+0x94/0xf0
        vprintk_emit+0x238/0x24c
        vprintk_default+0x3c/0x48
        vprintk+0xb4/0xbc
        _printk+0x68/0x90
        dma_entry_alloc+0xb4/0x110
        debug_dma_map_sg+0xdc/0x2f8
        __dma_map_sg_attrs+0xac/0xe4
        dma_map_sgtable+0x30/0x4c
        get_pages+0x1d4/0x1e4 [msm]
        msm_gem_pin_pages_locked+0x38/0xac [msm]
        msm_gem_pin_vma_locked+0x58/0x88 [msm]
        msm_ioctl_gem_submit+0xde4/0x13ac [msm]
        drm_ioctl_kernel+0xe0/0x15c
        drm_ioctl+0x2e8/0x3f4
        vfs_ioctl+0x30/0x50
        ...

 Chain exists of:
   console_owner --> &port_lock_key --> free_entries_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(free_entries_lock);
                                lock(&port_lock_key);
                                lock(free_entries_lock);
   lock(console_owner);

                *** DEADLOCK ***

 Call trace:
  dump_backtrace+0xb4/0xf0
  show_stack+0x20/0x30
  dump_stack_lvl+0x60/0x84
  dump_stack+0x18/0x24
  print_circular_bug+0x1cc/0x234
  check_noncircular+0x78/0xac
  __lock_acquire+0xdf8/0x109c
  lock_acquire+0x234/0x284
  console_flush_all+0x330/0x454
  console_unlock+0x94/0xf0
  vprintk_emit+0x238/0x24c
  vprintk_default+0x3c/0x48
  vprintk+0xb4/0xbc
  _printk+0x68/0x90
  dma_entry_alloc+0xb4/0x110
  debug_dma_map_sg+0xdc/0x2f8
  __dma_map_sg_attrs+0xac/0xe4
  dma_map_sgtable+0x30/0x4c
  get_pages+0x1d4/0x1e4 [msm]
  msm_gem_pin_pages_locked+0x38/0xac [msm]
  msm_gem_pin_vma_locked+0x58/0x88 [msm]
  msm_ioctl_gem_submit+0xde4/0x13ac [msm]
  drm_ioctl_kernel+0xe0/0x15c
  drm_ioctl+0x2e8/0x3f4
  vfs_ioctl+0x30/0x50
  ...

Reported-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-30 11:29:08 +02:00
Linus Torvalds 6c1b980a7e dma-maping updates for Linux 6.6
- allow dynamic sizing of the swiotlb buffer, to cater for secure
    virtualization workloads that require all I/O to be bounce buffered
    (Petr Tesarik)
  - move a declaration to a header (Arnd Bergmann)
  - check for memory region overlap in dma-contiguous (Binglei Wang)
  - remove the somewhat dangerous runtime swiotlb-xen enablement and
    unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)
  - per-node CMA improvements (Yajun Deng)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmTuDHkLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOqvhAApMk2/ceTgVH17sXaKE822+xKvgv377O6TlggMeGG
 W4zA0KD69DNz0AfaaCc5U5f7n8Ld/YY1RsvkHW4b3jgw+KRTeQr0jjitBgP5kP2M
 A1+qxdyJpCTwiPt9s2+JFVPeyZ0s52V6OJODKRG3s0ore55R+U09VySKtASON+q3
 GMKfWqQteKC+thg7NkrQ7JUixuo84oICws+rZn4K9ifsX2O0HYW6aMW0feRfZjJH
 r0TgqZc4RdPTSaF22oapR9Ls39+7hp/pBvoLm5sBNA3cl5C3X4VWo9ERMU1jW9h+
 VYQv39NycUspgskWJmpbU06/+ooYqQlwHSR/vdNusmFIvxo4tf6/UX72YO5F8Dar
 ap0wYGauiEwTjSnhVxPTXk3obWyWEsgFAeRnPdTlH2CNmv38QZU2HLb8eU1pcXxX
 j+WI2Ewy9z22uBVYiPOKpdW1jkSfmlmfPp/8SbAdua7I3YQ90rQN6AvU06zAi/cL
 NQTgO81E4jPkygqAVgS/LeYziWAQ73yM7m9ExThtTgqFtHortwhJ4Fd8XKtvtvEb
 viXAZ/WZtQBv/CIKAW98NhgIDP/SPOT8ym6V35WK+kkNFMS6LMSQUfl9GgbHGyFa
 n9icMm7BmbDtT1+AKNafG9En4DtAf9M9QNidAVOyfrsIk6S0gZoZwvIStkA7on8a
 cNY=
 =kVVr
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-maping updates from Christoph Hellwig:

 - allow dynamic sizing of the swiotlb buffer, to cater for secure
   virtualization workloads that require all I/O to be bounce buffered
   (Petr Tesarik)

 - move a declaration to a header (Arnd Bergmann)

 - check for memory region overlap in dma-contiguous (Binglei Wang)

 - remove the somewhat dangerous runtime swiotlb-xen enablement and
   unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)

 - per-node CMA improvements (Yajun Deng)

* tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: optimize get_max_slots()
  swiotlb: move slot allocation explanation comment where it belongs
  swiotlb: search the software IO TLB only if the device makes use of it
  swiotlb: allocate a new memory pool when existing pools are full
  swiotlb: determine potential physical address limit
  swiotlb: if swiotlb is full, fall back to a transient memory pool
  swiotlb: add a flag whether SWIOTLB is allowed to grow
  swiotlb: separate memory pool data from other allocator data
  swiotlb: add documentation and rename swiotlb_do_find_slots()
  swiotlb: make io_tlb_default_mem local to swiotlb.c
  swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated
  dma-contiguous: check for memory region overlap
  dma-contiguous: support numa CMA for specified node
  dma-contiguous: support per-numa CMA for all architectures
  dma-mapping: move arch_dma_set_mask() declaration to header
  swiotlb: unexport is_swiotlb_active
  x86: always initialize xen-swiotlb when xen-pcifront is enabling
  xen/pci: add flag for PCI passthrough being possible
2023-08-29 20:32:10 -07:00
Linus Torvalds adfd671676 sysctl-6.6-rc1
Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and
 placings sysctls to their own sybsystem or file to help avoid merge conflicts.
 Matthew Wilcox pointed out though that if we're going to do that we might as
 well also *save* space while at it and try to remove the extra last sysctl
 entry added at the end of each array, a sentintel, instead of bloating the
 kernel by adding a new sentinel with each array moved.
 
 Doing that was not so trivial, and has required slowing down the moves of
 kernel/sysctl.c arrays and measuring the impact on size by each new move.
 
 The complex part of the effort to help reduce the size of each sysctl is being
 done by the patient work of el señor Don Joel Granados. A lot of this is truly
 painful code refactoring and testing and then trying to measure the savings of
 each move and removing the sentinels. Although Joel already has code which does
 most of this work, experience with sysctl moves in the past shows is we need to
 be careful due to the slew of odd build failures that are possible due to the
 amount of random Kconfig options sysctls use.
 
 To that end Joel's work is split by first addressing the major housekeeping
 needed to remove the sentinels, which is part of this merge request. The rest
 of the work to actually remove the sentinels will be done later in future
 kernel releases.
 
 At first I was only going to send his first 7 patches of his patch series,
 posted 1 month ago, but in retrospect due to the testing the changes have
 received in linux-next and the minor changes they make this goes with the
 entire set of patches Joel had planned: just sysctl house keeping. There are
 networking changes but these are part of the house keeping too.
 
 The preliminary math is showing this will all help reduce the overall build
 time size of the kernel and run time memory consumed by the kernel by about
 ~64 bytes per array where we are able to remove each sentinel in the future.
 That also means there is no more bloating the kernel with the extra ~64 bytes
 per array moved as no new sentinels are created.
 
 Most of this has been in linux-next for about a month, the last 7 patches took
 a minor refresh 2 week ago based on feedback.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO
 Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC
 R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+
 U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x
 Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw
 piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI
 uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb
 N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF
 xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09
 h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf
 wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin
 AUg2VWHKPmW9
 =sq2p
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "Long ago we set out to remove the kitchen sink on kernel/sysctl.c
  arrays and placings sysctls to their own sybsystem or file to help
  avoid merge conflicts. Matthew Wilcox pointed out though that if we're
  going to do that we might as well also *save* space while at it and
  try to remove the extra last sysctl entry added at the end of each
  array, a sentintel, instead of bloating the kernel by adding a new
  sentinel with each array moved.

  Doing that was not so trivial, and has required slowing down the moves
  of kernel/sysctl.c arrays and measuring the impact on size by each new
  move.

  The complex part of the effort to help reduce the size of each sysctl
  is being done by the patient work of el señor Don Joel Granados. A lot
  of this is truly painful code refactoring and testing and then trying
  to measure the savings of each move and removing the sentinels.
  Although Joel already has code which does most of this work,
  experience with sysctl moves in the past shows is we need to be
  careful due to the slew of odd build failures that are possible due to
  the amount of random Kconfig options sysctls use.

  To that end Joel's work is split by first addressing the major
  housekeeping needed to remove the sentinels, which is part of this
  merge request. The rest of the work to actually remove the sentinels
  will be done later in future kernel releases.

  The preliminary math is showing this will all help reduce the overall
  build time size of the kernel and run time memory consumed by the
  kernel by about ~64 bytes per array where we are able to remove each
  sentinel in the future. That also means there is no more bloating the
  kernel with the extra ~64 bytes per array moved as no new sentinels
  are created"

* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: Use ctl_table_size as stopping criteria for list macro
  sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
  vrf: Update to register_net_sysctl_sz
  networking: Update to register_net_sysctl_sz
  netfilter: Update to register_net_sysctl_sz
  ax.25: Update to register_net_sysctl_sz
  sysctl: Add size to register_net_sysctl function
  sysctl: Add size arg to __register_sysctl_init
  sysctl: Add size to register_sysctl
  sysctl: Add a size arg to __register_sysctl_table
  sysctl: Add size argument to init_header
  sysctl: Add ctl_table_size to ctl_table_header
  sysctl: Use ctl_table_header in list_for_each_table_entry
  sysctl: Prefer ctl_table_header in proc_sysctl
2023-08-29 17:39:15 -07:00
Linus Torvalds daa22f5a78 Modules changes for v6.6-rc1
Summary of the changes worth highlighting from most interesting to boring below:
 
   * Christoph Hellwig's symbol_get() fix to Nvidia's efforts to circumvent the
     protection he put in place in year 2020 to prevent proprietary modules from
     using GPL only symbols, and also ensuring proprietary modules which export
     symbols grandfather their taint. That was done through year 2020 commit
     262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE"). Christoph's new
     fix is done by clarifing __symbol_get() was only ever intended to prevent
     module reference loops by Linux kernel modules and so making it only find
     symbols exported via EXPORT_SYMBOL_GPL(). The circumvention tactic used
     by Nvidia was to use symbol_get() to purposely swift through proprietary
     module symbols and completley bypass our traditional EXPORT_SYMBOL*()
     annotations and community agreed upon restrictions.
 
     A small set of preamble patches fix up a few symbols which just needed
     adjusting for this on two modules, the rtc ds1685 and the networking enetc
     module. Two other modules just needed some build fixing and removal of use
     of __symbol_get() as they can't ever be modular, as was done by Arnd on
     the ARM pxa module and Christoph did on the mmc au1xmmc driver.
 
     This is a good reminder to us that symbol_get() is just a hack to address
     things which should be fixed through Kconfig at build time as was done in
     the later patches, and so ultimately it should just go.
 
   * Extremely late minor fix for old module layout 055f23b74b ("module: check
     for exit sections in layout_sections() instead of module_init_section()") by
     James Morse for arm64. Note that this layout thing is old, it is *not*
     Song Liu's commit ac3b432839 ("module: replace module_layout with
     module_memory"). The issue however is very odd to run into and so there was
     no hurry to get this in fast.
 
   * Although the fix did not go through the modules tree I'd like to highlight
     the fix by Peter Zijlstra in commit 5409730962 ("x86/static_call: Fix
     __static_call_fixup()") now merged in your tree which came out of what
     was originally suspected to be a fallout of the the newer module layout
     changes by Song Liu commit ac3b432839 ("module: replace module_layout
     with module_memory") instead of module_init_section()"). Thanks to the report
     by Christian Bricart and the debugging by Song Liu & Peter that turned to
     be noted as a kernel regression in place since v5.19 through commit
     ee88d363d1 ("x86,static_call: Use alternative RET encoding").
 
     I highlight this to reflect and clarify that we haven't seen more fallout
     from ac3b432839 ("module: replace module_layout with module_memory").
 
   * RISC-V toolchain got mapping symbol support which prefix symbols with "$"
     to help with alignment considerations for disassembly. This is used to
     differentiate between incompatible instruction encodings when disassembling.
     RISC-V just matches what ARM/AARCH64 did for alignment considerations and
     Palmer Dabbelt extended is_mapping_symbol() to accept these symbols for
     RISC-V. We already had support for this for all architectures but it also
     checked for the second character, the RISC-V check Dabbelt added was just
     for the "$". After a bit of testing and fallout on linux-next and based on
     feedback from Masahiro Yamada it was decided to simplify the check and treat
     the first char "$" as unique for all architectures, and so we no make
     is_mapping_symbol() for all archs if the symbol starts with "$".
 
     The most relevant commit for this for RISC-V on binutils was:
 
     https://sourceware.org/pipermail/binutils/2021-July/117350.html
 
   * A late fix by Andrea Righi (today) to make module zstd decompression use
     vmalloc() instead of kmalloc() to account for large compressed modules. I
     suspect we'll see similar things for other decompression algorithms soon.
 
   * samples/hw_breakpoint minor fixes by Rong Tao, Arnd Bergmann and Chen Jiahao
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuShISHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoin7rEQAIt9cGmkHyA6Po/Ex8DejWvSTTOQzIXk
 NvtGurODghWnCejZ7Yofo1T48mvgHOenDQB9qNSkVtKDyhmWCbss6wQU/5M8Mc3A
 G+9svkQ8H1BRzTwX3WJKF9KNMhI0HA0CXz3ED/I4iX/Q4Ffv3bgbAiitY6r48lJV
 PSKPzwH9QMIti6k3j+bFf2SwWCV3X2jz+btdxwY34dVFyggdYgaBNKEdrumCx4nL
 g0tQQxI8QgltOnwlfOPLEhdSU1yWyIWZtqtki6xksLziwTreRaw1HotgXQDpnt/S
 iJY9xiKN1ChtVSprQlbTb9yhFbCEGvOYGEaKl/ZsGENQjKzRWsQ+dtT8Ww6n2Y1H
 aJXwniv6SqCW7dCwdKo4sE7JFYDP56yFYKBLOPSPbMm6DJwTMbzLUf7TGNh6NKyl
 3pqjGagJ+LTj3l9w5ur4zTrDGAmLzMpNR03+6niTM7C3TPOI1+wh5zGbvtoA/WdA
 ytQeOTiUsi0uyVgk50f67IC6virrxwupeyZQlYFGNuEGBClgXzzzgw/MKwg0VMvc
 aWhFPUOLx8/8juJ3A5qiOT+znQJ2DTqWlT+QkQ8R5qFVXEW1g9IOnhaHqDX+KB0A
 OPlZ9xwss2U0Zd1XhourtqhUhvcODWNzTj3oPzjdrGiBjdENz8hPKP+7HV1CG6xy
 RdxpSwu72kFu
 =IQy2
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules updates from Luis Chamberlain:
 "Summary of the changes worth highlighting from most interesting to
  boring below:

   - Christoph Hellwig's symbol_get() fix to Nvidia's efforts to
     circumvent the protection he put in place in year 2020 to prevent
     proprietary modules from using GPL only symbols, and also ensuring
     proprietary modules which export symbols grandfather their taint.

     That was done through year 2020 commit 262e6ae708 ("modules:
     inherit TAINT_PROPRIETARY_MODULE"). Christoph's new fix is done by
     clarifing __symbol_get() was only ever intended to prevent module
     reference loops by Linux kernel modules and so making it only find
     symbols exported via EXPORT_SYMBOL_GPL(). The circumvention tactic
     used by Nvidia was to use symbol_get() to purposely swift through
     proprietary module symbols and completely bypass our traditional
     EXPORT_SYMBOL*() annotations and community agreed upon
     restrictions.

     A small set of preamble patches fix up a few symbols which just
     needed adjusting for this on two modules, the rtc ds1685 and the
     networking enetc module. Two other modules just needed some build
     fixing and removal of use of __symbol_get() as they can't ever be
     modular, as was done by Arnd on the ARM pxa module and Christoph
     did on the mmc au1xmmc driver.

     This is a good reminder to us that symbol_get() is just a hack to
     address things which should be fixed through Kconfig at build time
     as was done in the later patches, and so ultimately it should just
     go.

   - Extremely late minor fix for old module layout 055f23b74b
     ("module: check for exit sections in layout_sections() instead of
     module_init_section()") by James Morse for arm64. Note that this
     layout thing is old, it is *not* Song Liu's commit ac3b432839
     ("module: replace module_layout with module_memory"). The issue
     however is very odd to run into and so there was no hurry to get
     this in fast.

   - Although the fix did not go through the modules tree I'd like to
     highlight the fix by Peter Zijlstra in commit 5409730962
     ("x86/static_call: Fix __static_call_fixup()") now merged in your
     tree which came out of what was originally suspected to be a
     fallout of the the newer module layout changes by Song Liu commit
     ac3b432839 ("module: replace module_layout with module_memory")
     instead of module_init_section()"). Thanks to the report by
     Christian Bricart and the debugging by Song Liu & Peter that turned
     to be noted as a kernel regression in place since v5.19 through
     commit ee88d363d1 ("x86,static_call: Use alternative RET
     encoding").

     I highlight this to reflect and clarify that we haven't seen more
     fallout from ac3b432839 ("module: replace module_layout with
     module_memory").

   - RISC-V toolchain got mapping symbol support which prefix symbols
     with "$" to help with alignment considerations for disassembly.

     This is used to differentiate between incompatible instruction
     encodings when disassembling. RISC-V just matches what ARM/AARCH64
     did for alignment considerations and Palmer Dabbelt extended
     is_mapping_symbol() to accept these symbols for RISC-V. We already
     had support for this for all architectures but it also checked for
     the second character, the RISC-V check Dabbelt added was just for
     the "$". After a bit of testing and fallout on linux-next and based
     on feedback from Masahiro Yamada it was decided to simplify the
     check and treat the first char "$" as unique for all architectures,
     and so we no make is_mapping_symbol() for all archs if the symbol
     starts with "$".

     The most relevant commit for this for RISC-V on binutils was:

       https://sourceware.org/pipermail/binutils/2021-July/117350.html

   - A late fix by Andrea Righi (today) to make module zstd
     decompression use vmalloc() instead of kmalloc() to account for
     large compressed modules. I suspect we'll see similar things for
     other decompression algorithms soon.

   - samples/hw_breakpoint minor fixes by Rong Tao, Arnd Bergmann and
     Chen Jiahao"

* tag 'modules-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module/decompress: use vmalloc() for zstd decompression workspace
  kallsyms: Add more debug output for selftest
  ARM: module: Use module_init_layout_section() to spot init sections
  arm64: module: Use module_init_layout_section() to spot init sections
  module: Expose module_init_layout_section()
  modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
  rtc: ds1685: use EXPORT_SYMBOL_GPL for ds1685_rtc_poweroff
  net: enetc: use EXPORT_SYMBOL_GPL for enetc_phc_index
  mmc: au1xmmc: force non-modular build and remove symbol_get usage
  ARM: pxa: remove use of symbol_get()
  samples/hw_breakpoint: mark sample_hbp as static
  samples/hw_breakpoint: fix building without module unloading
  samples/hw_breakpoint: Fix kernel BUG 'invalid opcode: 0000'
  modpost, kallsyms: Treat add '$'-prefixed symbols as mapping symbols
  kernel: params: Remove unnecessary ‘0’ values from err
  module: Ignore RISC-V mapping symbols too
2023-08-29 17:32:32 -07:00
Linus Torvalds d68b4b6f30 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options").
 
 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h").
 
 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands").
 
 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions").
 
 - Switch the handling of kdump from a udev scheme to in-kernel handling,
   by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
   un/plug").
 
 - Many singleton patches to various parts of the tree
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
 juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
 QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
 =WJQa
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
   ("refactor Kconfig to consolidate KEXEC and CRASH options")

 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h")

 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands")

 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions")

 - Switch the handling of kdump from a udev scheme to in-kernel
   handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
   hot un/plug")

 - Many singleton patches to various parts of the tree

* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
  document while_each_thread(), change first_tid() to use for_each_thread()
  drivers/char/mem.c: shrink character device's devlist[] array
  x86/crash: optimize CPU changes
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  crash: hotplug support for kexec_load()
  x86/crash: add x86 crash hotplug support
  crash: memory and CPU hotplug sysfs attributes
  kexec: exclude elfcorehdr from the segment digest
  crash: add generic infrastructure for crash hotplug support
  crash: move a few code bits to setup support of crash hotplug
  kstrtox: consistently use _tolower()
  kill do_each_thread()
  nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
  scripts/bloat-o-meter: count weak symbol sizes
  treewide: drop CONFIG_EMBEDDED
  lockdep: fix static memory detection even more
  lib/vsprintf: declare no_hash_pointers in sprintf.h
  lib/vsprintf: split out sprintf() and friends
  kernel/fork: stop playing lockless games for exe_file replacement
  adfs: delete unused "union adfs_dirtail" definition
  ...
2023-08-29 14:53:51 -07:00
Linus Torvalds b96a3e9142 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP.  It
   also speeds up GUP handling of transparent hugepages.
 
 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").
 
 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").
 
 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").
 
 - xu xin has doe some work on KSM's handling of zero pages.  These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support tracking
   KSM-placed zero-pages").
 
 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
 
 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").
 
 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
 
 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").
 
 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").
 
 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").
 
 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").
 
 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").
 
 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap").  And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").
 
 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
 
 - Baoquan He has converted some architectures to use the GENERIC_IOREMAP
   ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
   GENERIC_IOREMAP way").
 
 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").
 
 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep").  Liam also developed some efficiency improvements
   ("Reduce preallocations for maple tree").
 
 - Cleanup and optimization to the secondary IOMMU TLB invalidation, from
   Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").
 
 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").
 
 - Kemeng Shi provides some maintenance work on the compaction code ("Two
   minor cleanups for compaction").
 
 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
   file-backed faults under the VMA lock").
 
 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").
 
 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").
 
 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").
 
 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
 
 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").
 
 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").
 
 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
 
 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").
 
 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").
 
 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
   on memory feature on ppc64").
 
 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
 
 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").
 
 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").
 
 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").
 
 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").
 
 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").
 
 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").
 
 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").
 
 - pagetable handling cleanups from Matthew Wilcox ("New page table range
   API").
 
 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").
 
 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").
 
 - Matthew Wilcox has also done some maintenance work on the MM subsystem
   documentation ("Improve mm documentation").
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
 jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
 Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
 =Afws
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has doe some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
2023-08-29 14:25:26 -07:00
Mirsad Goran Todorovac fe48ba7dae workqueue: fix data race with the pwq->stats[] increment
KCSAN has discovered a data race in kernel/workqueue.c:2598:

[ 1863.554079] ==================================================================
[ 1863.554118] BUG: KCSAN: data-race in process_one_work / process_one_work

[ 1863.554142] write to 0xffff963d99d79998 of 8 bytes by task 5394 on cpu 27:
[ 1863.554154] process_one_work (kernel/workqueue.c:2598)
[ 1863.554166] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
[ 1863.554177] kthread (kernel/kthread.c:389)
[ 1863.554186] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1863.554197] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1863.554213] read to 0xffff963d99d79998 of 8 bytes by task 5450 on cpu 12:
[ 1863.554224] process_one_work (kernel/workqueue.c:2598)
[ 1863.554235] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
[ 1863.554247] kthread (kernel/kthread.c:389)
[ 1863.554255] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1863.554266] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1863.554280] value changed: 0x0000000000001766 -> 0x000000000000176a

[ 1863.554295] Reported by Kernel Concurrency Sanitizer on:
[ 1863.554303] CPU: 12 PID: 5450 Comm: kworker/u64:1 Tainted: G             L     6.5.0-rc6+ #44
[ 1863.554314] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 1863.554322] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[ 1863.554941] ==================================================================

    lockdep_invariant_state(true);
→   pwq->stats[PWQ_STAT_STARTED]++;
    trace_workqueue_execute_start(work);
    worker->current_func(work);

Moving pwq->stats[PWQ_STAT_STARTED]++; before the line

    raw_spin_unlock_irq(&pool->lock);

resolves the data race without performance penalty.

KCSAN detected at least one additional data race:

[  157.834751] ==================================================================
[  157.834770] BUG: KCSAN: data-race in process_one_work / process_one_work

[  157.834793] write to 0xffff9934453f77a0 of 8 bytes by task 468 on cpu 29:
[  157.834804] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
[  157.834815] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
[  157.834826] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
[  157.834834] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
[  157.834845] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

[  157.834859] read to 0xffff9934453f77a0 of 8 bytes by task 214 on cpu 7:
[  157.834868] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
[  157.834879] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
[  157.834890] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
[  157.834897] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
[  157.834907] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

[  157.834920] value changed: 0x000000000000052a -> 0x0000000000000532

[  157.834933] Reported by Kernel Concurrency Sanitizer on:
[  157.834941] CPU: 7 PID: 214 Comm: kworker/u64:2 Tainted: G             L     6.5.0-rc7-kcsan-00169-g81eaf55a60fc #4
[  157.834951] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[  157.834958] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[  157.835567] ==================================================================

in code:

        trace_workqueue_execute_end(work, worker->current_func);
→       pwq->stats[PWQ_STAT_COMPLETED]++;
        lock_map_release(&lockdep_map);
        lock_map_release(&pwq->wq->lockdep_map);

which needs to be resolved separately.

Fixes: 725e8ec59c ("workqueue: Add pwq->stats[] and a monitoring script")
Cc: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://lore.kernel.org/lkml/20230818194448.29672-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-29 09:52:16 -10:00
Hao Jia c958ca2013 sched/fair: Make update_entity_lag() static
The function update_entity_lag() is only used inside the kernel/sched/fair.c file.
Make it static.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230829030325.69128-1-jiahao.os@bytedance.com
2023-08-29 21:05:28 +02:00
Linus Torvalds bd6c11bc43 Networking changes for 6.6.
Core
 ----
 
  - Increase size limits for to-be-sent skb frag allocations. This
    allows tun, tap devices and packet sockets to better cope with large
    writes operations.
 
  - Store netdevs in an xarray, to simplify iterating over netdevs.
 
  - Refactor nexthop selection for multipath routes.
 
  - Improve sched class lifetime handling.
 
  - Add backup nexthop ID support for bridge.
 
  - Implement drop reasons support in openvswitch.
 
  - Several data races annotations and fixes.
 
  - Constify the sk parameter of routing functions.
 
  - Prepend kernel version to netconsole message.
 
 Protocols
 ---------
 
  - Implement support for TCP probing the peer being under memory
    pressure.
 
  - Remove hard coded limitation on IPv6 specific info placement
    inside the socket struct.
 
  - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
    per socket scaling factor.
 
  - Scaling-up the IPv6 expired route GC via a separated list of
    expiring routes.
 
  - In-kernel support for the TLS alert protocol.
 
  - Better support for UDP reuseport with connected sockets.
 
  - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
    header size.
 
  - Get rid of additional ancillary per MPTCP connection struct socket.
 
  - Implement support for BPF-based MPTCP packet schedulers.
 
  - Format MPTCP subtests selftests results in TAP.
 
  - Several new SMC 2.1 features including unique experimental options,
    max connections per lgr negotiation, max links per lgr negotiation.
 
 BPF
 ---
 
  - Multi-buffer support in AF_XDP.
 
  - Add multi uprobe BPF links for attaching multiple uprobes
    and usdt probes, which is significantly faster and saves extra fds.
 
  - Implement an fd-based tc BPF attach API (TCX) and BPF link support on
    top of it.
 
  - Add SO_REUSEPORT support for TC bpf_sk_assign.
 
  - Support new instructions from cpu v4 to simplify the generated code and
    feature completeness, for x86, arm64, riscv64.
 
  - Support defragmenting IPv(4|6) packets in BPF.
 
  - Teach verifier actual bounds of bpf_get_smp_processor_id()
    and fix perf+libbpf issue related to custom section handling.
 
  - Introduce bpf map element count and enable it for all program types.
 
  - Add a BPF hook in sys_socket() to change the protocol ID
    from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
 
  - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
 
  - Add uprobe support for the bpf_get_func_ip helper.
 
  - Check skb ownership against full socket.
 
  - Support for up to 12 arguments in BPF trampoline.
 
  - Extend link_info for kprobe_multi and perf_event links.
 
 Netfilter
 ---------
 
  - Speed-up process exit by aborting ruleset validation if a
    fatal signal is pending.
 
  - Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
 
 Driver API
 ----------
 
  - Page pool optimizations, to improve data locality and cache usage.
 
  - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
    for raw ioctl() handling in drivers.
 
  - Simplify genetlink dump operations (doit/dumpit) providing them
    the common information already populated in struct genl_info.
 
  - Extend and use the yaml devlink specs to [re]generate the split ops.
 
  - Introduce devlink selective dumps, to allow SF filtering SF based on
    handle and other attributes.
 
  - Add yaml netlink spec for netlink-raw families, allow route, link and
    address related queries via the ynl tool.
 
  - Remove phylink legacy mode support.
 
  - Support offload LED blinking to phy.
 
  - Add devlink port function attributes for IPsec.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Broadcom ASP 2.0 (72165) ethernet controller
    - MediaTek MT7988 SoC
    - Texas Instruments AM654 SoC
    - Texas Instruments IEP driver
    - Atheros qca8081 phy
    - Marvell 88Q2110 phy
    - NXP TJA1120 phy
 
  - WiFi:
    - MediaTek mt7981 support
 
  - Can:
    - Kvaser SmartFusion2 PCI Express devices
    - Allwinner T113 controllers
    - Texas Instruments tcan4552/4553 chips
 
  - Bluetooth:
    - Intel Gale Peak
    - Qualcomm WCN3988 and WCN7850
    - NXP AW693 and IW624
    - Mediatek MT2925
 
 Drivers
 -------
 
  - Ethernet NICs:
    - nVidia/Mellanox:
      - mlx5:
        - support UDP encapsulation in packet offload mode
        - IPsec packet offload support in eswitch mode
        - improve aRFS observability by adding new set of counters
        - extends MACsec offload support to cover RoCE traffic
        - dynamic completion EQs
      - mlx4:
        - convert to use auxiliary bus instead of custom interface logic
    - Intel
      - ice:
        - implement switchdev bridge offload, even for LAG interfaces
        - implement SRIOV support for LAG interfaces
      - igc:
        - add support for multiple in-flight TX timestamps
    - Broadcom:
      - bnxt:
        - use the unified RX page pool buffers for XDP and non-XDP
        - use the NAPI skb allocation cache
    - OcteonTX2:
      - support Round Robin scheduling HTB offload
      - TC flower offload support for SPI field
    - Freescale:
      -  add XDP_TX feature support
    - AMD:
      - ionic: add support for PCI FLR event
      - sfc:
        - basic conntrack offload
        - introduce eth, ipv4 and ipv6 pedit offloads
    - ST Microelectronics:
      - stmmac: maximze PTP timestamping resolution
 
  - Virtual NICs:
    - Microsoft vNIC:
      - batch ringing RX queue doorbell on receiving packets
      - add page pool for RX buffers
    - Virtio vNIC:
      - add per queue interrupt coalescing support
    - Google vNIC:
      - add queue-page-list mode support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add port range matching tc-flower offload
      - permit enslavement to netdevices with uppers
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - convert to phylink_pcs
    - Renesas:
      - r8A779fx: add speed change support
      - rzn1: enables vlan support
 
  - Ethernet PHYs:
    - convert mv88e6xxx to phylink_pcs
 
  - WiFi:
    - Qualcomm Wi-Fi 7 (ath12k):
      - extremely High Throughput (EHT) PHY support
    - RealTek (rtl8xxxu):
      - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
        RTL8192EU and RTL8723BU
    - RealTek (rtw89):
      - Introduce Time Averaged SAR (TAS) support
 
  - Connector:
    - support for event filtering
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
 9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
 qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
 /QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
 OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
 6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
 ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
 /daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
 Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
 Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
 4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
 DCW8VGFOZjZM
 =9CsR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Increase size limits for to-be-sent skb frag allocations. This
     allows tun, tap devices and packet sockets to better cope with
     large writes operations

   - Store netdevs in an xarray, to simplify iterating over netdevs

   - Refactor nexthop selection for multipath routes

   - Improve sched class lifetime handling

   - Add backup nexthop ID support for bridge

   - Implement drop reasons support in openvswitch

   - Several data races annotations and fixes

   - Constify the sk parameter of routing functions

   - Prepend kernel version to netconsole message

  Protocols:

   - Implement support for TCP probing the peer being under memory
     pressure

   - Remove hard coded limitation on IPv6 specific info placement inside
     the socket struct

   - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
     socket scaling factor

   - Scaling-up the IPv6 expired route GC via a separated list of
     expiring routes

   - In-kernel support for the TLS alert protocol

   - Better support for UDP reuseport with connected sockets

   - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
     header size

   - Get rid of additional ancillary per MPTCP connection struct socket

   - Implement support for BPF-based MPTCP packet schedulers

   - Format MPTCP subtests selftests results in TAP

   - Several new SMC 2.1 features including unique experimental options,
     max connections per lgr negotiation, max links per lgr negotiation

  BPF:

   - Multi-buffer support in AF_XDP

   - Add multi uprobe BPF links for attaching multiple uprobes and usdt
     probes, which is significantly faster and saves extra fds

   - Implement an fd-based tc BPF attach API (TCX) and BPF link support
     on top of it

   - Add SO_REUSEPORT support for TC bpf_sk_assign

   - Support new instructions from cpu v4 to simplify the generated code
     and feature completeness, for x86, arm64, riscv64

   - Support defragmenting IPv(4|6) packets in BPF

   - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
     perf+libbpf issue related to custom section handling

   - Introduce bpf map element count and enable it for all program types

   - Add a BPF hook in sys_socket() to change the protocol ID from
     IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy

   - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress

   - Add uprobe support for the bpf_get_func_ip helper

   - Check skb ownership against full socket

   - Support for up to 12 arguments in BPF trampoline

   - Extend link_info for kprobe_multi and perf_event links

  Netfilter:

   - Speed-up process exit by aborting ruleset validation if a fatal
     signal is pending

   - Allow NLA_POLICY_MASK to be used with BE16/BE32 types

  Driver API:

   - Page pool optimizations, to improve data locality and cache usage

   - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
     need for raw ioctl() handling in drivers

   - Simplify genetlink dump operations (doit/dumpit) providing them the
     common information already populated in struct genl_info

   - Extend and use the yaml devlink specs to [re]generate the split ops

   - Introduce devlink selective dumps, to allow SF filtering SF based
     on handle and other attributes

   - Add yaml netlink spec for netlink-raw families, allow route, link
     and address related queries via the ynl tool

   - Remove phylink legacy mode support

   - Support offload LED blinking to phy

   - Add devlink port function attributes for IPsec

  New hardware / drivers:

   - Ethernet:
      - Broadcom ASP 2.0 (72165) ethernet controller
      - MediaTek MT7988 SoC
      - Texas Instruments AM654 SoC
      - Texas Instruments IEP driver
      - Atheros qca8081 phy
      - Marvell 88Q2110 phy
      - NXP TJA1120 phy

   - WiFi:
      - MediaTek mt7981 support

   - Can:
      - Kvaser SmartFusion2 PCI Express devices
      - Allwinner T113 controllers
      - Texas Instruments tcan4552/4553 chips

   - Bluetooth:
      - Intel Gale Peak
      - Qualcomm WCN3988 and WCN7850
      - NXP AW693 and IW624
      - Mediatek MT2925

  Drivers:

   - Ethernet NICs:
      - nVidia/Mellanox:
         - mlx5:
            - support UDP encapsulation in packet offload mode
            - IPsec packet offload support in eswitch mode
            - improve aRFS observability by adding new set of counters
            - extends MACsec offload support to cover RoCE traffic
            - dynamic completion EQs
         - mlx4:
            - convert to use auxiliary bus instead of custom interface
              logic
      - Intel
         - ice:
            - implement switchdev bridge offload, even for LAG
              interfaces
            - implement SRIOV support for LAG interfaces
         - igc:
            - add support for multiple in-flight TX timestamps
      - Broadcom:
         - bnxt:
            - use the unified RX page pool buffers for XDP and non-XDP
            - use the NAPI skb allocation cache
      - OcteonTX2:
         - support Round Robin scheduling HTB offload
         - TC flower offload support for SPI field
      - Freescale:
         - add XDP_TX feature support
      - AMD:
         - ionic: add support for PCI FLR event
         - sfc:
            - basic conntrack offload
            - introduce eth, ipv4 and ipv6 pedit offloads
      - ST Microelectronics:
         - stmmac: maximze PTP timestamping resolution

   - Virtual NICs:
      - Microsoft vNIC:
         - batch ringing RX queue doorbell on receiving packets
         - add page pool for RX buffers
      - Virtio vNIC:
         - add per queue interrupt coalescing support
      - Google vNIC:
         - add queue-page-list mode support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add port range matching tc-flower offload
         - permit enslavement to netdevices with uppers

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - convert to phylink_pcs
      - Renesas:
         - r8A779fx: add speed change support
         - rzn1: enables vlan support

   - Ethernet PHYs:
      - convert mv88e6xxx to phylink_pcs

   - WiFi:
      - Qualcomm Wi-Fi 7 (ath12k):
         - extremely High Throughput (EHT) PHY support
      - RealTek (rtl8xxxu):
         - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
           RTL8192EU and RTL8723BU
      - RealTek (rtw89):
         - Introduce Time Averaged SAR (TAS) support

   - Connector:
      - support for event filtering"

* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
  net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
  net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
  net: stmmac: clarify difference between "interface" and "phy_interface"
  r8152: add vendor/device ID pair for D-Link DUB-E250
  devlink: move devlink_notify_register/unregister() to dev.c
  devlink: move small_ops definition into netlink.c
  devlink: move tracepoint definitions into core.c
  devlink: push linecard related code into separate file
  devlink: push rate related code into separate file
  devlink: push trap related code into separate file
  devlink: use tracepoint_enabled() helper
  devlink: push region related code into separate file
  devlink: push param related code into separate file
  devlink: push resource related code into separate file
  devlink: push dpipe related code into separate file
  devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
  devlink: push shared buffer related code into separate file
  devlink: push port related code into separate file
  devlink: push object register/unregister notifications into separate helpers
  inet: fix IP_TRANSPARENT error handling
  ...
2023-08-29 11:33:01 -07:00
Andrea Righi a419beac4a module/decompress: use vmalloc() for zstd decompression workspace
Using kmalloc() to allocate the decompression workspace for zstd may
trigger the following warning when large modules are loaded (i.e., xfs):

[    2.961884] WARNING: CPU: 1 PID: 254 at mm/page_alloc.c:4453 __alloc_pages+0x2c3/0x350
...
[    2.989033] Call Trace:
[    2.989841]  <TASK>
[    2.990614]  ? show_regs+0x6d/0x80
[    2.991573]  ? __warn+0x89/0x160
[    2.992485]  ? __alloc_pages+0x2c3/0x350
[    2.993520]  ? report_bug+0x17e/0x1b0
[    2.994506]  ? handle_bug+0x51/0xa0
[    2.995474]  ? exc_invalid_op+0x18/0x80
[    2.996469]  ? asm_exc_invalid_op+0x1b/0x20
[    2.997530]  ? module_zstd_decompress+0xdc/0x2a0
[    2.998665]  ? __alloc_pages+0x2c3/0x350
[    2.999695]  ? module_zstd_decompress+0xdc/0x2a0
[    3.000821]  __kmalloc_large_node+0x7a/0x150
[    3.001920]  __kmalloc+0xdb/0x170
[    3.002824]  module_zstd_decompress+0xdc/0x2a0
[    3.003857]  module_decompress+0x37/0xc0
[    3.004688]  init_module_from_file+0xd0/0x100
[    3.005668]  idempotent_init_module+0x11c/0x2b0
[    3.006632]  __x64_sys_finit_module+0x64/0xd0
[    3.007568]  do_syscall_64+0x59/0x90
[    3.008373]  ? ksys_read+0x73/0x100
[    3.009395]  ? exit_to_user_mode_prepare+0x30/0xb0
[    3.010531]  ? syscall_exit_to_user_mode+0x37/0x60
[    3.011662]  ? do_syscall_64+0x68/0x90
[    3.012511]  ? do_syscall_64+0x68/0x90
[    3.013364]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

However, continuous physical memory does not seem to be required in
module_zstd_decompress(), so use vmalloc() instead, to prevent the
warning and avoid potential failures at loading compressed modules.

Fixes: 169a58ad82 ("module/decompress: Support zstd in-kernel decompression")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-29 09:39:08 -07:00
Linus Torvalds 815c24a085 linux-kselftest-kunit-6.6-rc1
This kunit update for Linux 6.6.rc1 consists of:
 
 -- Adds support for running Rust documentation tests as KUnit tests
 -- Makes init, str, sync, types doctests compilable/testable
 -- Adds support for attributes API which include speed, modules
    attributes, ability to filter and report attributes.
 -- Adds support for marking tests slow using attributes API.
 -- Adds attributes API documentation
 -- Fixes to wild-memory-access bug in kunit_filter_suites() and
    a possible memory leak in kunit_filter_suites()
 -- Adds support for counting number of test suites in a module, list
    action to kunit test modules, and test filtering on module tests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmTsxL8ACgkQCwJExA0N
 Qxwt6BAA5FgF7nUeGRZCnot4MQCNGRThxsns2k3CKjM1Iokp8tstTDoNHXzk2veS
 WlRYOHFQqQOVTVRP+laXyjjMMHnlnhFxqbv93UKsen4JIUJDLFLq9x+0i+0bZh97
 N1rE5cKUnqjAOL6MIJuomW9IzEIrbMcqdljm6SOCZp90NLvq1+I4pDGLgx2bxcow
 Y/7dkx+dnlEsoACZ19CL1L2TaR21GpKdpOudpHNCShsbE0aOAlyHAVcmH64FTqCy
 Z1LtrA0odS71q0yxDVCk5X3cIkeVfGBMz6aMZBRzS9k5jU4H1EN1eG1rGdGErIe5
 YduwX3KMikYJB2stT64T1vgldIpT/emxqkBigmxQ37g3Flgopz4bI1snMBry+nKb
 ViD/WQNjsf2iL8MooCgYBzH7yjmX6lXXQTZXROogBj4lP2/0gHiQVZyXZEAjtoO3
 uNzUbfHQGnvtTphBHV4nNGaO+7kU9Y/oX8TYFcSYJQzcH5UVx16uBwevZjT1bii/
 q89bRAQLnJpzkR93SGpnmsRgoDcYJSYsEA1o/f9Eqq8j3guOS2idpJvkheXq8+A2
 MqTSOCJHENKZ3v0UGKlvZUPStaMaqN58z/VjlWug5EaB83LLfPcXJrGjz/EHk967
 hYDHcwPoamTegr1zlg3ckOLiWEhga2tv6aHPkshkcFphpnhRU/c=
 =Nsb8
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-kunit-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit updates from Shuah Khan:

 - add support for running Rust documentation tests as KUnit tests

 - make init, str, sync, types doctests compilable/testable

 - add support for attributes API which include speed, modules
   attributes, ability to filter and report attributes

 - add support for marking tests slow using attributes API

 - add attributes API documentation

 - fix a wild-memory-access bug in kunit_filter_suites() and a possible
   memory leak in kunit_filter_suites()

 - add support for counting number of test suites in a module, list
   action to kunit test modules, and test filtering on module tests

* tag 'linux-kselftest-kunit-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (25 commits)
  kunit: fix struct kunit_attr header
  kunit: replace KUNIT_TRIGGER_STATIC_STUB maro with KUNIT_STATIC_STUB_REDIRECT
  kunit: Allow kunit test modules to use test filtering
  kunit: Make 'list' action available to kunit test modules
  kunit: Report the count of test suites in a module
  kunit: fix uninitialized variables bug in attributes filtering
  kunit: fix possible memory leak in kunit_filter_suites()
  kunit: fix wild-memory-access bug in kunit_filter_suites()
  kunit: Add documentation of KUnit test attributes
  kunit: add tests for filtering attributes
  kunit: time: Mark test as slow using test attributes
  kunit: memcpy: Mark tests as slow using test attributes
  kunit: tool: Add command line interface to filter and report attributes
  kunit: Add ability to filter attributes
  kunit: Add module attribute
  kunit: Add speed attribute
  kunit: Add test attributes API structure
  MAINTAINERS: add Rust KUnit files to the KUnit entry
  rust: support running Rust documentation tests as KUnit ones
  rust: types: make doctests compilable/testable
  ...
2023-08-28 18:56:38 -07:00
Linus Torvalds ccc5e98177 Power management updates for 6.6-rc1
- Rework the menu and teo cpuidle governors to avoid calling
    tick_nohz_get_sleep_length(), which is likely to become quite
    expensive going forward, too often and improve making decisions
    regarding whether or not to stop the scheduler tick in the teo
    governor (Rafael Wysocki).
 
  - Improve the performance of cpufreq_stats_create_table() in some
    cases (Liao Chang).
 
  - Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal).
 
  - Use clamp() helper macro to improve the code readability in
    cpufreq_verify_within_limits() (Liao Chang).
 
  - Set stale CPU frequency to minimum in intel_pstate (Doug Smythies).
 
  - Migrate cpufreq drivers for various platforms to use void remove
    callback (Yangtao Li).
 
  - Add online/offline/exit hooks for Tegra driver (Sumit Gupta).
 
  - Explicitly include correct DT includes in cpufreq (Rob Herring).
 
  - Frequency domain updates for qcom-hw driver (Neil Armstrong).
 
  - Modify AMD pstate driver return the highest_perf value (Meng Li).
 
  - Generic cleanups for cppc, mediatek and powernow driver (Liao Chang,
    Konrad Dybcio).
 
  - Add more platforms to cpufreq-arm driver's blocklist (AngeloGioacchino
    Del Regno and Konrad Dybcio).
 
  - brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva).
 
  - Add device PM helpers to allow a device to remain powered-on during
    system-wide transitions (Ulf Hansson).
 
  - Rework hibernation memory snapshotting to avoid storing pages filled
    with zeros in hibernation image files (Brian Geffon).
 
  - Add check to make sure that CPU latency QoS constraints do not use
    negative values (Clive Lin).
 
  - Optimize rp->domains memory allocation in the Intel RAPL power
    capping driver (xiongxin).
 
  - Remove recursion while parsing zones in the arm_scmi power capping
    driver (Cristian Marussi).
 
  - Fix memory leak in devfreq_dev_release() (Boris Brezillon).
 
  - Rewrite devfreq_monitor_start() kerneldoc comment (Manivannan
    Sadhasivam).
 
  - Explicitly include correct DT includes in devfreq (Rob Herring).
 
  - Remove unsued pm_runtime_update_max_time_suspended() extern
    declaration (YueHaibing).
 
  - Add turbo-boost support to cpupower (Wyes Karny).
 
  - Add support for amd_pstate mode change to cpupower (Wyes Karny).
 
  - Fix 'cpupower idle_set' command to accept only numeric values of
    arguments (Likhitha Korrapati).
 
  - Clean up OPP code and add new frequency related APIs to it (Viresh
    Kumar, Manivannan Sadhasivam).
 
  - Convert ti cpufreq/opp bindings to json schema (Nishanth Menon).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTslI4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxLMYP/3v0DxA3HZSZ/Xg63P9ylnln084cDt+/
 qpJZ0CJUd6+MkoeuCYq/5udNwPSREsfx+pIEJy+h/iCiQlQz3NzriR7/dgPV0Ud0
 t7k95lyZo+u51MNxk4SEqRMVTyYaNgDPvGbLyWFpLnne3CsxYzfH5xr77yHf342W
 jHii1vJLXiXPnQWDlahf8tUpdQ0MQFmEwx0WkJp81NaAFyXDi0fPrB4YZaZrr6AQ
 3TNaxTxZSirVSn19m5RPPAQhEfK8Dk4jF8wVPWsuL9F6v+9wERD9zcaxUPf3CD36
 aj+SqKLCkOfkJHk45PCIYbS2wQ04fT/yWE9Rzm4iSr+fWA/q7vA0jXsaAgcv1Bm7
 k6QyAy2ffLZTUFObX5bevIPvxZTzunLh0iglHx0WZKS/nn/9Jwpt6UMrpOsjiw/J
 GLKEww+ZiKXj980GfvV2QUZG/XmsrvML/1L+qiDxNB2IPTxxuOxrWQ+cM7oxUTPM
 pdIPIdwkm5ICVRVcAfNw/fr30s2yp1K304VWgzbKdK9b1aVhUSkxZGI8KHFODOHO
 4Crii2rk0r972kxuJmenKwEfmwr/rbAAstFVSM736jH9RUANaWsIeNvkurXMOd2f
 mil9DViTAu0iY4cy5tgLiLHDH4tOQOOCntRVFJ1tSytMyCFlMvVM0dwrc0yh254Q
 zcrNj8ERJSsC
 =6BIh
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These rework cpuidle governors to call tick_nohz_get_sleep_length()
  less often and fix one of them, rework hibernation to avoid storing
  pages filled with zeros in hibernation images, switch over some
  cpufreq drivers to use void remove callbacks, fix and clean up
  multiple cpufreq drivers, fix the devfreq core, update the cpupower
  utility and make other assorted improvements.

  Specifics:

   - Rework the menu and teo cpuidle governors to avoid calling
     tick_nohz_get_sleep_length(), which is likely to become quite
     expensive going forward, too often and improve making decisions
     regarding whether or not to stop the scheduler tick in the teo
     governor (Rafael Wysocki)

   - Improve the performance of cpufreq_stats_create_table() in some
     cases (Liao Chang)

   - Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal)

   - Use clamp() helper macro to improve the code readability in
     cpufreq_verify_within_limits() (Liao Chang)

   - Set stale CPU frequency to minimum in intel_pstate (Doug Smythies)

   - Migrate cpufreq drivers for various platforms to use void remove
     callback (Yangtao Li)

   - Add online/offline/exit hooks for Tegra driver (Sumit Gupta)

   - Explicitly include correct DT includes in cpufreq (Rob Herring)

   - Frequency domain updates for qcom-hw driver (Neil Armstrong)

   - Modify AMD pstate driver return the highest_perf value (Meng Li)

   - Generic cleanups for cppc, mediatek and powernow driver (Liao
     Chang, Konrad Dybcio)

   - Add more platforms to cpufreq-arm driver's blocklist
     (AngeloGioacchino Del Regno and Konrad Dybcio)

   - brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva)

   - Add device PM helpers to allow a device to remain powered-on during
     system-wide transitions (Ulf Hansson)

   - Rework hibernation memory snapshotting to avoid storing pages
     filled with zeros in hibernation image files (Brian Geffon)

   - Add check to make sure that CPU latency QoS constraints do not use
     negative values (Clive Lin)

   - Optimize rp->domains memory allocation in the Intel RAPL power
     capping driver (xiongxin)

   - Remove recursion while parsing zones in the arm_scmi power capping
     driver (Cristian Marussi)

   - Fix memory leak in devfreq_dev_release() (Boris Brezillon)

   - Rewrite devfreq_monitor_start() kerneldoc comment (Manivannan
     Sadhasivam)

   - Explicitly include correct DT includes in devfreq (Rob Herring)

   - Remove unsued pm_runtime_update_max_time_suspended() extern
     declaration (YueHaibing)

   - Add turbo-boost support to cpupower (Wyes Karny)

   - Add support for amd_pstate mode change to cpupower (Wyes Karny)

   - Fix 'cpupower idle_set' command to accept only numeric values of
     arguments (Likhitha Korrapati)

   - Clean up OPP code and add new frequency related APIs to it (Viresh
     Kumar, Manivannan Sadhasivam)

   - Convert ti cpufreq/opp bindings to json schema (Nishanth Menon)"

* tag 'pm-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  cpufreq: tegra194: remove opp table in exit hook
  cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()
  cpufreq: tegra194: add online/offline hooks
  cpuidle: teo: Avoid unnecessary variable assignments
  cpufreq: qcom-cpufreq-hw: add support for 4 freq domains
  dt-bindings: cpufreq: qcom-hw: add a 4th frequency domain
  cpufreq: amd-pstate-ut: Fix kernel panic when loading the driver
  cpufreq: amd-pstate-ut: Remove module parameter access
  cpufreq: Use clamp() helper macro to improve the code readability
  PM: sleep: Add helpers to allow a device to remain powered-on
  PM: QoS: Add check to make sure CPU latency is non-negative
  PM: runtime: Remove unsued extern declaration of pm_runtime_update_max_time_suspended()
  cpufreq: intel_pstate: set stale CPU frequency to minimum
  cpufreq: stats: Improve the performance of cpufreq_stats_create_table()
  dt-bindings: cpufreq: Convert ti-cpufreq to json schema
  dt-bindings: opp: Convert ti-omap5-opp-supply to json schema
  OPP: Fix argument name in doc comment
  cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases
  cpufreq: cppc: Set fie_disabled to FIE_DISABLED if fails to create kworker_fie
  cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.
  ...
2023-08-28 18:04:39 -07:00
Linus Torvalds 97efd28334 Misc x86 cleanups.
The following commit deserves special mention:
 
    22dc02f81c Revert "sched/fair: Move unused stub functions to header"
 
 This is in x86/cleanups, because the revert is a re-application of a
 number of cleanups that got removed inadvertedly.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz
 45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W
 hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP
 1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7
 2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79
 hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD
 uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3
 9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT
 2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP
 auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1
 CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B
 vfaPs1Dz46Q=
 =kiAA
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Ingo Molnar:
 "The following commit deserves special mention:

   22dc02f81c Revert "sched/fair: Move unused stub functions to header"

  This is in x86/cleanups, because the revert is a re-application of a
  number of cleanups that got removed inadvertedly"

[ This also effectively undoes the amd_check_microcode() microcode
  declaration change I had done in my microcode loader merge in commit
  42a7f6e3ff ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]").

  I picked the declaration change by Arnd from this branch instead,
  which put it in <asm/processor.h> instead of <asm/microcode.h> like I
  had done in my merge resolution   - Linus ]

* tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy()
  x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy()
  x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy()
  x86/qspinlock-paravirt: Fix missing-prototype warning
  x86/paravirt: Silence unused native_pv_lock_init() function warning
  x86/alternative: Add a __alt_reloc_selftest() prototype
  x86/purgatory: Include header for warn() declaration
  x86/asm: Avoid unneeded __div64_32 function definition
  Revert "sched/fair: Move unused stub functions to header"
  x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP
  x86/cpu: Fix amd_check_microcode() declaration
2023-08-28 17:05:58 -07:00
Linus Torvalds 3ca9a836ff Scheduler changes for v6.6:
- The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler.
 
   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only.
   It completely reworks the base scheduler: placement, preemption,
   picking -- everything.
 
   LWN.net, as usual, has a terrific writeup about EEVDF:
 
      https://lwn.net/Articles/925371/
 
   Preemption (both tick and wakeup) is driven by testing against
   a fresh pick. Because the tree is now effectively an interval
   tree, and the selection is no longer the 'leftmost' task,
   over-scheduling is less of a problem. A lot of the CFS
   heuristics are removed or replaced by more natural latency-space
   parameters & constructs.
 
   In terms of expected performance regressions: we'll and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases,
   but in some cases it won't, especially in adversarial loads that
   got lucky with the previous code, such as some variants of hackbench.
   We are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations
   of that process.
 
 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again).
 
 - Improve & fix bandwidth-scheduling on nohz systems.
 
 - Improve bandwidth-throttling.
 
 - Use lock guards to simplify and de-goto-ify control flow.
 
 - Misc improvements, cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDOgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iS4g//b9yewVW9OPxetKoN8zIJA0TjFYuuOVHK
 BlCJi5dbzXeCTrtENI65BRA7kPbTQ3AjwLRQ2BallAZ4dJceK0RhlZJvcrMNsm4e
 Adcpoch/FbqPKCrtAJQY04Ln1B244n/KyVifYett9220dMgTFQGJJYxrTc2G2+Kp
 F44vdUHzRczIE+KeOgBild1CwfKv5Zn5xgaXgtuoPLZtWBE0C1fSSzbK/PTINcUx
 bS4NVxK0CpOqSiNjnugV8KsYb71/0U6IgShBVjfHsrlBYigOH2NbVTH5xyjF8f83
 WxiGstlhxj+N6Kv4L6FOJIAr2BIggH82j3FaPACmv4c8pzEoBBbvlAJkfinLEgbn
 Povg3OF2t6uZ8NoHjeu3WxOjBsphbpkFz7H5nno1ibXSIR/JyUH5MdBPSx93QITB
 QoUKQpr/L8zWauWDOEzSaJjEsZbl8rkcIVq5Bk0bR3qn2xkZsIeVte+vCEu3+tBc
 b4JOZjq7AuPDqPnsBLvuyiFZ7zwsAfm+pOD5UF3/zbLjPn1N/7wTNQZ29zjc04jl
 SifpCZGgF1KlG8m8wNTlSfVvq0ksppCzJt+C6VFuejZ191IGpirQHn4Vp0sluMhC
 WRzXhb7v37Bq5JY10GMfeKb/jAiRs68kozhzqVPsBSAPS6I6jJssONgedq+LbQdC
 tFsmE9n09do=
 =XtCD
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler

   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only. It
   completely reworks the base scheduler: placement, preemption, picking
   -- everything

   LWN.net, as usual, has a terrific writeup about EEVDF:

      https://lwn.net/Articles/925371/

   Preemption (both tick and wakeup) is driven by testing against a
   fresh pick. Because the tree is now effectively an interval tree, and
   the selection is no longer the 'leftmost' task, over-scheduling is
   less of a problem. A lot of the CFS heuristics are removed or
   replaced by more natural latency-space parameters & constructs

   In terms of expected performance regressions: we will and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases, but
   in some cases it won't, especially in adversarial loads that got
   lucky with the previous code, such as some variants of hackbench. We
   are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations of
   that process

 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again)

 - Improve & fix bandwidth-scheduling on nohz systems

 - Improve bandwidth-throttling

 - Use lock guards to simplify and de-goto-ify control flow

 - Misc improvements, cleanups and fixes

* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
  sched/eevdf: Curb wakeup-preemption
  sched: Simplify sched_core_cpu_{starting,deactivate}()
  sched: Simplify try_steal_cookie()
  sched: Simplify sched_tick_remote()
  sched: Simplify sched_exec()
  sched: Simplify ttwu()
  sched: Simplify wake_up_if_idle()
  sched: Simplify: migrate_swap_stop()
  sched: Simplify sysctl_sched_uclamp_handler()
  sched: Simplify get_nohz_timer_target()
  sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
  sched/rt: Fix sysctl_sched_rr_timeslice intial value
  sched/fair: Block nohz tick_stop when cfs bandwidth in use
  sched, cgroup: Restore meaning to hierarchical_quota
  MAINTAINERS: Add Peter explicitly to the psi section
  sched/psi: Select KERNFS as needed
  sched/topology: Align group flags when removing degenerate domain
  sched/fair: remove util_est boosting
  sched/fair: Propagate enqueue flags into place_entity()
  ...
2023-08-28 16:43:39 -07:00
Linus Torvalds 1a7c611546 Perf events changes for v6.6:
- AMD IBS improvements
 - Intel PMU driver updates
 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events
 - Micro-optimize software events and the ring-buffer code
 - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtBscRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hHoQ/+IBQ8Xi/rcdd40n8OqEB/VBWVuSjNT3uN
 3pHHcTl2Pio9CxBeat42NekNijlRILCKJrZ3Lt3JWBmWyWv5l3KFabelj+lDF2xa
 TVCjTnQNe1+HvrODYnF4ECIs5vaoMVjcJ9jg8+VDgAcOQr1nZs4m5TVAd6TLqPpV
 urBEQVULkkzk7ZRhfrugKhw+wrpWFefgGCx0RV8ijZB7TLMHc2wE+Q/sTxKdKceL
 wNaJaDgV33pZh0aImwR9pKUE532hF1FiBdLuehkh61PZa1L82jzAX1xjw2s1hSa4
 eIWemPHJIYfivRlENbJsDWc4N8gk6ijVHwrxGcr4Axu+NN+zPtQ3ddhaGMAyKdTo
 qUKXH3MZSMIl++jI5Fkc6xM+XLvY1rML62epSzMwu/cc7Z5MeyWdQcri0N9YFuO7
 wUUNnFpU00lwQBLbyyUQ3Zi8E0QV7NuPW4axTkmntiIjMpLagaEvVSf6nf8qLpbE
 WTT16s707t19hUZNazNZ7ONmhly4ALbHFQEH65J2KoYn99fYqy9z68Hwk+xnmykw
 bc3qvfhpw0MImQQ+DqHiBwb4n4UuvY2WlkkZI3FfNeSG63DaM2mZikfpElpXYjn6
 9iOIXvx21Wiq/n0cbLhidI2q/ZzFCzYLCk6ikZ320wb+rhvd7EoSlZil6QSzn3pH
 Qdk+NEZgWQY=
 =ZT6+
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - AMD IBS improvements

 - Intel PMU driver updates

 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events

 - Micro-optimize software events and the ring-buffer code

 - Misc cleanups & fixes

* tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call
  perf/x86/intel: Add Crestmont PMU
  x86/cpu: Update Hybrids
  x86/cpu: Fix Crestmont uarch
  x86/cpu: Fix Gracemont uarch
  perf: Remove unused extern declaration arch_perf_get_page_size()
  perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability
  perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src
  perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA
  perf/mem: Introduce PERF_MEM_LVLNUM_UNC
  perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin
  locking/arch: Avoid variable shadowing in local_try_cmpxchg()
  perf/core: Use local64_try_cmpxchg in perf_swevent_set_period
  perf/x86: Use local64_try_cmpxchg
  perf/amd: Prevent grouping of IBS events
2023-08-28 16:35:01 -07:00
Linus Torvalds 6f49693a6c Updates for the CPU hotplug core:
- Support partial SMT enablement.
 
     So far the sysfs SMT control only allows to toggle between SMT on and
     off. That's sufficient for x86 which usually has at max two threads
     except for the Xeon PHI platform which has four threads per core.
 
     Though PowerPC has up to 16 threads per core and so far it's only
     possible to control the number of enabled threads per core via a
     command line option. There is some way to control this at runtime, but
     that lacks enforcement and the usability is awkward.
 
     This update expands the sysfs interface and the core infrastructure to
     accept numerical values so PowerPC can build SMT runtime control for
     partial SMT enablement on top.
 
     The core support has also been provided to the PowerPC maintainers who
     added the PowerPC related changes on top.
 
   - Minor cleanups and documentation updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsj4wTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaszEADKMd/6m7/Bq7RU2OJ+IXw8yfMEF9nS
 6HPrFu71a4cDufb/G8UckQOvkwdTFWD7bZ0snJe2sBDFTOtzK/inYkgPZTxlm7si
 JcJmFnHKUM7OTwNZb7Tv1bd9Csz4JhggAYUw6P8CqsCmhQ+p6ECemx3bHDlYiywm
 5eW2yzI9EM4dbsHPwUOvjI0WazGvAf0esSDAS8JTnhBXbd8FAckbMV+xuRPcCUK+
 dBqbqr+3Nf4/wcXTro/gZIc7sEATAHH6m7zHlLVBSyVPnBxre8NLz6KciW4SezyJ
 GWFnDV03mmG2KxQ2ugwI8n6M3zDJQtfEJFwW/x4t2M5RK+ka2a6G6GtCLHYOXLWR
 akIuBXtTAC57BgpqzBihGej9eiC1BJ1QMa9ZK+6WDXSZtMTFOLlbwdY2/qyfxpfw
 LfepWb+UMtFy5YyW84S1O5/AqpOtKD2kPTqfDjvDxWIAigispU+qwAKxcMzMjtwz
 aAlf2Z/iX0R9DkRzGD2gaFG5AUsRich8RtVO7u+WDwYSsi8ywrvryiPlZrDDBkSQ
 sRzdoHeXNGVY/FgkbZmEyBj4udrypymkR6ivqn6C2OrysgznSiv5NC983uS6TfJX
 cVqdUv6CNYYNiNu0x0Qf0MluYT2s5c1Fa4bjCBJL+KwORwjM3+TCN9RA1KtFrW2T
 G3Ta1KqI6wRonA==
 =JQRJ
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:
 "Updates for the CPU hotplug core:

   - Support partial SMT enablement.

     So far the sysfs SMT control only allows to toggle between SMT on
     and off. That's sufficient for x86 which usually has at max two
     threads except for the Xeon PHI platform which has four threads per
     core

     Though PowerPC has up to 16 threads per core and so far it's only
     possible to control the number of enabled threads per core via a
     command line option. There is some way to control this at runtime,
     but that lacks enforcement and the usability is awkward

     This update expands the sysfs interface and the core infrastructure
     to accept numerical values so PowerPC can build SMT runtime control
     for partial SMT enablement on top

     The core support has also been provided to the PowerPC maintainers
     who added the PowerPC related changes on top

   - Minor cleanups and documentation updates"

* tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: core-api/cpuhotplug: Fix state names
  cpu/hotplug: Remove unused function declaration cpu_set_state_online()
  cpu/SMT: Fix cpu_smt_possible() comment
  cpu/SMT: Allow enabling partial SMT states via sysfs
  cpu/SMT: Create topology_smt_thread_allowed()
  cpu/SMT: Remove topology_smt_supported()
  cpu/SMT: Store the current/max number of threads
  cpu/SMT: Move smt/control simple exit cases earlier
  cpu/SMT: Move SMT prototypes into cpu_smt.h
  cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
2023-08-28 15:04:43 -07:00
Linus Torvalds dd3f0fe501 Boring updates for the interrupt subsystem:
Core:
 
     - Prevent a deadlock of nested interrupt threads vs.
       synchronize_hard()
 
     - Removal of a stale extern declaration
 
   Drivers:
 
     - The first new driver since v6.2 for Amlogic-C3 SoCs
 
     - The usual small fixes, cleanups and improvements all over
       the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsjR0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofmLEAC5anouyAUbGjl/cL//+2GkvWB2YgO2
 +D7q8tx3Tt5US/vpiYaDvFs+at+2lAyB6M/KUkvYSCne9bm80+YqaL+73iM6YQKH
 yNDrAnLR1FA4+fHIvvhmk23U1uUjWgSTL7iKufgNWf8I0aYsWLTIX3N6m0606ZLE
 eUNIf7w+aZRr/axHdadRQpib6l1fvfA3C72urPRBnZDA56ZDAgE9tS0kfk9D+3sW
 BgXRp4knvHBf6I4RdA10hHDTa1RuX9xkDeAC1a/ljWpbCEgEDPJ+5JI+TD+fU/d5
 TCVGa7GwqJc2srRFwy76/t0jQrG7DnwW56SsMomjS+vjIu4exNFwXJ6LqZSJacwa
 Z3HB0Py3awQWPfHdFqdF9LHyum+a58RHX96RenlL8Q/42qe5K6RmAIfcAaiy2OpL
 xAGy9+nplMWh+qde9q1o30WPr08GhhDEXrdHZdAAODjBeoUDGmFooH5NHAFjw2+Q
 ba15/f7Nl8KIl854OUJv4cftNEv5klpueLR/YUviivoO55vydRae/k/CSPhvt7TN
 VIQ+vgiaiOCEwAAx2kP7Au0ADeEMCYiEqH9KWBp33dvjNZMt2DbAGLDWagcy8N9y
 R8ms4c5e7Z2MvN9Z6YDihQ1XvkQsdX/dWwJq3weH3c/tP1MBFFHZYdeQhIVKTIKR
 4zFKi4jrlmn0vQ==
 =jiUK
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Boring updates for the interrupt subsystem:

  Core:

   - Prevent a deadlock of nested interrupt threads vs.
     synchronize_hard()

   - Removal of a stale extern declaration

  Drivers:

   - The first new driver since v6.2 for Amlogic-C3 SoCs

   - The usual small fixes, cleanups and improvements all over the
     place"

* tag 'irq-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: Add support for Amlogic-C3 SoCs
  dt-bindings: interrupt-controller: Add support for Amlogic-C3 SoCs
  irqchip/irq-mvebu-sei: Use devm_platform_get_and_ioremap_resource()
  irqchip/ls-scfg-msi: Use devm_platform_get_and_ioremap_resource()
  irqchip: Explicitly include correct DT includes
  irqchip/orion: Use of_address_count() helper
  irqchip/irq-pruss-intc: Do not check for 0 return after calling platform_get_irq()
  irqchip/imx-mu-msi: Do not check for 0 return after calling platform_get_irq()
  irqchipr/i8259: Mark i8259_of_init() static
  irqchip/mips-gic: Mark gic_irq_domain_free() static
  irqchip/xtensa-pic: Include header for xtensa_pic_init_legacy()
  irqchip/loongson-eiointc: Fix return value checking of eiointc_index
  genirq: Remove unused extern declaration
  genirq: Prevent nested thread vs synchronize_hardirq() deadlock
2023-08-28 14:33:11 -07:00
Linus Torvalds 6bfce7759c A single update to the core entry code, which removes the empty user
address limit check which is a leftover of the removed TIF_FSCHECK.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsi5ETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZRGEACDGWXaadhCYNpHftScC6X3z2d3DXZX
 E5PAiVtq/sQX8FV2KJgMxywxQ4QEJjtX9lgoxGMKc7zai2FgKbWzPnlz6qz3przh
 P5Ji+sBl0DbpVnLMaAAVzikvWpjFfSbf/b8lHA3EVs/9HkPfZY7rwByONsVkxX2e
 6AS8tOv0CJP3lMaZ02tDs48PWOeF1CEpub9Eg5JfYG+CTU0gy+wFMnIUCkN/eP2E
 CYNo6wTFjBQ43S7GWrqA6eYgbHLBBvOuHLHM3RlLOm2Rexct/umf84At9K9wUJvJ
 mGSrZKsgD3UZJi7HpF5RXsY88+4uV38vhkN6LGRdHrarLz3WMvnc191WP7iwCBmo
 HGIgWWxm9+bGAxiw9wTNgmERvwKBeMNNQEDu/58An637VDucrYOlRi2Mh0CE5QiG
 i1R+KiKBUZw3Blogx+O65m0PyXpJQqHfr2WkfT+uKJCs7wRBdupmWv+ZAcSj6tys
 ILqCHRmI4n46T2qp67/M6FbYTrk0DNWsjgjtUgLBquEsj6z00favxAug5NrJV6+c
 5/kf7C97h1TmtqqNjtL4uwfWGm2bqc6AZyMpsk0KqnirywmnkgIKOWHu//TwQVJs
 jpwRvsAv3UNnUrO6qtqNzbNDQQ0MOLAAuDgardGWW7gEEhvaa+HdbwyjZSwDZvZy
 b8PLikU7gRB9rQ==
 =W9kd
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry code update from Thomas Gleixner:
 "A single update to the core entry code, which removes the empty user
  address limit check which is a leftover of the removed TIF_FSCHECK"

* tag 'core-entry-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Remove empty addr_limit_user_check()
2023-08-28 14:04:55 -07:00
Linus Torvalds b98af53cb0 Clocksource watchdog commits for v6.6
This pull reqeust contains the following:
 
 o	Handle negative skews in "skew is too large" messages.
 
 o	Extend watchdog check exemption to 4-Socket platforms
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcF94THHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jPeHEACXvYFMwno+1DPlt2PtFJglnkuYyOUQ
 SwRRiOLWXdzsikzjWIvdrSOqslbdIjhZmSSyPDjyo4GPYFSSOyKNsqczwsX8R49u
 O/2Yzkrx2OmIWcKiAkII8Iw4fFedoITtzC59wbZHoo/+upEpbZYP3u7AjJTird7y
 du3WcWcGc1eEt5+7MNwbZfwpzo2t5Rb3Wqfgs6vnKTG7Abc/23uChsCBzPavX7X/
 djNd1bA5YmEldKKxSoF5XSW/F1TWIA4fXMDkBwgRKHBx1Y7xU+nJMtam4ogAzN6a
 4zgthgy5wQ7/VnTBv2rmQQb6ae3Blm09Yg1ac8zt95RLgmGkyX73lZGnRBTdYqyh
 kb47Tfw3a+e+VPTD+W3rY1NOSOwbstdDHVckK+0bFvqNyXOoaEJL+EEOhm9rPXxv
 le+T6Ct1VPAF9lHPUz7lVCVXN91vP4Gqrxjmeq5rqWNOvRW3jBcCLnEpFzWJtu50
 JjQBi3HA0HW+Bxqov22W0llFAa0gVm8xyxXfNSSL7VoCinnS6/qyQvD1GoG0brk3
 l5orOk38m2/acvTyvw2tnvAAuqmOr+oGlcQhJOOVl7jDz+sae6RMTwWCMaVxKUDo
 YW7v5YFQePmzC9J4M5XFMdCCTb3cUCjVLMdsPgONM3kn9ALEDhhTBSw76N//bGLg
 4/OEsT7Aq6LHkA==
 =6/HI
 -----END PGP SIGNATURE-----

Merge tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull clocksource watchdog updates from Paul McKenney:

 - Handle negative skews in "skew is too large" messages

 - Extend watchdog check exemption to 4-Socket platforms

* tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  clocksource: Handle negative skews in "skew is too large" messages
2023-08-28 13:59:46 -07:00
Linus Torvalds b324696dce CSD lock commits for v6.5
This series reduces the number of stack traces dumped during CSD-lock
 debugging.  This helps to avoid console overrun on systems with large
 numbers of CPUs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmSxzCITHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jK3wD/9bNte4pbnG5CmJvu+bh1idbD+ZIyaU
 13mTg1J35XurW/QlmbOKoNcdylvmE4QdzUxURXgv4FHO7wdiARBGgrz9ArKjgq+e
 NfklTr4EY9UbRb27tO1iJDUiP6ZZB3fw+gYs2zmJMkn5CqK+rkUrTdcMa8EFHvlT
 vf6OL8xeFjsrTCWfYTAYJU1Yp+0UOiO+BRwzq4u76Wzpex79EiMEE2lLeRZfXhz9
 mF704EXn7VEkfRo50GlGOjVkezghlItXlaUCV2eQ4T6/LwXgreStCTKfhrDA5Qs2
 mAQ5OMZJztlbUWcVrEPZMpQ6pXWaJx5qoMZ1uP8Obec89ocr+/DuL1Myyaau9g+H
 rCYA9Om4XfAd2JURrxOIlKQ7SmvRJNZWpv0DHizTfWpSTOumtON2RyVlC2EYwx9Q
 2ZL4Eo99VzYcAXWx8KiGpF6CtW67VXKZsHwTtJegu4Vkjk9wOt5Sa9svhiHv0Kz4
 veYE9XuOH5+tIfN6tP4eikI+4VJOVhudsOKiXCjhoscy+1/gtXRH5WgYwvSiopWo
 nEsj05V7U0hWdsPpu7niZ982vAU1eHC7EeQt+pc5f17NeNr51xG3Oj3xyy14yzFC
 TbEyOft0MEsJ8NkC93FCbNqere7dk6k+bxVqvoWQ5tDfsEhIaVn+HzVhPdsluvfP
 1JixcSuqZ42RqA==
 =t4s8
 -----END PGP SIGNATURE-----

Merge tag 'csd-lock.2023.07.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull CSD lock updates from Paul McKenney:
 "This series reduces the number of stack traces dumped during CSD-lock
  debugging. This helps to avoid console overrun on systems with large
  numbers of CPUs"

* tag 'csd-lock.2023.07.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  smp: Reduce NMI traffic from CSD waiters to CSD destination
  smp: Reduce logging due to dump_stack of CSD waiters
2023-08-28 13:46:41 -07:00
Linus Torvalds 6ae0c15765 smp_call_function torture-test updates for v6.6
This pull request prevents some memory-exhaustion false-postitive failures
 in scftorture testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcFWYTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jBPhD/9jqgTgPuF3bmRVpkggIXHN0wCTihS9
 BNQPaUdLRmTymtZAecaPOdRvPPMUvqjOK5dS8sx7rnoyU+qr33mUkRzSFCIrsGHM
 62FowQ4grokOkQnJYUpVuLhitYwwmWi7aKi5T2Xolc4ooSIpWZe/NPoiteGkm4lc
 nuA84DcV51rRykjBjW3LIrffoi9fu3lU65FsAjQttG7OZwWmAjhhHl29loCPlG3F
 +Ui+0p+cp8WAB/2J0B/6aHTqK6JJoV0t/gzKpzYvI/Gydz/7PaYjdBhPCSxHcsXd
 LMf+OO5/LtGfw4kcYF/8O4Ir0t4F681iOXlz06op2P2OT90S0O31SGUWznKMVq3E
 V307I9LnfT5Jo2aK9xD4ad8GM9rMKb9btc284QvaYAjCUD5RBoyA/S1d6e0u9rt3
 oK7rJWIG9bzCbZ7R7xXCzpkCYw98npVeDxS9gdwWSCA0vBwmhF8BbVQODZ/u+YQ0
 TQyTSankebeaoINeieto0ZAbK9iDSbsnTmKZ146hoLGFshDFN7qPOL4PggXPqw5B
 CXILQH+SOjMO+JaIrd4iOr172REzp1/64K4szaheV4LxyEwC/QJBdxhajdpJOTOS
 LowIG+LIIElr8dPIiiEIBVAaTehadgqA1+5zIcevt8OSMb7KOoB6FkXKj/9kWOfD
 PwFfqskEYoY8xQ==
 =8rLK
 -----END PGP SIGNATURE-----

Merge tag 'scftorture.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull smp_call_function torture-test updates from Paul McKenney:
 "This prevents some memory-exhaustion false-postitive failures in
  scftorture testing"

* tag 'scftorture.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  scftorture: Add CONFIG_PREEMPT_DYNAMIC=n to NOPREEMPT scenario
  scftorture: Pause testing after memory-allocation failure
  scftorture: Forgive memory-allocation failure if KASAN
  torture: Scale scftorture memory based on number of CPUs
2023-08-28 13:42:29 -07:00
Linus Torvalds 68cadad11f RCU pull request for v6.6
doc.2023.07.14b: Documentation updates.
 
 fixes.2023.08.16a: Miscellaneous fixes, perhaps most notably simplifying
 	SRCU_NOTIFIER_INIT() as suggested.
 
 rcu-tasks.2023.07.24a:  RCU Tasks updates, most notably treating
 	Tasks RCU callbacks as lazy while still treating synchronous
 	grace periods as urgent.  Also fixes one bug that restores the
 	ability to apply debug-objects to RCU Tasks and another that
 	fixes a race condition that could result in false-positive
 	failures of the boot-time self-test code.
 
 rcuscale.2023.07.14b: RCU-scalability performance-test updates,
 	most notably adding the ability to measure the RCU-Tasks's
 	grace-period kthread's CPU consumption.  This proved
 	quite useful for the rcu-tasks.2023.07.24a work.
 
 refscale.2023.07.14b: Reference-acquisition/release performance-test
 	updates, including a fix for an uninitialized wait_queue_head_t.
 
 torture.2023.08.14a: Miscellaneous torture-test updates.
 
 torturescripts.2023.07.20a: Torture-test scripting updates, including
 	removal of the non-longer-functional formal-verification scripts,
 	test builds of individual RCU Tasks flavors, better diagnostics
 	for loss of connectivity for distributed rcutorture tests,
 	disabling of reboot loops in qemu/KVM-based rcutorture testing,
 	and passing of init parameters to rcutorture's init program.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTjkssTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jITND/9zEqYNbeFrcBs/YaHdoAjsNgOt1IYN
 csfF/KArVgdvmrwlV/nEaQMLaJcw9X7DVU5+7E2JbbDaB/2FSacseNyKk6mfgSVK
 /0rnTOXpqI9/T1HiJObWZvDQFuKL12bfteXWGJg1sMt2JUGZ4nAWhdZ3xRjp2XkO
 89qB5r0fF8gyGwvQ3M29ss8T9Oy0uUNJmDY/QyVxHM6dhkpSAezFffKzD7C4zkSV
 WucRTpYJ7bs6otBGtVmwz3x60UAuLwcVfQyB+CTbnGLsps9yAYU+1DDVdm7olcr3
 ARXMeboeodMvy9jWXhtbWRVAAob4lVUDXQN27kb4sBgroRQBfQXMuByRAU6s0VtX
 frOl6rlbORuAetsC8wFL0IFVn4yTpvXKbYw7h1MXTs7gVVbl33O9FieGvWu0r79/
 VR4Xw+JbmYWtyvFV8Zaq4iIEcOe+PeNH6u0bPx+htsHYd1+DUG2UY0MVmJQ3a4sb
 ygejA6mguCk7KBzWab8wdDpgAfhNwg0T9a+LQYcaskuD5SSWjYqqg6i1ulqqqyiE
 bOfRKDX4mWmAobWKHLssqUrjiLbxfygIaHjCrt7rWJKPIs1bK/WfWa4JbrE0NRwK
 9IDd1lWc9C+zoUpjyZWSG3ahK5lWo2u4sPNoRtMQjowjobIz1cBhaEwmFe72bG7C
 FCKb7Da2oUaLOw==
 =EujZ
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.08.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably simplifying
   SRCU_NOTIFIER_INIT() as suggested

 - RCU Tasks updates, most notably treating Tasks RCU callbacks as lazy
   while still treating synchronous grace periods as urgent. Also fixes
   one bug that restores the ability to apply debug-objects to RCU Tasks
   and another that fixes a race condition that could result in
   false-positive failures of the boot-time self-test code

 - RCU-scalability performance-test updates, most notably adding the
   ability to measure the RCU-Tasks's grace-period kthread's CPU
   consumption. This proved quite useful for the RCU Tasks work

 - Reference-acquisition/release performance-test updates, including a
   fix for an uninitialized wait_queue_head_t

 - Miscellaneous torture-test updates

 - Torture-test scripting updates, including removal of the
   non-longer-functional formal-verification scripts, test builds of
   individual RCU Tasks flavors, better diagnostics for loss of
   connectivity for distributed rcutorture tests, disabling of reboot
   loops in qemu/KVM-based rcutorture testing, and passing of init
   parameters to rcutorture's init program

* tag 'rcu.2023.08.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (64 commits)
  rcu: Use WRITE_ONCE() for assignments to ->next for rculist_nulls
  rcu: Make the rcu_nocb_poll boot parameter usable via boot config
  rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load
  srcu,notifier: Remove #ifdefs in favor of SRCU Tiny srcu_usage
  rcutorture: Stop right-shifting torture_random() return values
  torture: Stop right-shifting torture_random() return values
  torture: Move stutter_wait() timeouts to hrtimers
  torture: Move torture_shuffle() timeouts to hrtimers
  torture: Move torture_onoff() timeouts to hrtimers
  torture: Make torture_hrtimeout_*() use TASK_IDLE
  torture: Add lock_torture writer_fifo module parameter
  torture: Add a kthread-creation callback to _torture_create_kthread()
  rcu-tasks: Fix boot-time RCU tasks debug-only deadlock
  rcu-tasks: Permit use of debug-objects with RCU Tasks flavors
  checkpatch: Complain about unexpected uses of RCU Tasks Trace
  torture: Cause mkinitrd.sh to indicate failure on compile errors
  torture: Make init program dump command-line arguments
  torture: Switch qemu from -nographic to -display none
  torture: Add init-program support for loongarch
  torture: Avoid torture-test reboot loops
  ...
2023-08-28 13:19:28 -07:00
Linus Torvalds 727dbda16b hardening updates for v6.6-rc1
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
   CONFIG_DEBUG_LIST (Marco Elver).
 
 - Fix kallsyms lookup failure under Clang LTO (Yonghong Song).
 
 - Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn).
 
 - Flexible array member conversion not carried in other tree (Gustavo
   A. R. Silva).
 
 - Various strlcpy() and strncpy() removals not carried in other trees
   (Azeem Shaikh, Justin Stitt).
 
 - Convert nsproxy.count to refcount_t (Elena Reshetova).
 
 - Add handful of __counted_by annotations not carried in other trees,
   as well as an LKDTM test.
 
 - Fix build failure with gcc-plugins on GCC 14+.
 
 - Fix selftests to respect SKIP for signal-delivery tests.
 
 - Fix CFI warning for paravirt callback prototype.
 
 - Clarify documentation for seq_show_option_n() usage.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs6ZAWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkpjD/9AeST5Imc2t0t71Qd+wPxW3jT3
 kDZPlHH8wHmuxSpRscX82m21SozvEMvybo6Cp7FSH4qr863FnBWMlo8acr7rKxUf
 0f7Y9qgY/hKADiVx5p0pbnCgcy+l4pwsxIqVCGuhjvNCbWHrdGqLM4UjIfaVz5Ws
 +55a/C3S1KVwB1s1+6to43jtKqQAx6yrqYWOaT3wEfCzHC87f9PUHhIGnFQVwPGP
 WpjQI/BQKpH7+MDCoJOPrZqXaE/4lWALxR6+5BBheGbvLoWifpJEYHX6bDUzkgBz
 liQDkgr4eAw5EXSOS7mX3EApfeMKakznJt9Mcmn0h3pPRlM3ZSVD64Xrou2Brpje
 exS2JRuh6HwIiXY9nTHc6YMGcAWG1syAR/hM2fQdujM0CWtBUk9+kkuYWsqF6nIK
 3tOxYLB/Ph4p+tShd+v5R3mEmp/6snYRKJoUk+9Fk67i54NnK4huyxaCO4zui+ML
 3vHuGp8KgFHUjJaYmYXHs3TRZnKSFUkPGc4MbpiGtmJ9zhfSwlhhF+yfBJCsvmTf
 ZajA+sPupT4OjLxU6vUD/ZNkXAEjWzktyX2v9YBA7FHh7SqPtX9ARRIxh417AjEJ
 tBPHhW/iRw9ftBIAKDmI7gPLynngd/zvjhvk6O5egHYjjgRM1/WAJZ4V26XR6+hf
 TWfQb7VRzdZIqwOEUA==
 =9ZWP
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "As has become normal, changes are scattered around the tree (either
  explicitly maintainer Acked or for trivial stuff that went ignored):

   - Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
     CONFIG_DEBUG_LIST (Marco Elver)

   - Fix kallsyms lookup failure under Clang LTO (Yonghong Song)

   - Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn)

   - Flexible array member conversion not carried in other tree (Gustavo
     A. R. Silva)

   - Various strlcpy() and strncpy() removals not carried in other trees
     (Azeem Shaikh, Justin Stitt)

   - Convert nsproxy.count to refcount_t (Elena Reshetova)

   - Add handful of __counted_by annotations not carried in other trees,
     as well as an LKDTM test

   - Fix build failure with gcc-plugins on GCC 14+

   - Fix selftests to respect SKIP for signal-delivery tests

   - Fix CFI warning for paravirt callback prototype

   - Clarify documentation for seq_show_option_n() usage"

* tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  LoadPin: Annotate struct dm_verity_loadpin_trusted_root_digest with __counted_by
  kallsyms: Change func signature for cleanup_symbol_name()
  kallsyms: Fix kallsyms_selftest failure
  nsproxy: Convert nsproxy.count to refcount_t
  integrity: Annotate struct ima_rule_opt_list with __counted_by
  lkdtm: Add FAM_BOUNDS test for __counted_by
  Compiler Attributes: counted_by: Adjust name and identifier expansion
  um: refactor deprecated strncpy to memcpy
  um: vector: refactor deprecated strncpy
  alpha: Replace one-element array with flexible-array member
  hardening: Move BUG_ON_DATA_CORRUPTION to hardening options
  list: Introduce CONFIG_LIST_HARDENED
  list_debug: Introduce inline wrappers for debug checks
  compiler_types: Introduce the Clang __preserve_most function attribute
  gcc-plugins: Rename last_stmt() for GCC 14+
  selftests/harness: Actually report SKIP for signal tests
  x86/paravirt: Fix tlb_remove_table function callback prototype warning
  EISA: Replace all non-returning strlcpy with strscpy
  perf: Replace strlcpy with strscpy
  um: Remove strlcpy declaration
  ...
2023-08-28 12:59:45 -07:00
Linus Torvalds b03a434214 seccomp updates for v6.6-rc1
- Provide USER_NOTIFY flag for synchronous mode (Andrei Vagin, Peter
   Oskolkov). This touches the scheduler and perf but has been Acked by
   Peter Zijlstra.
 
 - Fix regression in syscall skipping and restart tracing on arm32.
   This touches arch/arm/ but has been Acked by Arnd Bergmann.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs418WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJohpD/4tEfRdnb/KDgwQ7uvqBonUJXcx
 wqw17LZCGTpBV3/Tp3+aEseD1NezOxiMJL88VyUHSy7nfDJShbL6QtyoenwEOeXJ
 HmBUfcIH3cqRutHEJ3drYBzBetpeeK2G+gTYVj+JoEfPWyPf+Egj+1JE2n1xLi92
 WC1miBAyBZ59kN+D1hcDzJu24CkAwbcUYlEzGejN5lBOwxYV3/fjARBVRvefOO5m
 jljSCIVJOFgCiybKhJ7Zw1+lkFc3cIlcOgr4/ZegSc8PxFVebnuImTHHp/gvoo6F
 7d1xe5Hk+PSfNvVq41MAeRB2vK2tY5efwjXRarThUaydPTO43KiQm0dzP0EYWK9a
 LcOg8zAXZnpvuWU5O2SqUKADcxe2TjS1WuQ/Q4ixxgKz2kJKDwrNU8Frf327eLSR
 acfZgMMiUfEXyXDV9B3LzNAtwdvwyxYrzEzxgKywhThIhZmQDat0rI2IaTV5QIc5
 pkxiFEe0TPwpzyUVO9dSzE+ughTmNQOKk5uAM9e2NwRwVdhEmlZAxo0kStJ1NoaA
 yDjYIKfaNBElchL4v2931KJFJseI+uRaWdW10JEV+1M69+gEAEs6wbmAxtcYS776
 xWsYp3slXzlmeVyvQp/ah8p0y55r+qTbcnhkvIdiwLYei4Bh3KOoJUlVmW0V5dKq
 b+7qspIvBA0kKRAqPw==
 =DI8R
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:

 - Provide USER_NOTIFY flag for synchronous mode (Andrei Vagin, Peter
   Oskolkov). This touches the scheduler and perf but has been Acked by
   Peter Zijlstra.

 - Fix regression in syscall skipping and restart tracing on arm32. This
   touches arch/arm/ but has been Acked by Arnd Bergmann.

* tag 'seccomp-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Add missing kerndoc notations
  ARM: ptrace: Restore syscall skipping for tracers
  ARM: ptrace: Restore syscall restart tracing
  selftests/seccomp: Handle arm32 corner cases better
  perf/benchmark: add a new benchmark for seccom_unotify
  selftest/seccomp: add a new test for the sync mode of seccomp_user_notify
  seccomp: add the synchronous mode for seccomp_unotify
  sched: add a few helpers to wake up tasks on the current cpu
  sched: add WF_CURRENT_CPU and externise ttwu
  seccomp: don't use semaphore and wait_queue together
2023-08-28 12:38:26 -07:00
Linus Torvalds 615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Linus Torvalds 3b35375f19 A last minute fix for a regression introduced in the v6.5 merge window. The
conversion of the software based interrupt resend mechanism to hlist missed
 to add a check whether the descriptor is already enqueued and dropped the
 interrupt descriptor lookup for nested interrupts.
 
 The missing check whether the descriptor is already queued causes hlist
 corruption and can be observed in the wild. The dropped parent descriptor
 lookup has not yet caused problems, but it would result in stale interrupt
 line in the worst case.
 
 Add the missing enqueued check and bring the descriptor lookup back to cure
 this.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTqNLQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTeBD/0b8zbNhmO5TXhP6GrCPXahFM6aTmyK
 NveZMzh1c7tQZzMBNNEnRoaYvmcgPOviZ1Yi3+/Hs3oaR/b6nLt36K8+MRC7J+15
 j6cIylmpTp9eH5Na3IT1wmTNfCVAdoejoZVYq4PPHAHUrzqu7ESOTLzHbPmWS97E
 VGdvUrKnQ7J4ajOZn7bXWaia+qCuIij87CYAKH++c9JVMIc0iTs2Zd7FG2sncgLm
 OJdvjmMy/qN9a1jYdM4DrGOS8HBdvuYb9EEDuZB4NEY3nBR+svQqBHsD462LgxNe
 +OTzLBVMoP9heKbyTU9357PUq2qz6OmpC0vE1n5XgkSEdrvm9x1UjYcPQnagRm25
 JZp/pEI/ryD8oGQNWzsPe7PDyyKHV5F0Q1KPHGUvvEJxwF+USVe9Zm6damfZvGeA
 dp34zYg0mFCH0hmqdYs6+cc8sJcEy8aR8FFUgI1Uj5nr9zZ3vV7WTsOjJ12NDFo/
 L+oDKz6/sdL2X/EKddP3ffQrImPF8DdSYfEPmoukTMhihfgXewBlgvg3b9HekVVm
 9j7UhqsQw/mdPcTpkM6cd5ngxB71X64gMjAfotwsproJg/EUw978CM++9sGKmKy8
 jU7hlgZQ3DniSCyCpXB/7vZxAFej8TKTWmTc4KZYKiMfej2vqI3FjA3KLGY6GzK+
 ls/Rm57EOhKZlw==
 =Snax
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
 "A last minute fix for a regression introduced in the v6.5 merge
  window.

  The conversion of the software based interrupt resend mechanism to
  hlist missed to add a check whether the descriptor is already enqueued
  and dropped the interrupt descriptor lookup for nested interrupts.

  The missing check whether the descriptor is already queued causes
  hlist corruption and can be observed in the wild. The dropped parent
  descriptor lookup has not yet caused problems, but it would result in
  stale interrupt line in the worst case.

  Add the missing enqueued check and bring the descriptor lookup back to
  cure this"

* tag 'irq-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix software resend lockup and nested resend
2023-08-26 10:34:29 -07:00
Johan Hovold 9f5deb5516 genirq: Fix software resend lockup and nested resend
The switch to using hlist for managing software resend of interrupts
broke resend in at least two ways:

First, unconditionally adding interrupt descriptors to the resend list can
corrupt the list when the descriptor in question has already been
added. This causes the resend tasklet to loop indefinitely with interrupts
disabled as was recently reported with the Lenovo ThinkPad X13s after
threaded NAPI was disabled in the ath11k WiFi driver.

This bug is easily fixed by restoring the old semantics of irq_sw_resend()
so that it can be called also for descriptors that have already been marked
for resend.

Second, the offending commit also broke software resend of nested
interrupts by simply discarding the code that made sure that such
interrupts are retriggered using the parent interrupt.

Add back the corresponding code that adds the parent descriptor to the
resend list.

Fixes: bc06a9e087 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/
Link: https://lore.kernel.org/r/20230826154004.1417-1-johan+linaro@kernel.org
2023-08-26 19:14:31 +02:00
Jakub Kicinski bebfbf07c7 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZOjkTAAKCRDbK58LschI
 gx32AP9gaaHFBtOYBfoenKTJfMgv1WhtQHIBas+WN9ItmBx9MAEA4gm/VyQ6oD7O
 EBjJKJQ2CZ/QKw7cNacXw+l5jF7/+Q0=
 =8P7g
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-08-25

We've added 87 non-merge commits during the last 8 day(s) which contain
a total of 104 files changed, 3719 insertions(+), 4212 deletions(-).

The main changes are:

1) Add multi uprobe BPF links for attaching multiple uprobes
   and usdt probes, which is significantly faster and saves extra fds,
   from Jiri Olsa.

2) Add support BPF cpu v4 instructions for arm64 JIT compiler,
   from Xu Kuohai.

3) Add support BPF cpu v4 instructions for riscv64 JIT compiler,
   from Pu Lehui.

4) Fix LWT BPF xmit hooks wrt their return values where propagating
   the result from skb_do_redirect() would trigger a use-after-free,
   from Yan Zhai.

5) Fix a BPF verifier issue related to bpf_kptr_xchg() with local kptr
   where the map's value kptr type and locally allocated obj type
   mismatch, from Yonghong Song.

6) Fix BPF verifier's check_func_arg_reg_off() function wrt graph
   root/node which bypassed reg->off == 0 enforcement,
   from Kumar Kartikeya Dwivedi.

7) Lift BPF verifier restriction in networking BPF programs to treat
   comparison of packet pointers not as a pointer leak,
   from Yafang Shao.

8) Remove unmaintained XDP BPF samples as they are maintained
   in xdp-tools repository out of tree, from Toke Høiland-Jørgensen.

9) Batch of fixes for the tracing programs from BPF samples in order
   to make them more libbpf-aware, from Daniel T. Lee.

10) Fix a libbpf signedness determination bug in the CO-RE relocation
    handling logic, from Andrii Nakryiko.

11) Extend libbpf to support CO-RE kfunc relocations. Also follow-up
    fixes for bpf_refcount shared ownership implementation,
    both from Dave Marchevsky.

12) Add a new bpf_object__unpin() API function to libbpf,
    from Daniel Xu.

13) Fix a memory leak in libbpf to also free btf_vmlinux
    when the bpf_object gets closed, from Hao Luo.

14) Small error output improvements to test_bpf module, from Helge Deller.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (87 commits)
  selftests/bpf: Add tests for rbtree API interaction in sleepable progs
  bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
  bpf: Consider non-owning refs to refcounted nodes RCU protected
  bpf: Reenable bpf_refcount_acquire
  bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
  bpf: Consider non-owning refs trusted
  bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
  selftests/bpf: Enable cpu v4 tests for RV64
  riscv, bpf: Support unconditional bswap insn
  riscv, bpf: Support signed div/mod insns
  riscv, bpf: Support 32-bit offset jmp insn
  riscv, bpf: Support sign-extension mov insns
  riscv, bpf: Support sign-extension load insns
  riscv, bpf: Fix missing exception handling and redundant zext for LDX_B/H/W
  samples/bpf: Add note to README about the XDP utilities moved to xdp-tools
  samples/bpf: Cleanup .gitignore
  samples/bpf: Remove the xdp_sample_pkts utility
  samples/bpf: Remove the xdp1 and xdp2 utilities
  samples/bpf: Remove the xdp_rxq_info utility
  samples/bpf: Remove the xdp_redirect* utilities
  ...
====================

Link: https://lore.kernel.org/r/20230825194319.12727-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-25 18:40:15 -07:00
Yonghong Song 76903a9648 kallsyms: Change func signature for cleanup_symbol_name()
All users of cleanup_symbol_name() do not use the return value.
So let us change the return value of cleanup_symbol_name() to
'void' to reflect its usage pattern.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825202036.441212-1-yonghong.song@linux.dev
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-25 15:00:36 -07:00
Rafael J. Wysocki 6a0b211f8b Merge branches 'pm-sleep', 'pm-qos' and 'powercap'
Merge system-wide power management changes and power capping updates
for 6.6-rc1:

 - Add device PM helpers to allow a device to remain powered-on during
   system-wide transitions (Ulf Hansson).

 - Rework hibernation memory snapshotting to avoid storing pages filled
   with zeros in hibernation image files (Brian Geffon).

 - Add check to make sure that CPU latency QoS constraints do not use
   negative values (Clive Lin).

 - Optimize rp->domains memory allocation in the Intel RAPL power
   capping driver (xiongxin).

 - Remove recursion while parsing zones in the arm_scmi power capping
   driver (Cristian Marussi).

* pm-sleep:
  PM: sleep: Add helpers to allow a device to remain powered-on
  PM: hibernate: don't store zero pages in the image file

* pm-qos:
  PM: QoS: Add check to make sure CPU latency is non-negative

* powercap:
  powercap: intel_rapl: Optimize rp->domains memory allocation
  powercap: arm_scmi: Remove recursion while parsing zones
2023-08-25 21:23:30 +02:00
Yonghong Song 33f0467fe0 kallsyms: Fix kallsyms_selftest failure
Kernel test robot reported a kallsyms_test failure when clang lto is
enabled (thin or full) and CONFIG_KALLSYMS_SELFTEST is also enabled.
I can reproduce in my local environment with the following error message
with thin lto:
  [    1.877897] kallsyms_selftest: Test for 1750th symbol failed: (tsc_cs_mark_unstable) addr=ffffffff81038090
  [    1.877901] kallsyms_selftest: abort

It appears that commit 8cc32a9bbf ("kallsyms: strip LTO-only suffixes
from promoted global functions") caused the failure. Commit 8cc32a9bbf
changed cleanup_symbol_name() based on ".llvm." instead of '.' where
".llvm." is appended to a before-lto-optimization local symbol name.
We need to propagate such knowledge in kallsyms_selftest.c as well.

Further more, compare_symbol_name() in kallsyms.c needs change as well.
In scripts/kallsyms.c, kallsyms_names and kallsyms_seqs_of_names are used
to record symbol names themselves and index to symbol names respectively.
For example:
  kallsyms_names:
    ...
    __amd_smn_rw._entry       <== seq 1000
    __amd_smn_rw._entry.5     <== seq 1001
    __amd_smn_rw.llvm.<hash>  <== seq 1002
    ...

kallsyms_seqs_of_names are sorted based on cleanup_symbol_name() through, so
the order in kallsyms_seqs_of_names actually has

  index 1000:   seq 1002   <== __amd_smn_rw.llvm.<hash> (actual symbol comparison using '__amd_smn_rw')
  index 1001:   seq 1000   <== __amd_smn_rw._entry
  index 1002:   seq 1001   <== __amd_smn_rw._entry.5

Let us say at a particular point, at index 1000, symbol '__amd_smn_rw.llvm.<hash>'
is comparing to '__amd_smn_rw._entry' where '__amd_smn_rw._entry' is the one to
search e.g., with function kallsyms_on_each_match_symbol(). The current implementation
will find out '__amd_smn_rw._entry' is less than '__amd_smn_rw.llvm.<hash>' and
then continue to search e.g., index 999 and never found a match although the actual
index 1001 is a match.

To fix this issue, let us do cleanup_symbol_name() first and then do comparison.
In the above case, comparing '__amd_smn_rw' vs '__amd_smn_rw._entry' and
'__amd_smn_rw._entry' being greater than '__amd_smn_rw', the next comparison will
be > index 1000 and eventually index 1001 will be hit an a match is found.

For any symbols not having '.llvm.' substr, there is no functionality change
for compare_symbol_name().

Fixes: 8cc32a9bbf ("kallsyms: strip LTO-only suffixes from promoted global functions")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308232200.1c932a90-oliver.sang@intel.com
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Song Liu <song@kernel.org>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/r/20230825034659.1037627-1-yonghong.song@linux.dev
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-25 10:44:20 -07:00
Dave Marchevsky 5861d1e8db bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
Commit 9e7a4d9831 ("bpf: Allow LSM programs to use bpf spin locks")
disabled bpf_spin_lock usage in sleepable progs, stating:

 Sleepable LSM programs can be preempted which means that allowng spin
 locks will need more work (disabling preemption and the verifier
 ensuring that no sleepable helpers are called when a spin lock is
 held).

This patch disables preemption before grabbing bpf_spin_lock. The second
requirement above "no sleepable helpers are called when a spin lock is
held" is implicitly enforced by current verifier logic due to helper
calls in spin_lock CS being disabled except for a few exceptions, none
of which sleep.

Due to above preemption changes, bpf_spin_lock CS can also be considered
a RCU CS, so verifier's in_rcu_cs check is modified to account for this.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:17 -07:00
Dave Marchevsky 0816b8c6bf bpf: Consider non-owning refs to refcounted nodes RCU protected
An earlier patch in the series ensures that the underlying memory of
nodes with bpf_refcount - which can have multiple owners - is not reused
until RCU grace period has elapsed. This prevents
use-after-free with non-owning references that may point to
recently-freed memory. While RCU read lock is held, it's safe to
dereference such a non-owning ref, as by definition RCU GP couldn't have
elapsed and therefore underlying memory couldn't have been reused.

From the perspective of verifier "trustedness" non-owning refs to
refcounted nodes are now trusted only in RCU CS and therefore should no
longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them
MEM_RCU in order to reflect this new state.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky ba2464c86f bpf: Reenable bpf_refcount_acquire
Now that all reported issues are fixed, bpf_refcount_acquire can be
turned back on. Also reenable all bpf_refcount-related tests which were
disabled.

This a revert of:
 * commit f3514a5d67 ("selftests/bpf: Disable newly-added 'owner' field test until refcount re-enabled")
 * commit 7deca5eae8 ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed")

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky 7e26cd12ad bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
This is the final fix for the use-after-free scenario described in
commit 7793fc3bab ("bpf: Make bpf_refcount_acquire fallible for
non-owning refs"). That commit, by virtue of changing
bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
the "refcount incr on 0" splat. The not_zero check in
refcount_inc_not_zero, though, still occurs on memory that could have
been free'd and reused, so the commit didn't properly fix the root
cause.

This patch actually fixes the issue by free'ing using the recently-added
bpf_mem_free_rcu, which ensures that the memory is not reused until
RCU grace period has elapsed. If that has happened then
there are no non-owning references alive that point to the
recently-free'd memory, so it can be safely reused.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky f0d991a070 bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
It's straightforward to prove that kptr_struct_meta must be non-NULL for
any valid call to these kfuncs:

  * btf_parse_struct_metas in btf.c creates a btf_struct_meta for any
    struct in user BTF with a special field (e.g. bpf_refcount,
    {rb,list}_node). These are stored in that BTF's struct_meta_tab.

  * __process_kf_arg_ptr_to_graph_node in verifier.c ensures that nodes
    have {rb,list}_node field and that it's at the correct offset.
    Similarly, check_kfunc_args ensures bpf_refcount field existence for
    node param to bpf_refcount_acquire.

  * So a btf_struct_meta must have been created for the struct type of
    node param to these kfuncs

  * That BTF and its struct_meta_tab are guaranteed to still be around.
    Any arbitrary {rb,list} node the BPF program interacts with either:
    came from bpf_obj_new or a collection removal kfunc in the same
    program, in which case the BTF is associated with the program and
    still around; or came from bpf_kptr_xchg, in which case the BTF was
    associated with the map and is still around

Instead of silently continuing with NULL struct_meta, which caused
confusing bugs such as those addressed by commit 2140a6e342 ("bpf: Set
kptr_struct_meta for node param to list and rbtree insert funcs"), let's
error out. Then, at runtime, we can confidently say that the
implementations of these kfuncs were given a non-NULL kptr_struct_meta,
meaning that special-field-specific functionality like
bpf_obj_free_fields and the bpf_obj_drop change introduced later in this
series are guaranteed to execute.

This patch doesn't change functionality, just makes it easier to reason
about existing functionality.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Mateusz Guzik 14ef95be6f kernel/fork: group allocation/free of per-cpu counters for mm struct
A trivial execve scalability test which tries to be very friendly
(statically linked binaries, all separate) is predominantly bottlenecked
by back-to-back per-cpu counter allocations which serialize on global
locks.

Ease the pain by allocating and freeing them in one go.

Bench can be found here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

Even at a very modest scale of 26 cores (ops/s):
before:	133543.63
after:	186061.81 (+39%)

While with the patch these allocations remain a significant problem,
the primary bottleneck shifts to page release handling.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20230823050609.2228718-3-mjguzik@gmail.com
[Dennis: reflowed 1 line]
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-08-25 08:10:35 -07:00
Eric DeVolder a396d0f81b crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr which
describes the CPUs and memory in the system for the crash kernel.  In
particular, it writes out ELF PT_NOTEs for memory regions and the CPUs in
the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed, the
elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
 its state in its crash_notes. When all CPUs are shutdown, the
 crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
 use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
 PT_NOTEs to craft a nr_cpus variable, which is reported in a
 header but otherwise generally unused. Makedumpfile creates the
 vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
 PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
 symbols and directly examines those structure contents from vmcore
 memory. From that information it is able to determine which CPUs
 are present and online, and locate the corresponding crash_notes.
 Said differently, it appears that 'crash' analyzer does not rely
 on the ELF PT_NOTEs for CPUs; rather it obtains the information
 directly via kernel symbols and the memory within the vmcore.

(There maybe other vmcore generating and analysis tools that do use these
PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most common
solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr
on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE
for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above is
valid.  On systems where kexec_load() syscall is utilized, there may be
the need for the elfcorehdr to be regenerated once.  The reason being that
some archs only populate the 'present' CPUs from the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility uses
to generate the userspace-supplied elfcorehdr.  In this situation, one
memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will be
described, just as with kexec_file_load() syscall.

Link: https://lkml.kernel.org/r/20230814214446.6659-8-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:14 -07:00
Eric DeVolder a72bbec70d crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the userspace
kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a subsequent
hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites
it to reflect the hotplug change.  That is the desired outcome, however,
at kernel panic time, the purgatory integrity check fails (because the
elfcorehdr changed), and the capture kernel does not boot and no vmcore is
generated.

Therefore, the userspace kexec-tools/kexec must indicate to the kernel
that the elfcorehdr can be modified (because the kexec excluded the
elfcorehdr from the digest, and sized the elfcorehdr memory buffer
appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - The sysfs crash_hotplug nodes (ie.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not. The proposed udev
   rule change looks like:
   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

 Kernel |Kexec
 -------+-----+----
 Old    |Old  |New
        |  a  | a
 -------+-----+----
 New    |  a  | b
 -------+-----+----

where kexec 'old' and 'new' delineate kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and 'new'
delineate the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump image. 
For the kexec 'old' column, the unload-then-reload occurs due to the
missing flag KEXEC_UPDATE_ELFCOREHDR.  An 'old' kernel (with 'new' kexec)
does not present the crash_hotplug sysfs node, which leads to the
unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload of
the kdump image.

If the udev rule is not updated with crash_hotplug node check, then no
matter any combination of kernel or kexec is new or old, the kdump image
continues to be unload-then-reload on hotplug changes.

To fully support crash hotplug feature, there needs to be a rollout of
kernel, kexec-tools and udev rule changes.  However, the order of the
rollout of these pieces does not matter; kexec_load()'d kdump images still
function for hotplug as-is.

Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:14 -07:00
Eric DeVolder f7cc804a9f kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (ie crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory.  For those
architectures that utilize purgatory, a hash digest of the segments is
calculated for integrity checking.  The digest is embedded into the
purgatory image prior to placing in memory.

Updates to the elfcorehdr in response to CPU and memory changes would
cause the purgatory integrity checking to fail (at crash time, and no
vmcore created).  Therefore, the elfcorehdr segment is explicitly excluded
from the purgatory digest, enabling updates to the elfcorehdr while also
avoiding the need to recompute the hash digest and reload purgatory.

Link: https://lkml.kernel.org/r/20230814214446.6659-4-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:13 -07:00
Eric DeVolder 2472627561 crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (eg.  hot un/plug or off/ onlining).
The crash elfcorehdr describes the CPUs and memory to be written into the
vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN).  The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to other
cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. 
CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a
new state for crash hotplug.  Also, CPUHP_BP_PREPARE_DYN is the last state
in the PREPARE group, just prior to the STARTING group, which is very
close to the CPU starting up in a plug/online situation, or stopping in a
unplug/ offline situation.  This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the elfcorehdr
would be inaccurate.  Note that for a CPU being unplugged or offlined, the
CPU will still be present in the list of CPUs generated by
crash_prepare_elf64_headers().  However, there is no need to explicitly
omit the CPU, see justification in 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the memblock
MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory.  During the process,
the kexec_lock is held.

Link: https://lkml.kernel.org/r/20230814214446.6659-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:13 -07:00
Eric DeVolder 6f991cc363 crash: move a few code bits to setup support of crash hotplug
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also be
updated.

The elfcorehdr describes to kdump the CPUs and memory in the system, and
any inaccuracies can result in a vmcore with missing CPU context or memory
regions.

The current solution utilizes udev to initiate an unload-then-reload of
the kdump image (eg.  kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility.  In the original post I
outlined the significant performance problems related to offloading this
activity to userspace.

This patchset introduces a generic crash handler that registers with the
CPU and memory notifiers.  Upon CPU or memory changes, from either hot
un/plug or off/onlining, this generic handler is invoked and performs
important housekeeping, for example obtaining the appropriate lock, and
then invokes an architecture specific handler to do the appropriate
elfcorehdr update.

Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.

In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement with
userspace needed.

To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:

 - Prevent udev from updating kdump crash kernel on hot un/plug changes.
   Add the following as the first lines to the RHEL udev rule file
   /usr/lib/udev/rules.d/98-kexec.rules:

   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules evaluate to false for
   CPU and memory change events and thus skip the userspace
   unload-then-reload of kdump.

 - Change to the kexec_file_load for loading the kdump kernel:
   Eg. on RHEL: in /usr/bin/kdumpctl, change to:
    standard_kexec_args="-p -d -s"
   which adds the -s to select kexec_file_load() syscall.

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility.  A working changeset to the kexec userspace utility is
posted to the kexec-tools mailing list here:

 http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch, apply, build and install kexec-tools, then
change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug.  The removal of -s reverts to the kexec_load syscall and the
addition of --hotplug invokes the changes put forth in the kexec-tools
patch.


This patch (of 8):

The crash hotplug support leans on the work for the kexec_file_load()
syscall.  To also support the kexec_load() syscall, a few bits of code
need to be move outside of CONFIG_KEXEC_FILE.  As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.

In addition, struct crash_mem and crash_notes were moved to new locales so
that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.

No functionality change intended.

Link: https://lkml.kernel.org/r/20230814214446.6659-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:13 -07:00
Kees Cook 33c24bee4b kallsyms: Add more debug output for selftest
While debugging a recent kallsyms_selftest failure[1], I needed more
details on what specifically was failing. This adds those details for
each failure state that is checked.

[1] https://lore.kernel.org/all/202308232200.1c932a90-oliver.sang@intel.com/

Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Yonghong Song <yhs@meta.com>
Cc: "Erhard F." <erhard_f@mailbox.org>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-24 14:30:50 -07:00
Yonghong Song 393dc4bd92 bpf: Remove a WARN_ON_ONCE warning related to local kptr
Currently, in function bpf_obj_free_fields(), for local kptr,
a warning will be issued if the struct does not contain any
special fields. But actually the kernel seems totally okay
with a local kptr without any special fields. Permitting
no special fields also aligns with future percpu kptr which
also allows no special fields.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230824063417.201925-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-24 08:15:16 -07:00
Yafang Shao d75e30dddf bpf: Fix issue in verifying allow_ptr_leaks
After we converted the capabilities of our networking-bpf program from
cap_sys_admin to cap_net_admin+cap_bpf, our networking-bpf program
failed to start. Because it failed the bpf verifier, and the error log
is "R3 pointer comparison prohibited".

A simple reproducer as follows,

SEC("cls-ingress")
int ingress(struct __sk_buff *skb)
{
	struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);

	if ((long)(iph + 1) > (long)skb->data_end)
		return TC_ACT_STOLEN;
	return TC_ACT_OK;
}

Per discussion with Yonghong and Alexei [1], comparison of two packet
pointers is not a pointer leak. This patch fixes it.

Our local kernel is 6.1.y and we expect this fix to be backported to
6.1.y, so stable is CCed.

[1]. https://lore.kernel.org/bpf/CAADnVQ+Nmspr7Si+pxWn8zkE7hX-7s93ugwC+94aXSy4uQ9vBg@mail.gmail.com/

Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230823020703.3790-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-23 09:37:29 -07:00
Mark Rutland 1dfe3a5a7c entry: Remove empty addr_limit_user_check()
Back when set_fs() was a generic API for altering the address limit,
addr_limit_user_check() was a safety measure to prevent userspace being
able to issue syscalls with an unbound limit.

With the the removal of set_fs() as a generic API, the last user of
addr_limit_user_check() was removed in commit:

  b5a5a01d8e ("arm64: uaccess: remove addr_limit_user_check()")

... as since that commit, no architecture defines TIF_FSCHECK, and hence
addr_limit_user_check() always expands to nothing.

Remove addr_limit_user_check(), updating the comment in
exit_to_user_mode_prepare() to no longer refer to it. At the same time,
the comment is reworded to be a little more generic so as to cover
kmap_assert_nomap() in addition to lockdep_sys_exit().

No functional change.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230821163526.2319443-1-mark.rutland@arm.com
2023-08-23 10:32:39 +02:00
Masami Hiramatsu (Google) 08c9306fc2 tracing/fprobe-event: Assume fprobe is a return event by $retval
Assume the fprobe event is a return event if there is $retval is
used in the probe's argument without %return. e.g.

echo 'f:myevent vfs_read $retval' >> dynamic_events

then 'myevent' is a return probe event.

Link: https://lore.kernel.org/all/169272160261.160970.13613040161560998787.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:41:32 +09:00
Masami Hiramatsu (Google) 27973e5c64 tracing/probes: Add string type check with BTF
Add a string type checking with BTF information if possible.
This will check whether the given BTF argument (and field) is
signed char array or pointer to signed char. If not, it reject
the 'string' type. If it is pointer to signed char, it adds
a dereference opration so that it can correctly fetch the
string data from memory.

 # echo 'f getname_flags%return retval->name:string' >> dynamic_events
 # echo 't sched_switch next->comm:string' >> dynamic_events

The above cases, 'struct filename::name' is 'char *' and
'struct task_struct::comm' is 'char []'. But in both case,
user can specify ':string' to fetch the string data.

Link: https://lore.kernel.org/all/169272159250.160970.1881112937198526188.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:41:13 +09:00
Masami Hiramatsu (Google) d157d76944 tracing/probes: Support BTF field access from $retval
Support BTF argument on '$retval' for function return events including
kretprobe and fprobe for accessing the return value.
This also allows user to access its fields if the return value is a
pointer of a data structure.

E.g.
 # echo 'f getname_flags%return +0($retval->name):string' \
   > dynamic_events
 # echo 1 > events/fprobes/getname_flags__exit/enable
 # ls > /dev/null
 # head -n 40 trace | tail
              ls-87      [000] ...1.  8067.616101: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./function_profile_enabled"
              ls-87      [000] ...1.  8067.616108: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./trace_stat"
              ls-87      [000] ...1.  8067.616115: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_notrace"
              ls-87      [000] ...1.  8067.616122: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_function"
              ls-87      [000] ...1.  8067.616129: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace"
              ls-87      [000] ...1.  8067.616135: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_filter"
              ls-87      [000] ...1.  8067.616143: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./touched_functions"
              ls-87      [000] ...1.  8067.616237: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./enabled_functions"
              ls-87      [000] ...1.  8067.616245: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./available_filter_functions"
              ls-87      [000] ...1.  8067.616253: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace_pid"


Link: https://lore.kernel.org/all/169272158234.160970.2446691104240645205.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:40:51 +09:00
Masami Hiramatsu (Google) c440adfbe3 tracing/probes: Support BTF based data structure field access
Using BTF to access the fields of a data structure. You can use this
for accessing the field with '->' or '.' operation with BTF argument.

 # echo 't sched_switch next=next->pid vruntime=next->se.vruntime' \
   > dynamic_events
 # echo 1 > events/tracepoints/sched_switch/enable
 # head -n 40 trace | tail
          <idle>-0       [000] d..3.   272.565382: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956533179
      kcompactd0-26      [000] d..3.   272.565406: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
          <idle>-0       [000] d..3.   273.069441: sched_switch: (__probestub_sched_switch+0x4/0x10) next=9 vruntime=956533179
     kworker/0:1-9       [000] d..3.   273.069464: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956579181
      kcompactd0-26      [000] d..3.   273.069480: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
          <idle>-0       [000] d..3.   273.141434: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956533179
    kworker/u2:1-22      [000] d..3.   273.141461: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
          <idle>-0       [000] d..3.   273.480872: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956585857
    kworker/u2:1-22      [000] d..3.   273.480905: sched_switch: (__probestub_sched_switch+0x4/0x10) next=70 vruntime=959533179
              sh-70      [000] d..3.   273.481102: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0

Link: https://lore.kernel.org/all/169272157251.160970.9318175874130965571.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:40:28 +09:00
Masami Hiramatsu (Google) 302db0f5b3 tracing/probes: Add a function to search a member of a struct/union
Add btf_find_struct_member() API to search a member of a given data structure
or union from the member's name.

Link: https://lore.kernel.org/all/169272156248.160970.8868479822371129043.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:40:16 +09:00
Masami Hiramatsu (Google) ebeed8d4a5 tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
Move generic function-proto find API and getting function parameter API
to BTF library code from trace_probe.c. This will avoid redundant efforts
on different feature.

Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:39:45 +09:00
Masami Hiramatsu (Google) b1d1e90490 tracing/probes: Support BTF argument on module functions
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in
the vmlinux, BTF argument is not available on the functions in the modules.
Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind()
so that BTF argument can find the correct struct btf and btf_type in it.
With this fix, fprobe events can use `$arg*` on module functions as below

 # grep nf_log_ip_packet /proc/kallsyms
ffffffffa0005c00 t nf_log_ip_packet	[nf_log_syslog]
ffffffffa0005bf0 t __pfx_nf_log_ip_packet	[nf_log_syslog]
 # echo 'f nf_log_ip_packet $arg*' > dynamic_events
 # cat dynamic_events
f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix

To support the module's btf which is removable, the struct btf needs to be
ref-counted. So this also records the btf in the traceprobe_parse_context
and returns the refcount when the parse has done.

Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:39:15 +09:00
Chuang Wang f8bbf8b990 tracing/eprobe: Iterate trace_eprobe directly
Refer to the description in [1], we can skip "container_of()" following
"list_for_each_entry()" by using "list_for_each_entry()" with
"struct trace_eprobe" and "tp.list".

Also, this patch defines "for_each_trace_eprobe_tp" to simplify the code
of the same logic.

[1] https://lore.kernel.org/all/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com/

Link: https://lore.kernel.org/all/20230822022433.262478-1-nashuiliang@gmail.com/

Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-08-23 09:38:56 +09:00
Ruan Jinjie 8865aea047 kernel: kprobes: Use struct_size()
Use struct_size() instead of hand-writing it, when allocating a structure
with a flex array.

This is less verbose.

Link: https://lore.kernel.org/all/20230725195424.3469242-1-ruanjinjie@huawei.com/

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-08-23 09:38:17 +09:00
Kumar Kartikeya Dwivedi 6785b2edf4 bpf: Fix check_func_arg_reg_off bug for graph root/node
The commit being fixed introduced a hunk into check_func_arg_reg_off
that bypasses reg->off == 0 enforcement when offset points to a graph
node or root. This might possibly be done for treating bpf_rbtree_remove
and others as KF_RELEASE and then later check correct reg->off in helper
argument checks.

But this is not the case, those helpers are already not KF_RELEASE and
permit non-zero reg->off and verify it later to match the subobject in
BTF type.

However, this logic leads to bpf_obj_drop permitting free of register
arguments with non-zero offset when they point to a graph root or node
within them, which is not ok.

For instance:

struct foo {
	int i;
	int j;
	struct bpf_rb_node node;
};

struct foo *f = bpf_obj_new(typeof(*f));
if (!f) ...
bpf_obj_drop(f); // OK
bpf_obj_drop(&f->i); // still ok from verifier PoV
bpf_obj_drop(&f->node); // Not OK, but permitted right now

Fix this by dropping the whole part of code altogether.

Fixes: 6a3cd3318f ("bpf: Migrate release_on_unlock logic to non-owning ref semantics")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822175140.1317749-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-22 12:52:48 -07:00
Clive Lin 5f55836ab4 PM: QoS: Add check to make sure CPU latency is non-negative
CPU latency should never be negative, which will be incorrectly high
when converted to unsigned data type.

Commit 8d36694245 ("PM: QoS: Add check to make sure CPU freq is
non-negative") makes sure CPU frequency is non-negative to fix incorrect
behavior in freqency QoS.

Add an analogous check to make sure CPU latency is non-negative so as to
prevent this problem from happening in CPU latency QoS.

Signed-off-by: Clive Lin <clive.lin@mediatek.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-22 21:37:29 +02:00
Yonghong Song ab6c637ad0 bpf: Fix a bpf_kptr_xchg() issue with local kptr
When reviewing local percpu kptr support, Alexei discovered a bug
wherea bpf_kptr_xchg() may succeed even if the map value kptr type and
locally allocated obj type do not match ([1]). Missed struct btf_id
comparison is the reason for the bug. This patch added such struct btf_id
comparison and will flag verification failure if types do not match.

  [1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t

Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 738c96d5e2 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822050053.2886960-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-22 09:43:55 -07:00
Eric Vaughn a943188dab tracing/user_events: Optimize safe list traversals
Several of the list traversals in the user_events facility use safe list
traversals where they could be using the unsafe versions instead.

Replace these safe traversals with their unsafe counterparts in the
interest of optimization.

Link: https://lore.kernel.org/linux-trace-kernel/20230810194337.695983-1-ervaughn@linux.microsoft.com

Suggested-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Eric Vaughn <ervaughn@linux.microsoft.com>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:22:10 -04:00
Yue Haibing efde97a175 tracing: Remove unused function declarations
Commit 9457158bbc ("tracing: Fix reset of time stamps during trace_clock changes")
left behind tracing_reset_current() declaration.
Also commit 6954e41526 ("tracing: Place trace_pid_list logic into abstract functions")
removed trace_free_pid_list() implementation but leave declaration.

Link: https://lore.kernel.org/linux-trace-kernel/20230803144028.25492-1-yuehaibing@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:19:35 -04:00
Valentin Schneider 38c6f68083 tracing/filters: Further optimise scalar vs cpumask comparison
Per the previous commits, we now only enter do_filter_scalar_cpumask() with
a mask of weight greater than one. Optimise the equality checks.

Link: https://lkml.kernel.org/r/20230707172155.70873-9-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:29 -04:00
Valentin Schneider 1cffbe6c62 tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

In this case we can directly re-use filter_pred_cpu(), we just need to
transform '&' into '==' before executing it.

Link: https://lkml.kernel.org/r/20230707172155.70873-8-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:29 -04:00
Valentin Schneider ca77dd8ce4 tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

When the mask contains a single CPU, directly re-use the unsigned field
predicate functions. Transform '&' into '==' beforehand.

Link: https://lkml.kernel.org/r/20230707172155.70873-7-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:29 -04:00
Valentin Schneider fe4fa4ec9b tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

Reuse do_filter_scalar_cpumask() when the input mask has a weight of one.

Link: https://lkml.kernel.org/r/20230707172155.70873-6-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:28 -04:00
Valentin Schneider 347d24fc82 tracing/filters: Enable filtering the CPU common field by a cpumask
The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).

A previous commit added filtering scalar fields by a user-given cpumask,
make this work with the CPU common field as well.

This enables doing things like

$ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
		   -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'

Link: https://lkml.kernel.org/r/20230707172155.70873-5-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:28 -04:00
Valentin Schneider 3cbec9d7b9 tracing/filters: Enable filtering a scalar field by a cpumask
Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...

Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.

Allow these to be filtered by a user-provided cpumask, which enables e.g.:

$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/20230707172155.70873-4-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:28 -04:00
Valentin Schneider 39f7c41c90 tracing/filters: Enable filtering a cpumask field by another cpumask
The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.

Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.

The user-provided cpumask is to be given in cpulist format and wrapped as:
"CPUS{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of CPUS{...} as a
full-fledged predicate subexpression.

This enables e.g.:

$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/20230707172155.70873-3-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:28 -04:00
Valentin Schneider cfb58e278c tracing/filters: Dynamically allocate filter_pred.regex
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.

A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.

While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.

No change in behaviour intended.

The kfree()'s were patched via Coccinelle:
  @@
  struct filter_pred *pred;
  @@

  -kfree(pred);
  +free_predicate(pred);

Link: https://lkml.kernel.org/r/20230707172155.70873-2-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-22 05:13:28 -04:00
Jiri Olsa 686328d80c bpf: Add bpf_get_func_ip helper support for uprobe link
Adding support for bpf_get_func_ip helper being called from
ebpf program attached by uprobe_multi link.

It returns the ip of the uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa b733eeade4 bpf: Add pid filter support for uprobe_multi link
Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.

Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.

We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).

Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa 0b779b61f6 bpf: Add cookies support for uprobe_multi link
Adding support to specify cookies array for uprobe_multi link.

The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).

The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa 89ae89f53d bpf: Add multi uprobe link
Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.

Uprobes to attach are specified via new link_create uprobe_multi
union:

  struct {
    __aligned_u64   path;
    __aligned_u64   offsets;
    __aligned_u64   ref_ctr_offsets;
    __u32           cnt;
    __u32           flags;
  } uprobe_multi;

Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.

The 'flags' supports single bit for now that marks the uprobe as
return probe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa 3505cb9fa2 bpf: Add attach_type checks under bpf_prog_attach_check_attach_type
Add extra attach_type checks from link_create under
bpf_prog_attach_check_attach_type.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Hou Tao c2e42ddf26 bpf, cpumask: Clean up bpf_cpu_map_entry directly in cpu_map_free
After synchronous_rcu(), both the dettached XDP program and
xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry
will be cpu_map_kthread_run(), so instead of calling
__cpu_map_entry_replace() to stop kthread and cleanup entry after a RCU
grace period, do these things directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:21:16 -07:00
Hou Tao 8f8500a247 bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier()
As for now __cpu_map_entry_replace() uses call_rcu() to wait for the
inflight xdp program to exit the RCU read critical section, and then
launch kworker cpu_map_kthread_stop() to call kthread_stop() to flush
all pending xdp frames or skbs.

But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to
wait for the completion of __cpu_map_entry_free(), because rcu_barrier()
will wait for all pending RCU callbacks and cpu_map_kthread_stop() only
needs to wait for the completion of a specific __cpu_map_entry_free().

So use queue_rcu_work() to replace call_rcu(), schedule_work() and
rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free()
kworker after a RCU grace period. Because __cpu_map_entry_free() is
running in a kworker context, so it is OK to do all of these freeing
procedures include kthread_stop() in it.

After the update, there is no need to do reference-counting for
bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in
__cpu_map_entry_free(), so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:21:16 -07:00
Matthew Wilcox (Oracle) ebc1baf5c9 mm: free up a word in the first tail page
Store the folio order in the low byte of the flags word in the first tail
page.  This frees up the word that was being used to store the order and
dtor bytes previously.

Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 14:28:45 -07:00
Matthew Wilcox (Oracle) de53c05f2a mm: add large_rmappable page flag
Stored in the first tail page's flags, this flag replaces the destructor. 
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.

Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 14:28:44 -07:00
Matthew Wilcox (Oracle) 9c5ccf2db0 mm: remove HUGETLB_PAGE_DTOR
We can use a bit in page[1].flags to indicate that this folio belongs to
hugetlb instead of using a value in page[1].dtors.  That lets
folio_test_hugetlb() become an inline function like it should be.  We can
also get rid of NULL_COMPOUND_DTOR.

Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 14:28:44 -07:00
Randy Dunlap ef815d2cba treewide: drop CONFIG_EMBEDDED
There is only one Kconfig user of CONFIG_EMBEDDED and it can be switched
to EXPERT or "if !ARCH_MULTIPLATFORM" (suggested by Arnd).

Link: https://lkml.kernel.org/r/20230816055010.31534-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>	[RISC-V]
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: Russell King <linux@armlinux.org.uk>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:46:25 -07:00
Helge Deller 0a6b58c5cd lockdep: fix static memory detection even more
On the parisc architecture, lockdep reports for all static objects which
are in the __initdata section (e.g. "setup_done" in devtmpfs,
"kthreadd_done" in init/main.c) this warning:

	INFO: trying to register non-static key.

The warning itself is wrong, because those objects are in the __initdata
section, but the section itself is on parisc outside of range from
_stext to _end, which is why the static_obj() functions returns a wrong
answer.

While fixing this issue, I noticed that the whole existing check can
be simplified a lot.
Instead of checking against the _stext and _end symbols (which include
code areas too) just check for the .data and .bss segments (since we check a
data object). This can be done with the existing is_kernel_core_data()
macro.

In addition objects in the __initdata section can be checked with
init_section_contains(), and is_kernel_rodata() allows keys to be in the
_ro_after_init section.

This partly reverts and simplifies commit bac59d18c7 ("x86/setup: Fix static
memory detection").

Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100
Fixes: bac59d18c7 ("x86/setup: Fix static memory detection")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:46:24 -07:00
Mateusz Guzik a7031f1452 kernel/fork: stop playing lockless games for exe_file replacement
xchg originated in 6e399cd144 ("prctl: avoid using mmap_sem for exe_file
serialization").  While the commit message does not explain *why* the
change, I found the original submission [1] which ultimately claims it
cleans things up by removing dependency of exe_file on the semaphore.

However, fe69d560b5 ("kernel/fork: always deny write access to current
MM exe_file") added a semaphore up/down cycle to synchronize the state of
exe_file against fork, defeating the point of the original change.

This is on top of semaphore trips already present both in the replacing
function and prctl (the only consumer).

Normally replacing exe_file does not happen for busy processes, thus
write-locking is not an impediment to performance in the intended use
case.  If someone keeps invoking the routine for a busy processes they are
trying to play dirty and that's another reason to avoid any trickery.

As such I think the atomic here only adds complexity for no benefit.

Just write-lock around the replacement.

I also note that replacement races against the mapping check loop as
nothing synchronizes actual assignment with with said checks but I am not
addressing it in this patch.  (Is the loop of any use to begin with?)

Link: https://lore.kernel.org/linux-mm/1424979417.10344.14.camel@stgolabs.net/ [1]
Link: https://lkml.kernel.org/r/20230814172140.1777161-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:46:24 -07:00
Aleksa Sarai 9876cfe8ec memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable.  Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it.  A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this.  There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you.  The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time.  This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However it should be noted that now that the setting is completely
hierarchical.  Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces.  Now, the restriction
applies recursively.  This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
Fixes: 105ff5339f ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:59 -07:00
Kefeng Wang 549f5c771e perf/core: use vma_is_initial_stack() and vma_is_initial_heap()
Use the helpers to simplify code, also kill unneeded goto cpy_name.

Link: https://lkml.kernel.org/r/20230728050043.59880-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Göttsche <cgzones@googlemail.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:32 -07:00
Arnd Bergmann 68af05143f kernel/iomem.c: remove __weak ioremap_cache helper
No portable code calls into this function any more, and on architectures
that don't use or define their own, it causes a warning:

kernel/iomem.c:10:22: warning: no previous prototype for 'ioremap_cache' [-Wmissing-prototypes]
   10 | __weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)

Fold it into the only caller that uses it on architectures
without the #define.

Note that the fallback to ioremap is probably still wrong on
those architectures, but this is what it's always done there.

Link: https://lkml.kernel.org/r/20230726145432.1617809-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:28 -07:00
Elena Reshetova 2ddd3cac1f nsproxy: Convert nsproxy.count to refcount_t
atomic_t variables are currently used to implement reference counters
with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows and
underflows. This is important since overflows and underflows can lead
to use-after-free situation and be exploitable.

The variable nsproxy.count is used as pure reference counter. Convert it
to refcount_t and fix up the operations.

**Important note for maintainers:

Some functions from refcount_t API defined in refcount.h have different
memory ordering guarantees than their atomic counterparts. Please check
Documentation/core-api/refcount-vs-atomic.rst for more information.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in some
rare cases it might matter. Please double check that you don't have
some undocumented memory guarantees for this variable usage.

For the nsproxy.count it might make a difference in following places:
 - put_nsproxy() and switch_task_namespaces(): decrement in
   refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE
   ordering on success vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230818041327.gonna.210-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-21 11:29:12 -07:00
Zheng Yejian c2489bb7e6 tracing: Introduce pipe_cpumask to avoid race on trace_pipes
There is race issue when concurrently splice_read main trace_pipe and
per_cpu trace_pipes which will result in data read out being different
from what actually writen.

As suggested by Steven:
  > I believe we should add a ref count to trace_pipe and the per_cpu
  > trace_pipes, where if they are opened, nothing else can read it.
  >
  > Opening trace_pipe locks all per_cpu ref counts, if any of them are
  > open, then the trace_pipe open will fail (and releases any ref counts
  > it had taken).
  >
  > Opening a per_cpu trace_pipe will up the ref count for just that
  > CPU buffer. This will allow multiple tasks to read different per_cpu
  > trace_pipe files, but will prevent the main trace_pipe file from
  > being opened.

But because we only need to know whether per_cpu trace_pipe is open or
not, using a cpumask instead of using ref count may be easier.

After this patch, users will find that:
 - Main trace_pipe can be opened by only one user, and if it is
   opened, all per_cpu trace_pipes cannot be opened;
 - Per_cpu trace_pipes can be opened by multiple users, but each per_cpu
   trace_pipe can only be opened by one user. And if one of them is
   opened, main trace_pipe cannot be opened.

Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com

Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-21 11:17:14 -04:00
Greg Kroah-Hartman 642073c306 Merge commit b320441c04 ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next
We need the serial-core fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-20 14:29:37 +02:00
Jakub Kicinski 7ff57803d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/sfc/tc.c
  fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
  3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18 12:44:56 -07:00
Douglas Anderson 1f38c86bb2 watchdog/hardlockup: avoid large stack frames in watchdog_hardlockup_check()
After commit 77c12fc959 ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") we started storing a `struct cpumask` on the
stack in watchdog_hardlockup_check().  On systems with CONFIG_NR_CPUS set
to 8192 this takes up 1K on the stack.  That triggers warnings with
`CONFIG_FRAME_WARN` set to 1024.

We'll use the new trigger_allbutcpu_cpu_backtrace() to avoid needing to
use a CPU mask at all.

Link: https://lkml.kernel.org/r/20230804065935.v4.2.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid
Fixes: 77c12fc959 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:19:00 -07:00
Douglas Anderson 8d539b84f1 nmi_backtrace: allow excluding an arbitrary CPU
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU.  This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.

Let's extend the API to take a CPU ID to exclude instead of just a
boolean.  This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.

Arguably, this new API also encourages safer behavior.  Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.

[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:19:00 -07:00
Greg Kroah-Hartman be33db2142 kthread: unexport __kthread_should_park()
There are no in-kernel users of __kthread_should_park() so mark it as
static and do not export it.

Link: https://lkml.kernel.org/r/2023080450-handcuff-stump-1d6e@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: John Stultz <jstultz@google.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Zqiang <qiang1.zhang@intel.com>
Cc: Prathu Baronia <quic_pbaronia@quicinc.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:59 -07:00
Arnd Bergmann 29665c1e2a gcov: shut up missing prototype warnings for internal stubs
gcov uses global functions that are called from generated code, but these
have no prototype in a header, which causes a W=1 build warning:

kernel/gcov/gcc_base.c:12:6: error: no previous prototype for '__gcov_init' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:40:6: error: no previous prototype for '__gcov_flush' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:46:6: error: no previous prototype for '__gcov_merge_add' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:52:6: error: no previous prototype for '__gcov_merge_single' [-Werror=missing-prototypes]

Just turn off these warnings unconditionally for the two files that
contain them.

Link: https://lore.kernel.org/all/0820010f-e9dc-779d-7924-49c7df446bce@linux.ibm.com/
Link: https://lkml.kernel.org/r/20230725123042.2269077-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:58 -07:00
Li kunyu 598f0046e9 kernel: relay: remove unnecessary NULL values from relay_open_buf
buf is assigned first, so it does not need to initialize the assignment.

Link: https://lkml.kernel.org/r/20230713234459.2908-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:55 -07:00
Eric DeVolder 95d1fef537 remove ARCH_DEFAULT_KEXEC from Kconfig.kexec
This patch is a minor cleanup to the series "refactor Kconfig to
consolidate KEXEC and CRASH options".

In that series, a new option ARCH_DEFAULT_KEXEC was introduced in order to
obtain the equivalent behavior of s390 original Kconfig settings for
KEXEC.  As it turns out, this new option did not fully provide the
equivalent behavior, rather a "select KEXEC" did.

As such, the ARCH_DEFAULT_KEXEC is not needed anymore, so remove it.

Link: https://lkml.kernel.org/r/20230802161750.2215-1-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:55 -07:00
Eric DeVolder e6265fe777 kexec: rename ARCH_HAS_KEXEC_PURGATORY
The Kconfig refactor to consolidate KEXEC and CRASH options utilized
option names of the form ARCH_SUPPORTS_<option>. Thus rename the
ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow
the same.

Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:54 -07:00
Eric DeVolder 89cde45591 kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
Patch series "refactor Kconfig to consolidate KEXEC and CRASH options", v6.

The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.

The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".

The following options are impacted:

 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_IMAGE_VERIFY_SIG
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.

The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):

 - arm
 - arm64
 - ia64
 - loongarch
 - m68k
 - mips
 - parisc
 - powerpc
 - riscv
 - s390
 - sh
 - x86 

More information:

In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"

 https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devolder@oracle.com/

the new kernel feature introduces the config option CRASH_HOTPLUG.

In reviewing, Thomas Gleixner requested that the new config option
not be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoid/stopped.

 https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/

To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options. Generally speaking, this work has
the following themes:

- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
  - These items from arch/Kconfig:
      CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC
  - These items from arch/x86/Kconfig form the common options:
      KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE
      KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP
  - These items from arch/arm64/Kconfig form the common options:
      KEXEC_IMAGE_VERIFY_SIG
  - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
  and is now listed in "General Setup" submenu from init/Kconfig.
- To control the common options, each has a new ARCH_SUPPORTS_<option>
  option. These gateway options determine whether the common options
  options are valid for the architecture.
- To account for the slight differences in the original architecture
  coding of the common options, each now has a corresponding
  ARCH_SELECTS_<option> which are used to elicit the same side effects
  as the original arch/<arch>/Kconfig files for KEXEC and CRASH options.

An example, 'make menuconfig' illustrating the submenu:

  > General setup > Kexec and crash features
  [*] Enable kexec system call
  [*] Enable kexec file based system call
  [*]   Verify kernel signature during kexec_file_load() syscall
  [ ]     Require a valid signature in kexec_file_load() syscall
  [ ]     Enable bzImage signature verification support
  [*] kexec jump
  [*] kernel crash dumps
  [*]   Update the crash elfcorehdr on system configuration changes

In the process of consolidating the common options, I encountered
slight differences in the coding of these options in several of the
architectures. As a result, I settled on the following solution:

- Each of the common options has a 'depends on ARCH_SUPPORTS_<option>'
  statement. For example, the KEXEC_FILE option has a 'depends on
  ARCH_SUPPORTS_KEXEC_FILE' statement.

  This approach is needed on all common options so as to prevent
  options from appearing for architectures which previously did
  not allow/enable them. For example, arm supports KEXEC but not
  KEXEC_FILE. The arch/arm/Kconfig does not provide
  ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options
  are not available to arm.

- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
  determine when the feature is allowed.  Archs which don't have the
  feature simply do not provide the corresponding ARCH_SUPPORTS_<option>.
  For each arch, where there previously were KEXEC and/or CRASH
  options, these have been replaced with the corresponding boolean
  ARCH_SUPPORTS_<option>, and an appropriate def_bool statement.

  For example, if the arch supports KEXEC_FILE, then the
  ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits
  the KEXEC_FILE option to be available.

  If the arch has a 'depends on' statement in its original coding
  of the option, then that expression becomes part of the def_bool
  expression. For example, arm64 had:

  config KEXEC
    depends on PM_SLEEP_SMP

  and in this solution, this converts to:

  config ARCH_SUPPORTS_KEXEC
    def_bool PM_SLEEP_SMP


- In order to account for the architecture differences in the
  coding for the common options, the ARCH_SELECTS_<option> in the
  arch/<arch>/Kconfig is used. This option has a 'depends on
  <option>' statement to couple it to the main option, and from
  there can insert the differences from the common option and the
  arch original coding of that option.

  For example, a few archs enable CRYPTO and CRYTPO_SHA256 for
  KEXEC_FILE. These require a ARCH_SELECTS_KEXEC_FILE and
  'select CRYPTO' and 'select CRYPTO_SHA256' statements.

Illustrating the option relationships:

For each of the common KEXEC and CRASH options:
 ARCH_SUPPORTS_<option> <- <option> <- ARCH_SELECTS_<option>

 <option>                   # in Kconfig.kexec
 ARCH_SUPPORTS_<option>     # in arch/<arch>/Kconfig, as needed
 ARCH_SELECTS_<option>      # in arch/<arch>/Kconfig, as needed


For example, KEXEC:
 ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC

 KEXEC                      # in Kconfig.kexec
 ARCH_SUPPORTS_KEXEC        # in arch/<arch>/Kconfig, as needed
 ARCH_SELECTS_KEXEC         # in arch/<arch>/Kconfig, as needed


To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Examples:
A few examples to show the new strategy in action:

===== x86 (minus the help section) =====
Original:
 config KEXEC
    bool "kexec system call"
    select KEXEC_CORE

 config KEXEC_FILE
    bool "kexec file based system call"
    select KEXEC_CORE
    select HAVE_IMA_KEXEC if IMA
    depends on X86_64
    depends on CRYPTO=y
    depends on CRYPTO_SHA256=y

 config ARCH_HAS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

 config KEXEC_SIG
    bool "Verify kernel signature during kexec_file_load() syscall"
    depends on KEXEC_FILE

 config KEXEC_SIG_FORCE
    bool "Require a valid signature in kexec_file_load() syscall"
    depends on KEXEC_SIG

 config KEXEC_BZIMAGE_VERIFY_SIG
    bool "Enable bzImage signature verification support"
    depends on KEXEC_SIG
    depends on SIGNED_PE_FILE_VERIFICATION
    select SYSTEM_TRUSTED_KEYRING

 config CRASH_DUMP
    bool "kernel crash dumps"
    depends on X86_64 || (X86_32 && HIGHMEM)

 config KEXEC_JUMP
    bool "kexec jump"
    depends on KEXEC && HIBERNATION
    help

becomes...
New:
config ARCH_SUPPORTS_KEXEC
    def_bool y

config ARCH_SUPPORTS_KEXEC_FILE
    def_bool X86_64 && CRYPTO && CRYPTO_SHA256

config ARCH_SELECTS_KEXEC_FILE
    def_bool y
    depends on KEXEC_FILE
    select HAVE_IMA_KEXEC if IMA

config ARCH_SUPPORTS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config ARCH_SUPPORTS_KEXEC_SIG
    def_bool y

config ARCH_SUPPORTS_KEXEC_SIG_FORCE
    def_bool y

config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
    def_bool y

config ARCH_SUPPORTS_KEXEC_JUMP
    def_bool y

config ARCH_SUPPORTS_CRASH_DUMP
    def_bool X86_64 || (X86_32 && HIGHMEM)


===== powerpc (minus the help section) =====
Original:
 config KEXEC
    bool "kexec system call"
    depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
    select KEXEC_CORE

 config KEXEC_FILE
    bool "kexec file based system call"
    select KEXEC_CORE
    select HAVE_IMA_KEXEC if IMA
    select KEXEC_ELF
    depends on PPC64
    depends on CRYPTO=y
    depends on CRYPTO_SHA256=y

 config ARCH_HAS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

 config CRASH_DUMP
    bool "Build a dump capture kernel"
    depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
    select RELOCATABLE if PPC64 || 44x || PPC_85xx

becomes...
New:
config ARCH_SUPPORTS_KEXEC
    def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)

config ARCH_SUPPORTS_KEXEC_FILE
    def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y

config ARCH_SUPPORTS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config ARCH_SELECTS_KEXEC_FILE
    def_bool y
    depends on KEXEC_FILE
    select KEXEC_ELF
    select HAVE_IMA_KEXEC if IMA

config ARCH_SUPPORTS_CRASH_DUMP
    def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)

config ARCH_SELECTS_CRASH_DUMP
    def_bool y
    depends on CRASH_DUMP
    select RELOCATABLE if PPC64 || 44x || PPC_85xx


Testing Approach and Results

There are 388 config files in the arch/<arch>/configs directories.
For each of these config files, a .config is generated both before and
after this Kconfig series, and checked for equivalence. This approach
allows for a rather rapid check of all architectures and a wide
variety of configs wrt/ KEXEC and CRASH, and avoids requiring
compiling for all architectures and running kernels and run-time
testing.

For each config file, the olddefconfig, allnoconfig and allyesconfig
targets are utilized. In testing the randconfig has revealed problems
as well, but is not used in the before and after equivalence check
since one can not generate the "same" .config for before and after,
even if using the same KCONFIG_SEED since the option list is
different.

As such, the following script steps compare the before and after
of 'make olddefconfig'. The new symbols introduced by this series
are filtered out, but otherwise the config files are PASS only if
they were equivalent, and FAIL otherwise.

The script performs the test by doing the following:

 # Obtain the "golden" .config output for given config file
 # Reset test sandbox
 git checkout master
 git branch -D test_Kconfig
 git checkout -B test_Kconfig master
 make distclean
 # Write out updated config
 cp -f <config file> .config
 make ARCH=<arch> olddefconfig
 # Track each item in .config, LHSB is "golden"
 scoreboard .config 

 # Obtain the "changed" .config output for given config file
 # Reset test sandbox
 make distclean
 # Apply this Kconfig series
 git am <this Kconfig series>
 # Write out updated config
 cp -f <config file> .config
 make ARCH=<arch> olddefconfig
 # Track each item in .config, RHSB is "changed"
 scoreboard .config 

 # Determine test result
 # Filter-out new symbols introduced by this series
 # Filter-out symbol=n which not in either scoreboard
 # Compare LHSB "golden" and RHSB "changed" scoreboards and issue PASS/FAIL

The script was instrumental during the refactoring of Kconfig as it
continually revealed problems. The end result being that the solution
presented in this series passes all configs as checked by the script,
with the following exceptions:

- arch/ia64/configs/zx1_config with olddefconfig
  This config file has:
  # CONFIG_KEXEC is not set
  CONFIG_CRASH_DUMP=y
  and this refactor now couples KEXEC to CRASH_DUMP, so it is not
  possible to enable CRASH_DUMP without KEXEC.

- arch/sh/configs/* with allyesconfig
  The arch/sh/Kconfig codes CRASH_DUMP as dependent upon BROKEN_ON_MMU
  (which clearly is not meant to be set). This symbol is not provided
  but with the allyesconfig it is set to yes which enables CRASH_DUMP.
  But KEXEC is coded as dependent upon MMU, and is set to no in
  arch/sh/mm/Kconfig, so KEXEC is not enabled.
  This refactor now couples KEXEC to CRASH_DUMP, so it is not
  possible to enable CRASH_DUMP without KEXEC.

While the above exceptions are not equivalent to their original,
the config file produced is valid (and in fact better wrt/ CRASH_DUMP
handling).


This patch (of 14)

The config options for kexec and crash features are consolidated
into new file kernel/Kconfig.kexec. Under the "General Setup" submenu
is a new submenu "Kexec and crash handling". All the kexec and
crash options that were once in the arch-dependent submenu "Processor
type and features" are now consolidated in the new submenu.

The following options are impacted:

 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config
options to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Link: https://lkml.kernel.org/r/20230712161545.87870-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230712161545.87870-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Cc. "H. Peter Anvin" <hpa@zytor.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marc Aurèle La France <tsi@tuyoix.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Li <xin3.li@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:51 -07:00
Azeem Shaikh 4264be505d acct: replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.  This read may exceed the
destination size limit.  This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1].  In an effort
to remove strlcpy() completely [2], replace strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Link: https://lkml.kernel.org/r/20230710011748.3538624-1-azeemshaikh38@gmail.com
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:51 -07:00
Vincent Whitchurch b0b88e02f0 signal: print comm and exe name on fatal signals
Make the print-fatal-signals message more useful by printing the comm
and the exe name for the process which received the fatal signal:

Before:

 potentially unexpected fatal signal 4
 potentially unexpected fatal signal 11

After:

 buggy-program: pool: potentially unexpected fatal signal 4
 some-daemon: gdbus: potentially unexpected fatal signal 11

comm used to be present but was removed in commit 681a90ffe8
("arc, print-fatal-signals: reduce duplicated information") because it's
also included as part of the later stack trace.  Having the comm as part
of the main "unexpected fatal..." print is rather useful though when
analysing logs, and the exe name is also valuable as shown in the
examples above where the comm ends up having some generic name like
"pool".

[akpm@linux-foundation.org: don't include linux/file.h twice]
Link: https://lkml.kernel.org/r/20230707-fatal-comm-v1-1-400363905d5e@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:50 -07:00
tiozhang 4099451ac2 cred: convert printks to pr_<level>
Use current logging style.

Link: https://lkml.kernel.org/r/20230625033452.GA22858@didi-ThinkCentre-M930t-N000
Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Paulo Alcantara <pc@cjr.nz>
Cc: Weiping Zhang <zwp10758@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:18:49 -07:00
Alistair Popple ec8832d007 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions.  Therefore there is no need to explicitly notify
or invalidate as part of the range end functions.  This means we can
remove mmu_notifier_invalidate_range_end_only() and some of the
ptep_*_notify() functions.

Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:41 -07:00
Baoquan He 016fec9101 mm: move is_ioremap_addr() into new header file
Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in
mm/ioremap.c.  Move it into its own new header file linux/ioremap.h.

Link: https://lkml.kernel.org/r/20230706154520.11257-17-bhe@redhat.com
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:35 -07:00
Miaohe Lin 3fade62b62 mm/mm_init.c: remove obsolete macro HASH_SMALL
HASH_SMALL only works when parameter numentries is 0. But the sole caller
futex_init() never calls alloc_large_system_hash() with numentries set to
0. So HASH_SMALL is obsolete and remove it.

Link: https://lkml.kernel.org/r/20230625021323.849147-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:07 -07:00
Kefeng Wang 527ed4f7d9 mm: remove arguments of show_mem()
All callers of show_mem() pass 0 and NULL, so we can remove the two
arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in
show_mem().

Link: https://lkml.kernel.org/r/20230630062253.189440-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:02 -07:00
Gustavo A. R. Silva 78d44b824e cgroup: Avoid -Wstringop-overflow warnings
Change the notation from pointer-to-array to pointer-to-pointer.
With this, we avoid the compiler complaining about trying
to access a region of size zero as an argument during function
calls.

This is a workaround to prevent the compiler complaining about
accessing an array of size zero when evaluating the arguments
of a couple of function calls. See below:

kernel/cgroup/cgroup.c: In function 'find_css_set':
kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
 1206 |         cset = find_existing_css_set(old_cset, cgrp, template);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state *[0]'
kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set'
 1071 | static struct css_set *find_existing_css_set(struct css_set *old_cset,
      |                        ^~~~~~~~~~~~~~~~~~~~~

With the change to pointer-to-pointer, the functions are not prevented
from being executed, and they will do what they have to do when
CGROUP_SUBSYS_COUNT == 0.

Address the following -Wstringop-overflow warnings seen when
built with ARM architecture and aspeed_g4_defconfig configuration
(notice that under this configuration CGROUP_SUBSYS_COUNT == 0):

kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]

This results in no differences in binary output.

Link: https://github.com/KSPP/linux/issues/316
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-17 11:55:05 -10:00
Kees Cook 46822860a5 seccomp: Add missing kerndoc notations
The kerndoc for some struct member and function arguments were missing.
Add them.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308171742.AncabIG1-lkp@intel.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-17 12:32:15 -07:00
Zheng Yejian eecb91b9f9 tracing: Fix memleak due to race between current_tracer and trace
Kmemleak report a leak in graph_trace_open():

  unreferenced object 0xffff0040b95f4a00 (size 128):
    comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
    hex dump (first 32 bytes):
      e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
      f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
    backtrace:
      [<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
      [<000000007df90faa>] graph_trace_open+0xb0/0x344
      [<00000000737524cd>] __tracing_open+0x450/0xb10
      [<0000000098043327>] tracing_open+0x1a0/0x2a0
      [<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
      [<000000004015bcd6>] vfs_open+0x98/0xd0
      [<000000002b5f60c9>] do_open+0x520/0x8d0
      [<00000000376c7820>] path_openat+0x1c0/0x3e0
      [<00000000336a54b5>] do_filp_open+0x14c/0x324
      [<000000002802df13>] do_sys_openat2+0x2c4/0x530
      [<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
      [<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
      [<00000000313647bf>] do_el0_svc+0xac/0xec
      [<000000002ef1c651>] el0_svc+0x20/0x30
      [<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
      [<000000000c309c35>] el0_sync+0x160/0x180

The root cause is descripted as follows:

  __tracing_open() {  // 1. File 'trace' is being opened;
    ...
    *iter->trace = *tr->current_trace;  // 2. Tracer 'function_graph' is
                                        //    currently set;
    ...
    iter->trace->open(iter);  // 3. Call graph_trace_open() here,
                              //    and memory are allocated in it;
    ...
  }

  s_start() {  // 4. The opened file is being read;
    ...
    *iter->trace = *tr->current_trace;  // 5. If tracer is switched to
                                        //    'nop' or others, then memory
                                        //    in step 3 are leaked!!!
    ...
  }

To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.

Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com

Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-17 13:49:37 -04:00
Peter Zijlstra 63304558ba sched/eevdf: Curb wakeup-preemption
Mike and others noticed that EEVDF does like to over-schedule quite a
bit -- which does hurt performance of a number of benchmarks /
workloads.

In particular, what seems to cause over-scheduling is that when lag is
of the same order (or larger) than the request / slice then placement
will not only cause the task to be placed left of current, but also
with a smaller deadline than current, which causes immediate
preemption.

[ notably, lag bounds are relative to HZ ]

Mike suggested we stick to picking 'current' for as long as it's
eligible to run, giving it uninterrupted runtime until it reaches
parity with the pack.

Augment Mike's suggestion by only allowing it to exhaust it's initial
request.

One random data point:

echo NO_RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

	3,723,554        context-switches      ( +-  0.56% )
	9.5136 +- 0.0394 seconds time elapsed  ( +-  0.41% )

echo RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

	2,556,535        context-switches      ( +-  0.51% )
	9.2427 +- 0.0302 seconds time elapsed  ( +-  0.33% )

Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
2023-08-17 17:07:07 +02:00
Paul E. McKenney fe24a0b632 Merge branches 'doc.2023.07.14b', 'fixes.2023.08.16a', 'rcu-tasks.2023.07.24a', 'rcuscale.2023.07.14b', 'refscale.2023.07.14b', 'torture.2023.08.14a' and 'torturescripts.2023.07.20a' into HEAD
doc.2023.07.14b:  Documentation updates.
fixes.2023.08.16a:  Miscellaneous fixes.
rcu-tasks.2023.07.24a:  RCU Tasks updates.
rcuscale.2023.07.14b:  RCU (updater) scalability test updates.
refscale.2023.07.14b:  Reference (reader) scalability test updates.
torture.2023.08.14a:  Other torture-test updates.
torturescripts.2023.07.20a:  Other torture-test scripting updates.
2023-08-16 14:31:08 -07:00
Paul E. McKenney 3292ba0229 rcu: Make the rcu_nocb_poll boot parameter usable via boot config
The rcu_nocb_poll kernel boot parameter is defined via early_param(),
whose parsing functions are invoked from parse_early_param() which
is in turn invoked by setup_arch(), which is very early indeed.  It
is invoked so early that the console output timestamps read 0.000000,
in other words, before time begins.

This use of early_param() means that the rcu_nocb_poll kernel boot
parameter cannot usefully be embedded into the kernel image.  Yes, you
can embed it, but setup_boot_config() is invoked from start_kernel()
too late for it to be parsed.

But it makes no sense to parse this parameter so early.  After all,
it cannot do anything until the rcuog kthreads are created, which is
long after rcu_init() time, let alone setup_boot_config() time.

This commit therefore switches the rcu_nocb_poll kernel boot parameter
from early_param() to __setup(), which allows boot-config parsing of
this parameter, in turn allowing it to be embedded into the kernel image.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-08-16 14:27:41 -07:00
Paul E. McKenney 343640cb5b rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load
The rcu_request_urgent_qs_task() function does a cross-CPU store
to ->rcu_urgent_qs, so this commit therefore marks the load in
__rcu_irq_enter_check_tick() with READ_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-08-16 14:27:41 -07:00
Sven Schnelle c4d6b54381 tracing/synthetic: Allocate one additional element for size
While debugging another issue I noticed that the stack trace contains one
invalid entry at the end:

<idle>-0       [008] d..4.    26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK:
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x150/0x2c0
=> kcompactd+0x9ca/0xc20
=> kthread+0x2f6/0x3d8
=> __ret_from_fork+0x8a/0xe8
=> 0x6b6b6b6b6b6b6b6b

This is because the code failed to add the one element containing the
number of entries to field_size.

Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16 16:37:07 -04:00
Sven Schnelle 887f92e09e tracing/synthetic: Skip first entry for stack traces
While debugging another issue I noticed that the stack trace output
contains the number of entries on top:

         <idle>-0       [000] d..4.   203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK:
=> 0x10
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x242/0x2c0
=> __wait_for_common+0x434/0x680
=> __wait_rcu_gp+0x198/0x3e0
=> synchronize_rcu+0x112/0x138
=> ring_buffer_reset_online_cpus+0x140/0x2e0
=> tracing_reset_online_cpus+0x15c/0x1d0
=> tracing_set_clock+0x180/0x1d8
=> hist_register_trigger+0x486/0x670
=> event_hist_trigger_parse+0x494/0x1318
=> trigger_process_regex+0x1d4/0x258
=> event_trigger_write+0xb4/0x170
=> vfs_write+0x210/0xad0
=> ksys_write+0x122/0x208

Fix this by skipping the first element. Also replace the pointer
logic with an index variable which is easier to read.

Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16 16:34:25 -04:00
Sven Schnelle ddeea494a1 tracing/synthetic: Use union instead of casts
The current code uses a lot of casts to access the fields member in struct
synth_trace_events with different sizes.  This makes the code hard to
read, and had already introduced an endianness bug. Use a union and struct
instead.

Link: https://lkml.kernel.org/r/20230816154928.4171614-2-svens@linux.ibm.com

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16 16:33:27 -04:00
Zheng Yejian b71645d6af tracing: Fix cpu buffers unavailable due to 'record_disabled' missed
Trace ring buffer can no longer record anything after executing
following commands at the shell prompt:

  # cd /sys/kernel/tracing
  # cat tracing_cpumask
  fff
  # echo 0 > tracing_cpumask
  # echo 1 > snapshot
  # echo fff > tracing_cpumask
  # echo 1 > tracing_on
  # echo "hello world" > trace_marker
  -bash: echo: write error: Bad file descriptor

The root cause is that:
  1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers
     in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask());
  2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped
     with 'tr->max_buffer.buffer', then the 'record_disabled' became 0
     (see update_max_tr());
  3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1;
Then array_buffer and max_buffer are both unavailable due to value of
'record_disabled' is not 0.

To fix it, enable or disable both array_buffer and max_buffer at the same
time in tracing_set_cpumask().

Link: https://lkml.kernel.org/r/20230805033816.3284594-2-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Cc: <shuah@kernel.org>
Fixes: 71babb2705 ("tracing: change CPU ring buffer state from tracing_cpumask")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16 15:12:42 -04:00
Enlin Mu 3e00123a13 printk: export symbols for debug modules
the module is out-of-tree, it saves kernel logs when panic

Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230815020711.2604939-1-yunlong.xing@unisoc.com
2023-08-16 17:06:38 +02:00
Yafang Shao 0aa35162d2 bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()
The commit 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event") leads
to the following Smatch static checker warning:

    kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe()
    error: uninitialized symbol 'type'.

That can happens when uname is NULL. So fix it by verifying the uname when we
really need to fill it.

Fixes: 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain
Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com
2023-08-16 16:44:23 +02:00
Benjamin Gray 53834a0c09 perf/hw_breakpoint: Remove arch breakpoint hooks
PowerPC was the only user of these hooks, and has been refactored to no
longer require them. There is no need to keep them around, so remove
them to reduce complexity.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230801011744.153973-8-bgray@linux.ibm.com
2023-08-16 23:54:50 +10:00
Joel Granados 9edbfe92a0 sysctl: Add size to register_sysctl
This commit adds table_size to register_sysctl in preparation for the
removal of the sentinel elements in the ctl_table arrays (last empty
markers). And though we do *not* remove any sentinels in this commit, we
set things up by either passing the table_size explicitly or using
ARRAY_SIZE on the ctl_table arrays.

We replace the register_syctl function with a macro that will add the
ARRAY_SIZE to the new register_sysctl_sz function. In this way the
callers that are already using an array of ctl_table structs do not
change. For the callers that pass a ctl_table array pointer, we pass the
table_size to register_sysctl_sz instead of the macro.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Joel Granados bff97cf11b sysctl: Add a size arg to __register_sysctl_table
We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.

We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.

The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allows us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Atul Kumar Pant b1a0f64cc6 audit: move trailing statements to next line
Fixes following checkpatch.pl issue:
ERROR: trailing statements should be on next line

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-08-15 18:16:14 -04:00
Atul Kumar Pant 22cde1012f audit: cleanup function braces and assignment-in-if-condition
The patch fixes following checkpatch.pl issue:
ERROR: open brace '{' following function definitions go on the next line
ERROR: do not use assignment in if condition

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-08-15 18:10:56 -04:00
Atul Kumar Pant 62acadda11 audit: add space before parenthesis and around '=', "==", and '<'
Fixes following checkpatch.pl issue:
ERROR: space required before the open parenthesis '('
ERROR: spaces required around that '='
ERROR: spaces required around that '<'
ERROR: spaces required around that '=='

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-08-15 18:09:20 -04:00
David Vernet 8ba651ed7f bpf: Support default .validate() and .update() behavior for struct_ops links
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
define the .validate() and .update() callbacks in its corresponding
struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
in its own right to ensure that the map is unloaded if an application
crashes. For example, with sched_ext, we want to automatically unload
the host-wide scheduler if the application crashes. We would likely
never support updating elements of a sched_ext struct_ops map, so we'd
have to implement these callbacks showing that they _can't_ support
element updates just to benefit from the basic lifetime management of
struct_ops links.

Let's enable struct_ops maps to work with BPF_F_LINK even if they
haven't defined these callbacks, by assuming that a struct_ops map
element cannot be updated by default.

Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14 22:23:39 -07:00
Lu Jialin 82b90b6c5b cgroup:namespace: Remove unused cgroup_namespaces_init()
cgroup_namspace_init() just return 0. Therefore, there is no need to
call it during start_kernel. Just remove it.

Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-14 14:29:47 -10:00
Aaron Tomlin b6a46f7263 workqueue: Rename rescuer kworker
Each CPU-specific and unbound kworker kthread conforms to a particular
naming scheme. However, this does not extend to the rescuer kworker.
At present, a rescuer kworker is simply named according to its
workqueue's name. This can be cryptic.

This patch modifies a rescuer to follow the kworker naming scheme.
The "R" is indicative of a rescuer and after "-" is its workqueue's
name e.g. "kworker/R-ext4-rsv-conver".

tj: Use "R" instead of "r" as the prefix to make it more distinctive and
    consistent with how highpri pools are marked.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-14 14:20:26 -10:00
Paul E. McKenney bc19e86e28 rcutorture: Stop right-shifting torture_random() return values
Now that torture_random() uses swahw32(), its callers no longer see
not-so-random low-order bits, as these are now swapped up into the upper
16 bits of the torture_random() function's return value.  This commit
therefore removes the right-shifting of torture_random() return values.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:08 -07:00
Paul E. McKenney 6cab60ceb1 torture: Stop right-shifting torture_random() return values
Now that torture_random() uses swahw32(), its callers no longer see
not-so-random low-order bits, as these are now swapped up into the upper
16 bits of the torture_random() function's return value.  This commit
therefore removes the right-shifting of torture_random() return values.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:08 -07:00
Paul E. McKenney 10af43671e torture: Move stutter_wait() timeouts to hrtimers
In order to gain better race coverage, move the test start/stop
waits in stutter_wait() to torture_hrtimeout_jiffies().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:08 -07:00
Paul E. McKenney dea81dcfd3 torture: Move torture_shuffle() timeouts to hrtimers
In order to gain better race coverage, move the CPU-migration timed
waits in torture_shuffle() to torture_hrtimeout_jiffies().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:08 -07:00
Paul E. McKenney 3f0c06e1cb torture: Move torture_onoff() timeouts to hrtimers
In order to gain better race coverage, move the CPU-hotplug-related
timed waits in torture_onoff() to torture_hrtimeout_jiffies().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:08 -07:00
Paul E. McKenney 872948c665 torture: Make torture_hrtimeout_*() use TASK_IDLE
Given that it is expected that more code will use torture_hrtimeout_*(),
including for longer timeouts, make it use TASK_IDLE instead of
TASK_UNINTERRUPTIBLE.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:07 -07:00
Dietmar Eggemann 5d248bb39f torture: Add lock_torture writer_fifo module parameter
This commit adds a module parameter that causes the locktorture writer
to run at real-time priority.

To use it:
insmod /lib/modules/torture.ko random_shuffle=1
insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3 writer_fifo=1
													^^^^^^^^^^^^^

A predecessor to this patch has been helpful to uncover issues with the
proxy-execution series.

[ paulmck: Remove locktorture-specific code from kernel/torture.c. ]

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: kernel-team@android.com
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[jstultz: Include header change to build, reword commit message]
Signed-off-by: John Stultz <jstultz@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-08-14 15:01:07 -07:00
Paul E. McKenney 67d5404d27 torture: Add a kthread-creation callback to _torture_create_kthread()
This commit adds a kthread-creation callback to the
_torture_create_kthread() function, which allows callers of a new
torture_create_kthread_cb() macro to specify a function to be invoked
after the kthread is created but before it is awakened for the first time.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <jstultz@google.com>
2023-08-14 15:00:37 -07:00
Paul E. McKenney 9d0cce2bc3 rcu-tasks: Fix boot-time RCU tasks debug-only deadlock
In kernels built with CONFIG_PROVE_RCU=y (for example, lockdep kernels),
the following sequence of events can occur:

o	rcu_init_tasks_generic() is invoked just before init is spawned.
	It invokes rcu_spawn_tasks_kthread() and friends.

o	rcu_spawn_tasks_kthread() invokes rcu_spawn_tasks_kthread_generic(),
	which uses kthread_run() to create the needed kthread.

o	Control returns to rcu_init_tasks_generic(), which, because this
	is a CONFIG_PROVE_RCU=y kernel, invokes the version of the
	rcu_tasks_initiate_self_tests() function that actually does
	something, including invoking synchronize_rcu_tasks(), which
	in turn invokes synchronize_rcu_tasks_generic().

o	synchronize_rcu_tasks_generic() sees that the ->kthread_ptr is
	still NULL, because the newly spawned kthread has not yet
	started.

o	The new kthread starts, preempting synchronize_rcu_tasks_generic()
	just after its check.  This kthread invokes rcu_tasks_one_gp(),
	which acquires ->tasks_gp_mutex, and, seeing no work, blocks
	in rcuwait_wait_event().  Note that this step requires either
	a preemptible kernel or a fault-injection-style sleep at the
	beginning of mutex_lock().

o	synchronize_rcu_tasks_generic() resumes and invokes rcu_tasks_one_gp().

o	rcu_tasks_one_gp() attempts to acquire ->tasks_gp_mutex, which
	is still held by the newly spawned kthread's rcu_tasks_one_gp()
	function.  Deadlock.

Because the only reason for ->tasks_gp_mutex is to handle pre-kthread
synchronous grace periods, this commit avoids this deadlock by having
rcu_tasks_one_gp() momentarily release ->tasks_gp_mutex while invoking
rcuwait_wait_event().  This allows the call to rcu_tasks_one_gp() from
synchronize_rcu_tasks_generic() proceed.

Note that it is not necessary to release the mutex anywhere else in
rcu_tasks_one_gp() because rcuwait_wait_event() is the only function
that can block indefinitely.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Roy Hopkins <rhopkins@suse.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Roy Hopkins <rhopkins@suse.de>
2023-08-14 14:58:25 -07:00
Peter Zijlstra 7170509cad sched: Simplify sched_core_cpu_{starting,deactivate}()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.371787909@infradead.org
2023-08-14 17:01:27 +02:00
Peter Zijlstra b4e1fa1e14 sched: Simplify try_steal_cookie()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.304154828@infradead.org
2023-08-14 17:01:27 +02:00
Peter Zijlstra 6dafc713e3 sched: Simplify sched_tick_remote()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.236247952@infradead.org
2023-08-14 17:01:26 +02:00
Peter Zijlstra 4bdada79f3 sched: Simplify sched_exec()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.168490417@infradead.org
2023-08-14 17:01:26 +02:00
Peter Zijlstra 857d315f12 sched: Simplify ttwu()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.101069260@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 4eb054f92b sched: Simplify wake_up_if_idle()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 5bb76f1ddf sched: Simplify: migrate_swap_stop()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 0f92cdf36f sched: Simplify sysctl_sched_uclamp_handler()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.896559109@infradead.org
2023-08-14 17:01:24 +02:00
Peter Zijlstra 7537b90c00 sched: Simplify get_nohz_timer_target()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org
2023-08-14 17:01:24 +02:00
Cyril Hrubis c1fc6484e1 sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
The sched_rr_timeslice can be reset to default by writing value that is
<= 0. However after reading from this file we always got the last value
written, which is not useful at all.

$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1

Fix this by setting the variable that holds the sysctl file value to the
jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written.

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
2023-08-14 17:01:23 +02:00
Cyril Hrubis c7fcb99877 sched/rt: Fix sysctl_sched_rr_timeslice intial value
There is a 10% rounding error in the intial value of the
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.

This was found with LTP test sched_rr_get_interval01:

sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90

What this test does is to compare the return value from the
sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and
fails if they do not match.

The problem it found is the intial sysctl file value which was computed as:

static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

which works fine as long as MSEC_PER_SEC is multiple of HZ, however it
introduces 10% rounding error for CONFIG_HZ_300:

(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)

(1000 / 300) * (100 * 300 / 1000)

3 * 30 = 90

This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:

(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ

(1000 * (100 * 300 / 1000)) / 300

(1000 * 30) / 300 = 100

Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
2023-08-14 17:01:23 +02:00
Kees Cook 53e9e33ede printk: ringbuffer: Fix truncating buffer size min_t cast
If an output buffer size exceeded U16_MAX, the min_t(u16, ...) cast in
copy_data() was causing writes to truncate. This manifested as output
bytes being skipped, seen as %NUL bytes in pstore dumps when the available
record size was larger than 65536. Fix the cast to no longer truncate
the calculation.

Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: John Ogness <john.ogness@linutronix.de>
Reported-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/d8bb1ec7-a4c5-43a2-9de0-9643a70b899f@linux.microsoft.com/
Fixes: b6cf8b3f33 ("printk: add lockless ringbuffer")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Reviewed-by: Tyler Hicks (Microsoft) <code@tyhicks.com>
Tested-by: Tyler Hicks (Microsoft) <code@tyhicks.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230811054528.never.165-kees@kernel.org
2023-08-14 13:05:22 +02:00
Rafael J. Wysocki 8e1d6a9223 Merge back system-wide sleep material for v6.6. 2023-08-14 09:55:44 +02:00
Linus Torvalds 9578b04c32 Power management fixes for 6.5-rc6
- Make amd-pstate use device_attributes as expected by the CPU root
    kobject (Thomas Weißschuh).
 
  - Restore the previous behavior of resume_store() when hibernation is
    not available which is to return the full number of bytes that were
    to be written by user space (Vlastimil Babka).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTWgJ8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxGEgP/01+F+nmq0c5QebC3LWw4cyYuepeCJ86
 jfIbJR+XHOoiTaQMORHKBEk8xlelL/R65tRhkB/Gq1uFzeIId+xYJJlsW4Lpj7bz
 rx/FXOAW8mAyPe/kNitBtcjh4tqEiPBiVzn1tKTA4OOLm0CzOE5v9KML93U2vsOa
 Y2I3Jp1N6HHC8oRzbYpQgvB6R2MXX/oRd5fCvrVyMidFFbgYz8sWssRe8eUTGFAj
 U/bufaKM7N/qlavikSul1f4T3KpRN+xpu7+I3W6M5/w0EQt663u3TffY1Mo+qllB
 uoIM7emwsR6J6WsJyWbHgZEh/fIPmPAhGtsUsam9dN4aoDXfac2Trqrf+xYYbAtS
 7mafAyWa+NxQCy/90QxoTrqhj3U4/dIbne4l1ZqgZQ7vyzM/NA4Gi0VBDEpt1BZU
 q6uvhS4PXvkRm/PezQSQCSMaP66F0erMCHxKTXTN1wYNob0AKjV6l1bmG5LdPcIh
 Nsk+CDkAVGmbqfDrtek9FfJZWgH3/lPDg0oVVMi9WiE8CdhYfKoB+Eh/MFVGiiDg
 69cogAHqTUeuB46NPNedeOacGc6F0+mnAwkgNkClCTCHZJ0QSDlh2yVR003ZhnUj
 sHx6jf6rYodW+nBQydjUzVm+twH47tltY0ibzN3ZIXiMM0UlALHBF+Oj4hOtGxUa
 jiiqkLyB/9kH
 =0RaA
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an amd-pstate cpufreq driver issues and recently introduced
  hibernation-related breakage.

  Specifics:

   - Make amd-pstate use device_attributes as expected by the CPU root
     kobject (Thomas Weißschuh)

   - Restore the previous behavior of resume_store() when hibernation is
     not available which is to return the full number of bytes that were
     to be written by user space (Vlastimil Babka)"

* tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: amd-pstate: fix global sysfs attribute type
  PM: hibernate: fix resume_store() return value when hibernation not available
2023-08-11 12:24:22 -07:00
Jakub Kicinski 6a1ed1430d bpf-next pull-request 2023-08-09
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRx8QAKCRBar9k/UBDW
 46MBAQC3YDFsEfPzX4P7ZnlM5Lf1NynjNbso5bYW0TF/dp/Y+gD+M8wdM5Vj2Mb0
 Zr56TnwCJei0kGBemiel4sStt3e4qwY=
 =+0u+
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-09

We've added 19 non-merge commits during the last 6 day(s) which contain
a total of 25 files changed, 369 insertions(+), 141 deletions(-).

The main changes are:

1) Fix array-index-out-of-bounds access when detaching from an
   already empty mprog entry from Daniel Borkmann.

2) Adjust bpf selftest because of a recent llvm change
   related to the cpu-v4 ISA from Eduard Zingerman.

3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa.

4) Fix a KASAN splat due to the kernel incorrectly accepted
   an invalid program using the recent cpu-v4 instruction from
   Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf: btf: Remove two unused function declarations
  bpf: lru: Remove unused declaration bpf_lru_promote()
  selftests/bpf: relax expected log messages to allow emitting BPF_ST
  selftests/bpf: remove duplicated functions
  bpf, docs: Fix small typo and define semantics of sign extension
  selftests/bpf: Add bpf_get_func_ip test for uprobe inside function
  selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry
  bpf: Add support for bpf_get_func_ip helper for uprobe program
  selftests/bpf: Add a movsx selftest for sign-extension of R10
  bpf: Fix an incorrect verification success with movsx insn
  bpf, docs: Formalize type notation and function semantics in ISA standard
  bpf: change bpf_alu_sign_string and bpf_movsx_string to static
  libbpf: Use local includes inside the library
  bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
  bpf: fix inconsistent return types of bpf_xdp_copy_buf().
  selftests/bpf: fix the incorrect verification of port numbers.
  selftests/bpf: Add test for detachment on empty mprog entry
  bpf: Fix mprog detachment for empty mprog entry
  bpf: bpf_struct_ops: Remove unnecessary initial values of variables
====================

Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 14:12:34 -07:00
Jakub Kicinski 4d016ae42e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/igc/igc_main.c
  06b412589e ("igc: Add lock to safeguard global Qbv variables")
  d3750076d4 ("igc: Add TransmissionOverrun counter")

drivers/net/ethernet/microsoft/mana/mana_en.c
  a7dfeda6fd ("net: mana: Fix MANA VF unload when hardware is unresponsive")
  a9ca9f9cef ("page_pool: split types and declarations from page_pool.h")
  92272ec410 ("eth: add missing xdp.h includes in drivers")

net/mptcp/protocol.h
  511b90e392 ("mptcp: fix disconnect vs accept race")
  b8dc6d6ce9 ("mptcp: fix rcv buffer auto-tuning")

tools/testing/selftests/net/mptcp/mptcp_join.sh
  c8c101ae39 ("selftests: mptcp: join: fix 'implicit EP' test")
  03668c65d1 ("selftests: mptcp: join: rework detailed report")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 14:10:53 -07:00
Ingo Molnar b41bbb33cf Merge branch 'sched/eevdf' into sched/core
Pick up the EEVDF work into the main branch - it's looking good so far.

 Conflicts:
	kernel/sched/features.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-08-10 09:05:43 +02:00
Yue Haibing 526bc5ba19 bpf: lru: Remove unused declaration bpf_lru_promote()
Commit 3a08c2fd76 ("bpf: LRU List") declared but never implemented this.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20230808145531.19692-1-yuehaibing@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-08 17:21:42 -07:00
Khadija Kamran 6672efbb68 lsm: constify the 'target' parameter in security_capget()
Three LSMs register the implementations for the "capget" hook: AppArmor,
SELinux, and the normal capability code. Looking at the function
implementations we may observe that the first parameter "target" is not
changing.

Mark the first argument "target" of LSM hook security_capget() as
"const" since it will not be changing in the LSM hook.

cap_capget() LSM hook declaration exceeds the 80 characters per line
limit. Split the function declaration to multiple lines to decrease the
line length.

Signed-off-by: Khadija Kamran <kamrankhadijadj@gmail.com>
Acked-by: John Johansen <john.johansen@canonical.com>
[PM: align the cap_capget() declaration, spelling fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-08-08 16:48:47 -04:00
Gaosheng Cui b59bc6e372 audit: fix possible soft lockup in __audit_inode_child()
Tracefs or debugfs maybe cause hundreds to thousands of PATH records,
too many PATH records maybe cause soft lockup.

For example:
  1. CONFIG_KASAN=y && CONFIG_PREEMPTION=n
  2. auditctl -a exit,always -S open -k key
  3. sysctl -w kernel.watchdog_thresh=5
  4. mkdir /sys/kernel/debug/tracing/instances/test

There may be a soft lockup as follows:
  watchdog: BUG: soft lockup - CPU#45 stuck for 7s! [mkdir:15498]
  Kernel panic - not syncing: softlockup: hung tasks
  Call trace:
   dump_backtrace+0x0/0x30c
   show_stack+0x20/0x30
   dump_stack+0x11c/0x174
   panic+0x27c/0x494
   watchdog_timer_fn+0x2bc/0x390
   __run_hrtimer+0x148/0x4fc
   __hrtimer_run_queues+0x154/0x210
   hrtimer_interrupt+0x2c4/0x760
   arch_timer_handler_phys+0x48/0x60
   handle_percpu_devid_irq+0xe0/0x340
   __handle_domain_irq+0xbc/0x130
   gic_handle_irq+0x78/0x460
   el1_irq+0xb8/0x140
   __audit_inode_child+0x240/0x7bc
   tracefs_create_file+0x1b8/0x2a0
   trace_create_file+0x18/0x50
   event_create_dir+0x204/0x30c
   __trace_add_new_event+0xac/0x100
   event_trace_add_tracer+0xa0/0x130
   trace_array_create_dir+0x60/0x140
   trace_array_create+0x1e0/0x370
   instance_mkdir+0x90/0xd0
   tracefs_syscall_mkdir+0x68/0xa0
   vfs_mkdir+0x21c/0x34c
   do_mkdirat+0x1b4/0x1d4
   __arm64_sys_mkdirat+0x4c/0x60
   el0_svc_common.constprop.0+0xa8/0x240
   do_el0_svc+0x8c/0xc0
   el0_svc+0x20/0x30
   el0_sync_handler+0xb0/0xb4
   el0_sync+0x160/0x180

Therefore, we add cond_resched() to __audit_inode_child() to fix it.

Fixes: 5195d8e217 ("audit: dynamically allocate audit_names when not enough space is in the names array")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-08-08 14:45:20 -04:00
Petr Tesarik d069ed288a swiotlb: optimize get_max_slots()
Use a simple logical shift and increment to calculate the number of slots
taken by the DMA segment boundary.

At least GCC-13 is not able to optimize the expression, producing this
horrible assembly code on x86:

	cmpq	$-1, %rcx
	je	.L364
	addq	$2048, %rcx
	shrq	$11, %rcx
	movq	%rcx, %r13
.L331:
	// rest of the function here...

	// after function epilogue and return:
.L364:
	movabsq $9007199254740992, %r13
	jmp	.L331

After the optimization, the code looks more reasonable:

	shrq	$11, %r11
	leaq	1(%r11), %rbx

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-08 10:29:21 -07:00
Petr Tesarik f94cb36e76 swiotlb: move slot allocation explanation comment where it belongs
Move the comment down in front of the loop that actually sets the list
member of struct io_tlb_slot to zero.

Fixes: 26a7e09478 ("swiotlb: refactor swiotlb_tbl_map_single")
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-08 10:29:06 -07:00
Tejun Heo 523a301e66 workqueue: Make default affinity_scope dynamically updatable
While workqueue.default_affinity_scope is writable, it only affects
workqueues which are created afterwards and isn't very useful. Instead,
let's introduce explicit "default" scope and update the effective scope
dynamically when workqueue.default_affinity_scope is changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:25 -10:00
Tejun Heo 8639ecebc9 workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define the
CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools
each serving one L3 cache domains.

While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.

While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now an a lot more pressing problem.

This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. ie. work items start executing within its affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
  ->__pod_cpumask so that the workers are allowed to run on any CPU that
  the associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
  the field to a CPU within the pod.

This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.

While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.

v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-07 15:57:25 -10:00
Tejun Heo 9546b29e4a workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:

* to specify the required unouned workqueue properties by users

* to match worker_pool's properties to workqueues by core code

For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contains CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.

Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.

To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.

This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.

* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
  that the pool's workers must stay within. This is currently always
  ->__pod_cpumask as all boundaries are still strict.

* As a workqueue_attrs can now track both the associated workqueues' cpumask
  and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
  out-argument. Drop @cpumask and instead store the result in
  ->__pod_cpumask.

* The above also simplifies apply_wqattrs_prepare() as the same
  workqueue_attrs can be used to create all pods associated with a
  workqueue. tmp_attrs is dropped.

* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
  update is needed instead of only comparing ->cpumask so that
  ->__pod_cpumask is compared too. It can directly compare ->__pod_cpumaks
  but the code is easier to understand and more robust this way.

The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.

While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.

v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
      to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
      using wqattrs_equal() for comparison instead.

    - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
      a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:25 -10:00
Tejun Heo 0219a3528d workqueue: Factor out need_more_worker() check and worker wake-up
Checking need_more_worker() and calling wake_up_worker() is a repeated
pattern. Let's add kick_pool(), which checks need_more_worker() and
open-code wake_up_worker(), and replace wake_up_worker() uses. The following
conversions aren't one-to-one:

* __queue_work() was using __need_more_work() because it knows that
  pool->worklist isn't empty. Switching to kick_pool() adds an extra
  list_empty() test.

* create_worker() always needs to wake up the newly minted worker whether
  there's more work to do or not to avoid triggering hung task check on the
  new task. Keep the current wake_up_process() and still add kick_pool().
  This may lead to an extra wakeup which isn't harmful.

* pwq_adjust_max_active() was explicitly checking whether it needs to wake
  up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
  a worker when necessary, this explicit check is no longer necessary and
  dropped.

* unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
  a need_more_worker() test. This avoids spurious wakeups and shouldn't
  break anything.

wake_up_worker() is dropped as kick_pool() replaces all its users. After
this patch, all paths that wakes up a non-rescuer worker to initiate work
item execution use kick_pool(). This will enable future changes to improve
locality.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:25 -10:00
Tejun Heo 873eaca6ea workqueue: Factor out work to worker assignment and collision handling
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_schedule_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.

This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to excute the work once asssigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.

This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collision
handling is handled earlier, process_one_work() no longer needs to worry
about them.

After the this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.

This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:25 -10:00
Tejun Heo 63c5484e74 workqueue: Add multiple affinity scopes and interface to select them
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.

Also add module parameter "workqueue.default_affinity_scope" to override the
default scope and "affinity_scope" sysfs file to configure it per workqueue.
wq_dump.py and documentations are updated accordingly.

This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.

Many modern machines have multiple L3 caches often while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unncessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.

While dependent on the specifics of workload, there's usually a noticeable
penalty in crossing L3 boundaries, so let's default to CACHE. This issue
will be further addressed and documented with examples in future patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo 025e168458 workqueue: Modularize wq_pod_type initialization
While wq_pod_type[] can now group CPUs in any aribitrary way, WQ_AFFN_NUM
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type() which takes a callback to
determine whether two CPUs should share a pod as an argument.

init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa() which tests whether the CPU belongs to the same NUMA node.

This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
are tracked separately in pod_type->pod_node[]. This makes adding new
affinty types pretty easy.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo 84193c0710 workqueue: Generalize unbound CPU pods
While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:

* workqueue_attrs->affn_scope is added. Each enum represents the type of
  boundaries that define the pods. There are currently two scopes -
  WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
  - one pod per NUMA node. The latter defines one global pod across the
  whole system.

* struct wq_pod_type is added which describes how pods are configured for
  each affnity scope. For each pod, it lists the member CPUs and the
  preferred NUMA node for memory allocations. The reverse mapping from CPU
  to pod is also available.

* wq_pod_enabled is dropped. Pod is now always enabled. The previously
  disabled behavior is now implemented through WQ_AFFN_SYSTEM.

* get_unbound_pool() wants to determine the NUMA node to allocate memory
  from for the new pool. The variables are renamed from node to pod but the
  logic still assumes they're one and the same. Clearly distinguish them -
  walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
  NUMA node.

* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
  node. Take @cpu instead and determine the cpumask to use from the pod_type
  matching @attrs.

* apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of
  NULL so that it can indicate -EINVAL on invalid affinity scopes.

This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.

v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
    WQ_AFFN_NR_TYPES which indicates that the function is called with a
    worker_pool's attrs instead of a workqueue's.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo 5de7a03cac workqueue: Factor out clearing of workqueue-only attrs fields
workqueue_attrs can be used for both workqueues and worker_pools. However,
some fields, currently only ->ordered, only apply to workqueues and should
be cleared to the default / invalid values.

Currently, an unbound workqueue explicitly clears attrs->ordered in
get_unbound_pool() after copying the source workqueue attrs, while per-cpu
workqueues rely on the fact that zeroing on allocation gives us the desired
default value for pool->attrs->ordered.

This is fragile. Let's add wqattrs_clear_for_pool() which clears
attrs->ordered and is called from both init_worker_pool() and
get_unbound_pool(). This will ease adding more workqueue-only attrs fields.

In get_unbound_pool(), pool->node initialization is moved upwards for
readability. This shouldn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo 0f36ee24cd workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
For an unbound pool, multiple cpumasks are involved.

U: The user-specified cpumask (may be filtered with cpu_possible_mask).

A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
   leaves no CPU, wq_unbound_cpumask is used.

P: Per-pod subsets of #A.

wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.

wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.

This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.

This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool_attrs.

This shouldn't cause any behavior changes in the current code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
2023-08-07 15:57:24 -10:00
Tejun Heo 2930155b2e workqueue: Initialize unbound CPU pods later in the boot
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.

Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().

As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.

Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.

This patch changes the initialization sequence but the end result should be
the same.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo a86feae619 workqueue: Move wq_pod_init() below workqueue_init()
wq_pod_init() is called from workqueue_init() and responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().

The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.

Just a relocation. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:24 -10:00
Tejun Heo fef59c9cab workqueue: Rename NUMA related names to use pod instead
Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.

While "pod" isn't a very common term, it short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.

* wq_numa_possible_cpumask -> wq_pod_cpus

* wq_numa_enabled -> wq_pod_enabled

* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf

* workqueue_select_cpu_near -> select_numa_node_cpu

  This rename is different from others. The function is only used by
  queue_work_node() and specifically tries to find a CPU in the specified
  NUMA node. As workqueue affinity will become more flexible and untied from
  NUMA, this function's name should specifically describe that it's for
  NUMA.

* wq_calc_node_cpumask -> wq_calc_pod_cpumask

* wq_update_unbound_numa -> wq_update_pod

* wq_numa_init -> wq_pod_init

* node -> pod in local variables

Only renames. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo af73f5c9fe workqueue: Rename workqueue_attrs->no_numa to ->ordered
With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.

Just a rename. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo 636b927eba workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.

As a per-cpu workqueue should be assocaited with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:

* Because @max_active is per-pwq, the meaning of @max_active changes
  depending on the machine configuration and whether workqueue NUMA locality
  support is enabled.

* Makes per-cpu and unbound code deviate.

* Gets in the way of making workqueue CPU locality awareness more flexible.

This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:

* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
  just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
  workqueues.

* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
  the specified pwq to the target CPU's wq->cpu_pwq.

* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
  unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
  This makes the return value of wq_calc_node_cpumask() unnecessary. It now
  returns void.

* @max_active now means the same thing for both per-cpu and unbound
  workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
  documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
  used in workqueue implementation and will be removed later.

* All unbound pwq operations which used to be per-numa-node are now per-cpu.

For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.

One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-07 15:57:23 -10:00
Tejun Heo 4cbfd3de73 workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
When a CPU went online or offline, wq_update_unbound_numa() was called only
on the CPU which was going up or down. This works fine because all CPUs on
the same NUMA node share the same pool_workqueue slot - one CPU updating it
updates it for everyone in the node.

However, future changes will make each CPU use a separate pool_workqueue
even when they're sharing the same worker_pool, which requires updating
pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
on hotplug.

To accommodate the planned changes, this patch updates
workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
all CPUs sharing the same NUMA node as the CPU going up or down. In the
current code, the second+ calls would be noops and there shouldn't be any
behavior changes.

* As wq_update_unbound_numa() is now called on multiple CPUs per each
  hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
  is added. The former indicates the CPU being hot[un]plugged and the latter
  the CPU whose pool_workqueue is being updated.

* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
  with the new @hotplug_cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo 687a9aa56f workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.

Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitiable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.

In preparation, this patch makes per-cpu pwq's to be allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.

pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu work items.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo 967b494e2f workqueue: Use a kthread_worker to release pool_workqueues
pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus has risk of causing a A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.

While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too making the current approach
unusable.

Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo fcecfa8f27 workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo 797e8345cb workqueue: Relocate worker and work management functions
Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo ee1ceef727 workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq
wq->cpu_pwqs is a percpu variable carraying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:23 -10:00
Tejun Heo fe089f87cc workqueue: Not all work insertion needs to wake up a worker
insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or queueing a flush work item,
there's no reason to try to wake up a worker.

This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().

While at it:

* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
  pool.

* Every caller of insert_work() calls debug_work_activate(). Consolidate the
  invocations into insert_work().

* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
  to better accommodate future changes.

This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.

v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
    suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-08-07 15:57:22 -10:00
Tejun Heo c0ab017d43 workqueue: Cleanups around process_scheduled_works()
* Drop the trivial optimization in worker_thread() where it bypasses calling
  process_scheduled_works() if the first work item isn't linked. This is a
  mostly pointless micro optimization and gets in the way of improving the
  work processing path.

* Consolidate pool->watchdog_ts updates in the two callers into
  process_scheduled_works().

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:22 -10:00
Tejun Heo bc8b50c2df workqueue: Drop the special locking rule for worker->flags and worker_pool->flags
worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.

Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().

This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assinging work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.

The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:57:22 -10:00
Tejun Heo 87437656c2 workqueue: Merge branch 'for-6.5-fixes' into for-6.6
Unbound workqueue execution locality improvement patchset is about to
applied which will cause merge conflicts with changes in for-6.5-fixes.
Let's avoid future merge conflict by pulling in for-6.5-fixes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 15:54:25 -10:00
Jiri Olsa a3c485a5d8 bpf: Add support for bpf_get_func_ip helper for uprobe program
Adding support for bpf_get_func_ip helper for uprobe program to return
probed address for both uprobe and return uprobe.

We discussed this in [1] and agreed that uprobe can have special use
of bpf_get_func_ip helper that differs from kprobe.

The kprobe bpf_get_func_ip returns:
  - address of the function if probe is attach on function entry
    for both kprobe and return kprobe
  - 0 if the probe is not attach on function entry

The uprobe bpf_get_func_ip returns:
  - address of the probe for both uprobe and return uprobe

The reason for this semantic change is that kernel can't really tell
if the probe user space address is function entry.

The uprobe program is actually kprobe type program attached as uprobe.
One of the consequences of this design is that uprobes do not have its
own set of helpers, but share them with kprobes.

As we need different functionality for bpf_get_func_ip helper for uprobe,
I'm adding the bool value to the bpf_trace_run_ctx, so the helper can
detect that it's executed in uprobe context and call specific code.

The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which
is currently used only for executing bpf programs in uprobe.

Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe
to address that it's only used for uprobes and that it sets the
run_ctx.is_uprobe as suggested by Yafang Shao.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
[1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07 16:42:58 -07:00
Yonghong Song db2baf82b0 bpf: Fix an incorrect verification success with movsx insn
syzbot reports a verifier bug which triggers a runtime panic.
The test bpf program is:
   0: (62) *(u32 *)(r10 -8) = 553656332
   1: (bf) r1 = (s16)r10
   2: (07) r1 += -8
   3: (b7) r2 = 3
   4: (bd) if r2 <= r1 goto pc+0
   5: (85) call bpf_trace_printk#-138320
   6: (b7) r0 = 0
   7: (95) exit

At insn 1, the current implementation keeps 'r1' as a frame pointer,
which caused later bpf_trace_printk helper call crash since frame
pointer address is not valid any more. Note that at insn 4,
the 'pointer vs. scalar' comparison is allowed for privileged
prog run.

To fix the problem with above insn 1, the fix in the patch adopts
similar pattern to existing 'R1 = (u32) R2' handling. For unprivileged
prog run, verification will fail with 'R<num> sign-extension part of pointer'.
For privileged prog run, the dst_reg 'r1' will be marked as
an unknown scalar, so later 'bpf_trace_pointk' helper will complain
since it expected certain pointers.

Reported-by: syzbot+d61b595e9205573133b3@syzkaller.appspotmail.com
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230807175721.671696-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07 16:23:35 -07:00
Linus Torvalds 14f9643dc9 workqueue: Fixes for v6.5-rc5
Two commits:
 
 * The recently added cpu_intensive auto detection and warning mechanism was
   spuriously triggered on slow CPUs. While not causing serious issues, it's
   still a nuisance and can cause unintended concurrency management
   behaviors. Relax the threshold on machines with lower BogoMIPS. While
   BogoMIPS is not an accurate measure of performance by most measures, we
   don't have to be accurate and it has rough but strong enough correlation.
 
 * A correction in Kconfig help text.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZNFMTQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGb+4AQCniWx3rwWWmLgviPR0AfYWbcQ8/P/qGh++fmsR
 tEF3sQD/bLdeWcVa1pSzXjhGtRVGsTis6oOhk81A0zIZlx0v2Qg=
 =sThu
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - The recently added cpu_intensive auto detection and warning mechanism
   was spuriously triggered on slow CPUs.

   While not causing serious issues, it's still a nuisance and can cause
   unintended concurrency management behaviors.

   Relax the threshold on machines with lower BogoMIPS. While BogoMIPS
   is not an accurate measure of performance by most measures, we don't
   have to be accurate and it has rough but strong enough correlation.

 - A correction in Kconfig help text

* tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
  workqueue: Fix cpu_intensive_thresh_us name in help text
2023-08-07 13:07:12 -07:00
Hao Jia 0437719c1a cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
The member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.

Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.

So we add the subtree_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF/drgn, etc.
And we are still trying to determine how to expose it in the
cgroupfs interface.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:41:25 -10:00
Yang Yingliang 9680540c0c workqueue: use LIST_HEAD to initialize cull_list
Use LIST_HEAD() to initialize cull_list instead of open-coding it.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:36:51 -10:00
Miaohe Lin e7e64a1bff cgroup: clean up if condition in cgroup_pidlist_start()
There's no need to use '<=' when knowing 'l->list[mid] != pid' already.
No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:30:06 -10:00
Wenyu Liu 55e2b69649 kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments
kexec_mutex is replaced by an atomic variable
in 05c6257433 (panic, kexec: make __crash_kexec() NMI safe).

But there are still two comments that referenced kexec_mutex,
replace them by kexec_lock.

Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2023-08-07 09:55:42 -04:00
Vlastimil Babka df2f7cde73 PM: hibernate: fix resume_store() return value when hibernation not available
On a laptop with hibernation set up but not actively used, and with
secure boot and lockdown enabled kernel, 6.5-rc1 gets stuck on boot with
the following repeated messages:

  A start job is running for Resume from hibernation using device /dev/system/swap (24s / no limit)
  lockdown_is_locked_down: 25311154 callbacks suppressed
  Lockdown: systemd-hiberna: hibernation is restricted; see man kernel_lockdown.7
  ...

Checking the resume code leads to commit cc89c63e2f ("PM: hibernate:
move finding the resume device out of software_resume") which
inadvertently changed the return value from resume_store() to 0 when
!hibernation_available(). This apparently translates to userspace
write() returning 0 as in number of bytes written, and userspace looping
indefinitely in the attempt to write the intended value.

Fix this by returning the full number of bytes that were to be written,
as that's what was done before the commit.

Fixes: cc89c63e2f ("PM: hibernate: move finding the resume device out of software_resume")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-07 11:41:11 +02:00
Yang Yingliang 1e8e2efb34 bpf: change bpf_alu_sign_string and bpf_movsx_string to static
The bpf_alu_sign_string and bpf_movsx_string introduced in commit
f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
are only used in disasm.c now, change them to static.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308050615.wxAn1v2J-lkp@intel.com/
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803023128.3753323-1-yangyingliang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 16:15:50 -07:00
Kui-Feng Lee 5426700e68 bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
Verify if the pointer obtained from bpf_xdp_pointer() is either an error or
NULL before returning it.

The function bpf_dynptr_slice() mistakenly returned an ERR_PTR. Instead of
solely checking for NULL, it should also verify if the pointer returned by
bpf_xdp_pointer() is an error or NULL.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/d1360219-85c3-4a03-9449-253ea905f9d1@moroto.mountain/
Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803231206.1060485-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 14:53:15 -07:00
Daniel Borkmann d210f9735e bpf: Fix mprog detachment for empty mprog entry
syzbot reported an UBSAN array-index-out-of-bounds access in bpf_mprog_read()
upon bpf_mprog_detach(). While it did not have a reproducer, I was able to
manually reproduce through an empty mprog entry which just has miniq present.

The latter is important given otherwise we get an ENOENT error as tcx detaches
the whole mprog entry. The index 4294967295 was triggered via NULL dtuple.prog
which then attempts to detach from the back. bpf_mprog_fetch() in this case
did hit the idx == total and therefore tried to grab the entry at idx -1.

Fix it by adding an explicit bpf_mprog_total() check in bpf_mprog_detach() and
bail out early with ENOENT.

Fixes: 053c8e1f23 ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+0c06ba0f831fe07a8f27@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230804131112.11012-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 09:35:39 -07:00
Li kunyu 5964d1e459 bpf: bpf_struct_ops: Remove unnecessary initial values of variables
err and tlinks is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230804175929.2867-1-kunyu@nfschina.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 17:54:35 -07:00
Miaohe Lin 7f828eacc4 cgroup: fix obsolete function name in cgroup_destroy_locked()
Since commit e76ecaeef6 ("cgroup: use cgroup_kn_lock_live() in other
cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs
methods. Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-03 14:13:33 -10:00
Jakub Kicinski d07b7b32da pull-request: bpf-next 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
 Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
 =/kVE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Jakub Kicinski 35b1b1fd96 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/dsa/port.c
  9945c1fb03 ("net: dsa: fix older DSA drivers using phylink")
  a88dd75384 ("net: dsa: remove legacy_pre_march2020 detection")
https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/

net/xdp/xsk.c
  3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
  b7f72a30e9 ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  37b61cda9c ("bnxt: don't handle XDP in netpoll")
  2b56b3d992 ("eth: bnxt: handle invalid Tx completions more gracefully")
https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c
  62da08331f ("net/mlx5e: Set proper IPsec source port in L4 selector")
  fbd517549c ("net/mlx5e: Add function to get IPsec offload namespace")

drivers/net/ethernet/sfc/selftest.c
  55c1528f9b ("sfc: fix field-spanning memcpy in selftest")
  ae9d445cd4 ("sfc: Miscellaneous comment removals")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 14:34:37 -07:00
Linus Torvalds 999f663186 Including fixes from bpf and wireless.
Nothing scary here. Feels like the first wave of regressions
 from v6.5 is addressed - one outstanding fix still to come
 in TLS for the sendpage rework.
 
 Current release - regressions:
 
  - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
 
  - dsa: fix older DSA drivers using phylink
 
 Previous releases - regressions:
 
  - gro: fix misuse of CB in udp socket lookup
 
  - mlx5: unregister devlink params in case interface is down
 
  - Revert "wifi: ath11k: Enable threaded NAPI"
 
 Previous releases - always broken:
 
  - sched: cls_u32: fix match key mis-addressing
 
  - sched: bind logic fixes for cls_fw, cls_u32 and cls_route
 
  - add bound checks to a number of places which hand-parse netlink
 
  - bpf: disable preemption in perf_event_output helpers code
 
  - qed: fix scheduling in a tasklet while getting stats
 
  - avoid using APIs which are not hardirq-safe in couple of drivers,
    when we may be in a hard IRQ (netconsole)
 
  - wifi: cfg80211: fix return value in scan logic, avoid page
    allocator warning
 
  - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY
    of MT7615D (DBDC)
 
 Misc:
 
  - drop handful of inactive maintainers, put some new in place
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S
 Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje
 8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu
 gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH
 ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6
 ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH
 TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+
 cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK
 QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ
 apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC
 5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip
 ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y=
 =HYkl
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and wireless.

  Nothing scary here. Feels like the first wave of regressions from v6.5
  is addressed - one outstanding fix still to come in TLS for the
  sendpage rework.

  Current release - regressions:

   - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES

   - dsa: fix older DSA drivers using phylink

  Previous releases - regressions:

   - gro: fix misuse of CB in udp socket lookup

   - mlx5: unregister devlink params in case interface is down

   - Revert "wifi: ath11k: Enable threaded NAPI"

  Previous releases - always broken:

   - sched: cls_u32: fix match key mis-addressing

   - sched: bind logic fixes for cls_fw, cls_u32 and cls_route

   - add bound checks to a number of places which hand-parse netlink

   - bpf: disable preemption in perf_event_output helpers code

   - qed: fix scheduling in a tasklet while getting stats

   - avoid using APIs which are not hardirq-safe in couple of drivers,
     when we may be in a hard IRQ (netconsole)

   - wifi: cfg80211: fix return value in scan logic, avoid page
     allocator warning

   - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
     (DBDC)

  Misc:

   - drop handful of inactive maintainers, put some new in place"

* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
  MAINTAINERS: update TUN/TAP maintainers
  test/vsock: remove vsock_perf executable on `make clean`
  tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
  tcp_metrics: annotate data-races around tm->tcpm_net
  tcp_metrics: annotate data-races around tm->tcpm_vals[]
  tcp_metrics: annotate data-races around tm->tcpm_lock
  tcp_metrics: annotate data-races around tm->tcpm_stamp
  tcp_metrics: fix addr_same() helper
  prestera: fix fallback to previous version on same major version
  udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
  net/mlx5e: Set proper IPsec source port in L4 selector
  net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
  net/mlx5: fs_core: Make find_closest_ft more generic
  wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
  vxlan: Fix nexthop hash size
  ip6mr: Fix skb_under_panic in ip6mr_cache_report()
  s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
  net: tap_open(): set sk_uid from current_fsuid()
  net: tun_chr_open(): set sk_uid from current_fsuid()
  net: dcb: choose correct policy to parse DCB_ATTR_BCN
  ...
2023-08-03 14:00:02 -07:00
James Morse 2abcc4b5a6 module: Expose module_init_layout_section()
module_init_layout_section() choses whether the core module loader
considers a section as init or not. This affects the placement of the
exit section when module unloading is disabled. This code will never run,
so it can be free()d once the module has been initialised.

arm and arm64 need to count the number of PLTs they need before applying
relocations based on the section name. The init PLTs are stored separately
so they can be free()d. arm and arm64 both use within_module_init() to
decide which list of PLTs to use when applying the relocation.

Because within_module_init()'s behaviour changes when module unloading
is disabled, both architecture would need to take this into account when
counting the PLTs.

Today neither architecture does this, meaning when module unloading is
disabled there are insufficient PLTs in the init section to load some
modules, resulting in warnings:
| WARNING: CPU: 2 PID: 51 at arch/arm64/kernel/module-plts.c:99 module_emit_plt_entry+0x184/0x1cc
| Modules linked in: crct10dif_common
| CPU: 2 PID: 51 Comm: modprobe Not tainted 6.5.0-rc4-yocto-standard-dirty #15208
| Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
| pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : module_emit_plt_entry+0x184/0x1cc
| lr : module_emit_plt_entry+0x94/0x1cc
| sp : ffffffc0803bba60
[...]
| Call trace:
|  module_emit_plt_entry+0x184/0x1cc
|  apply_relocate_add+0x2bc/0x8e4
|  load_module+0xe34/0x1bd4
|  init_module_from_file+0x84/0xc0
|  __arm64_sys_finit_module+0x1b8/0x27c
|  invoke_syscall.constprop.0+0x5c/0x104
|  do_el0_svc+0x58/0x160
|  el0_svc+0x38/0x110
|  el0t_64_sync_handler+0xc0/0xc4
|  el0t_64_sync+0x190/0x194

Instead of duplicating module_init_layout_section()s logic, expose it.

Reported-by: Adam Johnston <adam.johnston@arm.com>
Fixes: 055f23b74b ("module: check for exit sections in layout_sections() instead of module_init_section()")
Cc: stable@vger.kernel.org
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-03 13:42:02 -07:00
Jakub Kicinski 3932f22723 pull-request: bpf 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW
 48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf
 61d6hoiXj/sIibgMQT/ihODgeJ4wfQE=
 =u7qn
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Martin KaFai Lau says:

====================
pull-request: bpf 2023-08-03

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).

The main changes are:

1) Disable preemption in perf_event_output helpers code,
   from Jiri Olsa

2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
   from Lin Ma

3) Multiple warning splat fixes in cpumap from Hou Tao

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, cpumap: Handle skb as well when clean up ptr_ring
  bpf, cpumap: Make sure kthread is running before map update returns
  bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
  bpf: Disable preemption in bpf_event_output
  bpf: Disable preemption in bpf_perf_event_output
====================

Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 11:22:53 -07:00
Jakub Kicinski 680ee0456a net: invert the netdevice.h vs xdp.h dependency
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig builds size when
xdp.h is touched from 5947 to 662 objects.

Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.

The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since its embedded in struct netdevice.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Arnd Bergmann 8874a414f8 x86/qspinlock-paravirt: Fix missing-prototype warning
__pv_queued_spin_unlock_slowpath() is defined in a header file as
a global function, and designed to be called from inline asm, but
there is no prototype visible in the definition:

  kernel/locking/qspinlock_paravirt.h:493:1: error: no previous \
    prototype for '__pv_queued_spin_unlock_slowpath' [-Werror=missing-prototypes]

Add this to the x86 header that contains the inline asm calling it,
and ensure this gets included before the definition, rather than
after it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230803082619.1369127-8-arnd@kernel.org
2023-08-03 17:15:05 +02:00
Rick Edgecombe c35559f94e x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
setup its own shadow stacks using the WRSS instruction. Towards this
provide a flag so that stacks can be optionally setup securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Arnd Bergmann 6a5a148aaf bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.

Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:

kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]

On 32-bit architectures, the local prototype

u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)

passes arguments in other registers as the one in bpf_trace.c

BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
            const void *, unsafe_ptr)

which uses 64-bit arguments in pairs of registers.

As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-02 14:14:17 -07:00
Miaohe Lin a2c15fece4 cgroup: fix obsolete function name above css_free_rwork_fn()
Since commit 8f36aaec9c ("cgroup: Use rcu_work instead of explicit rcu
and work item"), css_free_work_fn has been renamed to css_free_rwork_fn.
Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:37:59 -10:00
Cai Xinchen 05f76ae95e cgroup/cpuset: fix kernel-doc
Add kernel-doc of param @rotor to fix warnings:

kernel/cgroup/cpuset.c:4162: warning: Function parameter or member
'rotor' not described in 'cpuset_spread_node'
kernel/cgroup/cpuset.c:3771: warning: Function parameter or member
'work' not described in 'cpuset_hotplug_workfn'

Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:37:03 -10:00
Kamalesh Babulal 55a5956a55 cgroup: clean up printk()
Convert the only printk() to use pr_*() helper. No functional change.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:36:24 -10:00
Christoph Hellwig 9011e49d54 modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
It has recently come to my attention that nvidia is circumventing the
protection added in 262e6ae708 ("modules: inherit
TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
modules into an allegedly GPL licensed module and then rexporting them.

Given that symbol_get was only ever intended for tightly cooperating
modules using very internal symbols it is logical to restrict it to
being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
Circumvention of Access Controls law suites.

All symbols except for four used through symbol_get were already exported
as EXPORT_SYMBOL_GPL, and the remaining four ones were switched over in
the preparation patches.

Fixes: 262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-02 11:18:22 -07:00
Phil Auld 88c56cfeae sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together.  Tasks
can easily run well past their quotas before a remote tick does
accounting.  This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.

Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.

Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.

Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Phil Auld c98c18270b sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Yauheni Kaliuta d3c4db86c7 tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf tracepoint program uses struct trace_event_raw_sys_enter as
argument where trace_entry is the first field. Use the same instead
of unsigned long long since if it's amended (for example by RT
patch) it accesses data with wrong offset.

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801075222.7717-1-ykaliuta@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-01 10:53:28 -07:00
Petr Tesarik 1395706a14 swiotlb: search the software IO TLB only if the device makes use of it
Skip searching the software IO TLB if a device has never used it, making
sure these devices are not affected by the introduction of multiple IO TLB
memory pools.

Additional memory barrier is required to ensure that the new value of the
flag is visible to other CPUs after mapping a new bounce buffer. For
efficiency, the flag check should be inlined, and then the memory barrier
must be moved to is_swiotlb_buffer(). However, it can replace the existing
barrier in swiotlb_find_pool(), because all callers use is_swiotlb_buffer()
first to verify that the buffer address belongs to the software IO TLB.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:32 +02:00
Petr Tesarik 1aaa736815 swiotlb: allocate a new memory pool when existing pools are full
When swiotlb_find_slots() cannot find suitable slots, schedule the
allocation of a new memory pool. It is not possible to allocate the pool
immediately, because this code may run in interrupt context, which is not
suitable for large memory allocations. This means that the memory pool will
be available too late for the currently requested mapping, but the stress
on the software IO TLB allocator is likely to continue, and subsequent
allocations will benefit from the additional pool eventually.

Keep all memory pools for an allocator in an RCU list to avoid locking on
the read side. For modifications, add a new spinlock to struct io_tlb_mem.

The spinlock also protects updates to the total number of slabs (nslabs in
struct io_tlb_mem), but not reads of the value. Readers may therefore
encounter a stale value, but this is not an issue:

- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
  value. This is ensured by the existence of the default memory pool,
  allocated at boot.

- The exact value is used only for non-critical purposes (debugfs, kernel
  messages).

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:27 +02:00
Petr Tesarik ad96ce3252 swiotlb: determine potential physical address limit
The value returned by default_swiotlb_limit() should be constant, because
it is used to decide whether DMA can be used. To allow allocating memory
pools on the fly, use the maximum possible physical address rather than the
highest address used by the default pool.

For swiotlb_init_remap(), this is either an arch-specific limit used by
memblock_alloc_low(), or the highest directly mapped physical address if
the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the
highest address is determined by the GFP flags.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:24 +02:00
Petr Tesarik 79636caad3 swiotlb: if swiotlb is full, fall back to a transient memory pool
Try to allocate a transient memory pool if no suitable slots can be found
and the respective SWIOTLB is allowed to grow. The transient pool is just
enough big for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.

Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.

Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after a RCU quiescent state.

Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_alloc_pool() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.

The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.

Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:20 +02:00
Petr Tesarik 62708b2ba4 swiotlb: add a flag whether SWIOTLB is allowed to grow
Add a config option (CONFIG_SWIOTLB_DYNAMIC) to enable or disable dynamic
allocation of additional bounce buffers.

If this option is set, mark the default SWIOTLB as able to grow and
restricted DMA pools as unable.

However, if the address of the default memory pool is explicitly queried,
make the default SWIOTLB also unable to grow. This is currently used to set
up PCI BAR movable regions on some Octeon MIPS boards which may not be able
to use a SWIOTLB pool elsewhere in physical memory. See octeon_pci_setup()
for more details.

If a remap function is specified, it must be also called on any dynamically
allocated pools, but there are some issues:

- The remap function may block, so it should not be called from an atomic
  context.
- There is no corresponding unremap() function if the memory pool is
  freed.
- The only in-tree implementation (xen_swiotlb_fixup) requires that the
  number of slots in the memory pool is a multiple of SWIOTLB_SEGSIZE.

Keep it simple for now and disable growing the SWIOTLB if a remap function
was specified.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:17 +02:00
Petr Tesarik 158dbe9c9a swiotlb: separate memory pool data from other allocator data
Carve out memory pool specific fields from struct io_tlb_mem. The original
struct now contains shared data for the whole allocator, while the new
struct io_tlb_pool contains data that is specific to one memory pool of
(potentially) many.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:14 +02:00
Petr Tesarik fea18777a7 swiotlb: add documentation and rename swiotlb_do_find_slots()
Add some kernel-doc comments and move the existing documentation of struct
io_tlb_slot to its correct location. The latter was forgotten in commit
942a8186eb ("swiotlb: move struct io_tlb_slot to swiotlb.c").

Use the opportunity to give swiotlb_do_find_slots() a more descriptive name
and make it clear how it differs from swiotlb_find_slots().

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:12 +02:00
Petr Tesarik 05ee774122 swiotlb: make io_tlb_default_mem local to swiotlb.c
SWIOTLB implementation details should not be exposed to the rest of the
kernel. This will allow to make changes to the implementation without
modifying non-swiotlb code.

To avoid breaking existing users, provide helper functions for the few
required fields.

As a bonus, using a helper function to initialize struct device allows to
get rid of an #ifdef in driver core.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:09 +02:00
Petr Tesarik 0c6874a6ac swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated
If swiotlb is allocated, immediately return 0, so callers do not have to
check io_tlb_default_mem.nslabs explicitly.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:02 +02:00
Hou Tao 1ea66e89f6 bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(),
so just remove dtab from bpf_dtab_netdev.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Hou Tao 2d20bfc315 bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
Since commit cdfafe98ca ("xdp: Make cpumap flush_list common for all
map instances"), cmap is no longer used, so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Yonghong Song e99688eba2 bpf: Fix an array-index-out-of-bounds issue in disasm.c
syzbot reported an array-index-out-of-bounds when printing out bpf
insns. Further investigation shows the insn is illegal but
is printed out due to log level 1 or 2 before actual insn verification
in do_check().

This particular illegal insn is a MOVSX insn with offset value 2.
The legal offset value for MOVSX should be 8, 16 and 32.
The disasm sign-extension-size array index is calculated as
 (insn->off / 8) - 1
and offset value 2 gives an out-of-bound index -1.

Tighten the checking for MOVSX insn in disasm.c to avoid
array-index-out-of-bounds issue.

Reported-by: syzbot+3758842a6c01012aa73b@syzkaller.appspotmail.com
Fixes: f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230731204534.1975311-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31 17:35:02 -07:00
Paul E. McKenney cb88f7f51b rcu-tasks: Permit use of debug-objects with RCU Tasks flavors
Currently, cblist_init_generic() holds a raw spinlock when invoking
INIT_WORK().  This fails in kernels built with CONFIG_DEBUG_OBJECTS=y
due to memory allocation being forbidden while holding a raw spinlock.
But the only reason for holding the raw spinlock is to synchronize
with early boot calls to call_rcu_tasks(), call_rcu_tasks_rude, and,
last but not least, call_rcu_tasks_trace().  These calls also invoke
cblist_init_generic() in order to support early boot queueing of
callbacks.

Except that there are no early boot calls to either of these three
functions, and the BPF guys confirm that they have no plans to add any
such calls.

This commit therefore removes the synchronization and adds a
WARN_ON_ONCE() to catch the case of now-prohibited early boot RCU Tasks
callback queueing.

If early boot queueing is needed, an "initialized" flag may be added to
the rcu_tasks structure.  Then queueing a callback before this flag is set
would initialize the callback list (if needed) and queue the callback.
The decision as to where to queue the callback given the possibility of
non-zero boot CPUs is left as an exercise for the reader.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-31 16:06:57 -07:00
Hou Tao 7c62b75cd1 bpf, cpumap: Handle skb as well when clean up ptr_ring
The following warning was reported when running xdp_redirect_cpu with
both skb-mode and stress-mode enabled:

  ------------[ cut here ]------------
  Incorrect XDP memory type (-2128176192) usage
  WARNING: CPU: 7 PID: 1442 at net/core/xdp.c:405
  Modules linked in:
  CPU: 7 PID: 1442 Comm: kworker/7:0 Tainted: G  6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events __cpu_map_entry_free
  RIP: 0010:__xdp_return+0x1e4/0x4a0
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ? __xdp_return+0x1e4/0x4a0
   ......
   xdp_return_frame+0x4d/0x150
   __cpu_map_entry_free+0xf9/0x230
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The reason for the warning is twofold. One is due to the kthread
cpu_map_kthread_run() is stopped prematurely. Another one is
__cpu_map_ring_cleanup() doesn't handle skb mode and treats skbs in
ptr_ring as XDP frames.

Prematurely-stopped kthread will be fixed by the preceding patch and
ptr_ring will be empty when __cpu_map_ring_cleanup() is called. But
as the comments in __cpu_map_ring_cleanup() said, handling and freeing
skbs in ptr_ring as well to "catch any broken behaviour gracefully".

Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Hou Tao 640a604585 bpf, cpumap: Make sure kthread is running before map update returns
The following warning was reported when running stress-mode enabled
xdp_redirect_cpu with some RT threads:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 65 at kernel/bpf/cpumap.c:135
  CPU: 4 PID: 65 Comm: kworker/4:1 Not tainted 6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events cpu_map_kthread_stop
  RIP: 0010:put_cpu_map_entry+0xda/0x220
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ......
   ? put_cpu_map_entry+0xda/0x220
   cpu_map_kthread_stop+0x41/0x60
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The root cause is the same as commit 4369016497 ("bpf: cpumap: Fix memory
leak in cpu_map_update_elem"). The kthread is stopped prematurely by
kthread_stop() in cpu_map_kthread_stop(), and kthread() doesn't call
cpu_map_kthread_run() at all but XDP program has already queued some
frames or skbs into ptr_ring. So when __cpu_map_ring_cleanup() checks
the ptr_ring, it will find it was not emptied and report a warning.

An alternative fix is to use __cpu_map_ring_cleanup() to drop these
pending frames or skbs when kthread_stop() returns -EINTR, but it may
confuse the user, because these frames or skbs have been handled
correctly by XDP program. So instead of dropping these frames or skbs,
just make sure the per-cpu kthread is running before
__cpu_map_entry_alloc() returns.

After apply the fix, the error handle for kthread_stop() will be
unnecessary because it will always return 0, so just remove it.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Martin KaFai Lau 079082c60a tcx: Fix splat during dev unregister
During unregister_netdevice_many_notify(), the ordering of our concerned
function calls is like this:

  unregister_netdevice_many_notify
    dev_shutdown
	qdisc_put
            clsact_destroy
    tcx_uninstall

The syzbot reproducer triggered a case that the qdisc refcnt is not
zero during dev_shutdown().

tcx_uninstall() will then WARN_ON_ONCE(tcx_entry(entry)->miniq_active)
because the miniq is still active and the entry should not be freed.
The latter assumed that qdisc destruction happens before tcx teardown.

This fix is to avoid tcx_uninstall() doing tcx_entry_free() when the
miniq is still alive and let the clsact_destroy() do the free later, so
that we do not assume any specific ordering for either of them.

If still active, tcx_uninstall() does clear the entry when flushing out
the prog/link. clsact_destroy() will then notice the "!tcx_entry_is_active()"
and then does the tcx_entry_free() eventually.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/222255fe07cb58f15ee662e7ee78328af5b438e4.1690549248.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 14:44:02 -07:00
Ajay Kaher 27152bceea eventfs: Move tracing/events to eventfs
Up until now, /sys/kernel/tracing/events was no different than any other
part of tracefs. The files and directories within the events directory was
created when the tracefs was mounted, and also created for the instances in
/sys/kernel/tracing/instances/<instance>/events. Most of these files and
directories will never be referenced. Since there are thousands of these
files and directories they spend their time wasting precious memory
resources.

Move the "events" directory to the new eventfs. The eventfs will take the
meta data of the events that they represent and store that. When the files
in the events directory are referenced, the dentry and inodes to represent
them are then created. When the files are no longer referenced, they are
freed. This saves the precious memory resources that were wasted on these
seldom referenced dentries and inodes.

Running the following:

 ~# cat /proc/meminfo /proc/slabinfo  > before.out
 ~# mkdir /sys/kernel/tracing/instances/foo
 ~# cat /proc/meminfo /proc/slabinfo  > after.out

to test the changes produces the following deltas:

Before this change:

 Before after deltas for meminfo:

   MemFree: -32260
   MemAvailable: -21496
   KReclaimable: 21528
   Slab: 22440
   SReclaimable: 21528
   SUnreclaim: 912
   VmallocUsed: 16

 Before after deltas for slabinfo:

   <slab>:		<objects>	[ * <size> = <total>]

   tracefs_inode_cache:	14472		[* 1184 = 17134848]
   buffer_head:		24		[* 168 = 4032]
   hmem_inode_cache:	28		[* 1480 = 41440]
   dentry:		14450		[* 312 = 4508400]
   lsm_inode_cache:	14453		[* 32 = 462496]
   vma_lock:		11		[* 152 = 1672]
   vm_area_struct:	2		[* 184 = 368]
   trace_event_file:	1748		[* 88 = 153824]
   kmalloc-256:		1072		[* 256 = 274432]
   kmalloc-64:		2842		[* 64 = 181888]

 Total slab additions in size: 22,763,400 bytes

With this change:

 Before after deltas for meminfo:

   MemFree: -12600
   MemAvailable: -12580
   Cached: 24
   Active: 12
   Inactive: 68
   Inactive(anon): 48
   Active(file): 12
   Inactive(file): 20
   Dirty: -4
   AnonPages: 68
   KReclaimable: 12
   Slab: 1856
   SReclaimable: 12
   SUnreclaim: 1844
   KernelStack: 16
   PageTables: 36
   VmallocUsed: 16

 Before after deltas for slabinfo:

   <slab>:		<objects>	[ * <size> = <total>]

   tracefs_inode_cache:	108		[* 1184 = 127872]
   buffer_head:		24		[* 168 = 4032]
   hmem_inode_cache:	18		[* 1480 = 26640]
   dentry:		127		[* 312 = 39624]
   lsm_inode_cache:	152		[* 32 = 4864]
   vma_lock:		67		[* 152 = 10184]
   vm_area_struct:	-12		[* 184 = -2208]
   trace_event_file: 	1764		[* 96 = 169344]
   kmalloc-96:		14322		[* 96 = 1374912]
   kmalloc-64:		2814		[* 64 = 180096]
   kmalloc-32:		1103		[* 32 = 35296]
   kmalloc-16:		2308		[* 16 = 36928]
   kmalloc-8:		12800		[* 8 = 102400]

 Total slab additions in size: 2,109,984 bytes

Which is a savings of 20,653,416 bytes (20 MB) per tracing instance.

Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-31 11:55:55 -04:00
Binglei Wang 3fa6456ebe dma-contiguous: check for memory region overlap
In the process of parsing the DTS, check whether the memory region
specified by the DTS CMA node area overlaps with the kernel text
memory space reserved by memblock before calling
early_init_fdt_scan_reserved_mem.

Signed-off-by: Binglei Wang <l3b2w1@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:56 +02:00
Yajun Deng bf29bfaa54 dma-contiguous: support numa CMA for specified node
The kernel parameter 'cma_pernuma=' only supports reserving the same
size of CMA area for each node. We need to reserve different sizes of
CMA area for specified nodes if these devices belong to different nodes.

Adding another kernel parameter 'numa_cma=' to reserve CMA area for
the specified node. If we want to use one of these parameters, we need to
enable DMA_NUMA_CMA.

At the same time, print the node id in cma_declare_contiguous_nid() if
CONFIG_NUMA is enabled.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:29 +02:00
Yajun Deng 22e4a348f8 dma-contiguous: support per-numa CMA for all architectures
In the commit b7176c261c ("dma-contiguous: provide the ability to
reserve per-numa CMA"), Barry adds DMA_PERNUMA_CMA for ARM64.

But this feature is architecture independent, so support per-numa CMA
for all architectures, and enable it by default if NUMA.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:28 +02:00
Arnd Bergmann 3d6f126b15 dma-mapping: move arch_dma_set_mask() declaration to header
This function has a __weak definition and an override that is only used on
freescale powerpc chips. The powerpc definition however does not see the
declaration that is in a .c file:

arch/powerpc/kernel/dma-mask.c:7:6: error: no previous prototype for 'arch_dma_set_mask' [-Werror=missing-prototypes]

Move it into the linux/dma-map-ops.h header where the other arch_dma_* functions
are declared.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:28 +02:00
Christoph Hellwig 42e584a985 swiotlb: unexport is_swiotlb_active
Drivers have no business looking at dma-mapping or swiotlb internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
2023-07-31 17:54:27 +02:00
Zhang Rui 52b38b7ad5 cpu/SMT: Fix cpu_smt_possible() comment
Commit e1572f1d08 ("cpu/SMT: create and export cpu_smt_possible()")
introduces cpu_smt_possible() to represent if SMT is theoretically
possible. It returns true when SMT is supported and not forcefully
disabled ('nosmt=force'). But the comment of it says "Returns true if
SMT is not supported of forcefully (irreversibly) disabled", which is
wrong. Fix that comment accordingly.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230728155313.44170-1-rui.zhang@intel.com
2023-07-31 17:32:44 +02:00
YueHaibing 51a5acce71 genirq: Remove unused extern declaration
commit 3795de236d ("genirq: Distangle kernel/irq/handle.c")
left behind this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230720143625.29176-1-yuehaibing@huawei.com
2023-07-31 17:27:16 +02:00
Vincent Whitchurch e2c12739cc genirq: Prevent nested thread vs synchronize_hardirq() deadlock
There is a possibility of deadlock if synchronize_hardirq() is called
when the nested threaded interrupt is active.  The following scenario
was observed on a uniprocessor PREEMPT_NONE system:

 Thread 1                      Thread 2

 handle_nested_thread()
  Set INPROGRESS
  Call ->thread_fn()
   thread_fn goes to sleep

                              free_irq()
                               __synchronize_hardirq()
                                Busy-loop forever waiting for INPROGRESS
                                to be cleared

The INPROGRESS flag is only supposed to be used for hard interrupt
handlers.  Remove the incorrect usage in the nested threaded interrupt
case and instead re-use the threads_active / wait_for_threads mechanism
to wait for nested threaded interrupts to complete.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613-genirq-nested-v3-1-ae58221143eb@axis.com
2023-07-31 17:24:22 +02:00
Peter Zijlstra 22dc02f81c Revert "sched/fair: Move unused stub functions to header"
Revert commit 7aa55f2a59 ("sched/fair: Move unused stub functions to
header"), for while it has the right Changelog, the actual patch
content a revert of the previous 4 patches:

  f7df852ad6 ("sched: Make task_vruntime_update() prototype visible")
  c0bdfd72fb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
  378be384e0 ("sched: Add schedule_user() declaration")
  d55ebae3f3 ("sched: Hide unused sched_update_scaling()")

So in effect this is a revert of a revert and re-applies those
patches.

Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-31 11:47:08 +02:00
Greg Kroah-Hartman fe3015748a Merge 6.5-rc4 into tty-next
We need the serial/tty fixes in here as well for testing and future
development.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-31 09:39:56 +02:00
Steven Rostedt (Google) ee41106a12 tracing: Require all trace events to have a TRACE_SYSTEM
The creation of the trace event directory requires that a TRACE_SYSTEM is
defined that the trace event directory is added within the system it was
defined in.

The code handled the case where a TRACE_SYSTEM was not added, and would
then add the event at the events directory. But nothing should be doing
this. This code also prevents the implementation of creating dynamic
dentrys for the eventfs system.

As this path has never been hit on correct code, remove it. If it does get
hit, issues a WARN_ON_ONCE() and return ENODEV.

Link: https://lkml.kernel.org/r/1690568452-46553-2-git-send-email-akaher@vmware.com

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:13:33 -04:00
Zheng Yejian 6d98a0f2ac tracing: Set actual size after ring buffer resize
Currently we can resize trace ringbuffer by writing a value into file
'buffer_size_kb', then by reading the file, we get the value that is
usually what we wrote. However, this value may be not actual size of
trace ring buffer because of the round up when doing resize in kernel,
and the actual size would be more useful.

Link: https://lore.kernel.org/linux-trace-kernel/20230705002705.576633-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) 6bba92881d tracing: Add free_trace_iter_content() helper function
As the trace iterator is created and used by various interfaces, the clean
up of it needs to be consistent. Create a free_trace_iter_content() helper
function that frees the content of the iterator and use that to clean it
up in all places that it is used.

Link: https://lkml.kernel.org/r/20230715141348.341887497@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) 9182b519b8 tracing: Remove unnecessary copying of tr->current_trace
The iterator allocated a descriptor to copy the current_trace. This was done
with the assumption that the function pointers might change. But this was a
false assuption, as it does not change. There's no reason to make a copy of the
current_trace and just use the pointer it points to. This removes needing to
manage freeing the descriptor. Worse yet, there's locations that the iterator
is used but does make a copy and just uses the pointer. This could cause the
actual pointer to the trace descriptor to be freed and not the allocated copy.

This is more of a clean up than a fix.

Link: https://lkml.kernel.org/r/20230715141348.135792275@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Uros Bizjak 00a8478f8f ring_buffer: Use try_cmpxchg instead of cmpxchg
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) e7186af7fb tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure
For backward compatibility, older tooling expects to see the kernel_stack
event with a "caller" field that is a fixed size array of 8 addresses. The
code now supports more than 8 with an added "size" field that states the
real number of entries. But the "caller" field still just looks like a
fixed size to user space.

Since the tracing macros that create the user space format files also
creates the structures that those files represent, the kernel_stack event
structure had its "caller" field a fixed size of 8, but in reality, when
it is allocated on the ring buffer, it can hold more if the stack trace is
bigger that 8 functions. The copying of these entries was simply done with
a memcpy():

  size = nr_entries * sizeof(unsigned long);
  memcpy(entry->caller, fstack->calls, size);

The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was
larger than 8, that the memcpy() was writing more than what the structure
stated it can hold and it complained about it. This is because the
FORTIFY_SOURCE code is unaware that the amount allocated is actually
enough to hold the size. It does not expect that a fixed size field will
hold more than the fixed size.

This was originally solved by hiding the caller assignment with some
pointer arithmetic.

  ptr = ring_buffer_data();
  entry = ptr;

  ptr += offsetof(typeof(*entry), caller);
  memcpy(ptr, fstack->calls, size);

But it is considered bad form to hide from kernel hardening. Instead, make
it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro
that is specific for this one special use case. The macro will take 4
arguments: type, item, len, field (whereas the __array() macro takes just
the first three). This macro will act just like the __array() macro when
creating the code to deal with the format file that is exposed to user
space. But for the kernel, it will turn the caller field into:

  type item[] __counted_by(field);

or for this instance:

  unsigned long caller[] __counted_by(size);

Now the kernel code can expose the assignment of the caller to the
FORTIFY_SOURCE and everyone is happy!

Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-30 18:11:44 -04:00
Linus Torvalds b0b9850e7d Probe fixes for 6.5-rc3:
- probe-events: Fix to add NULL check for some BTF API calls which can
   return error code and NULL.
 
 - ftrace selftests: Fix to check fprobe and kprobe event correctly. This
   fixes a miss condition of the test command.
 
 - kprobes: Prohibit probing on the function which starts from "__cfi_"
   and "__pfx_" since those are auto generated for kernel CFI and not
   executed.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTGdH4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bmMAH/0qTHII0KYQDvrNJ40tT
 SDM8+4zOJEtnjVYq87+4EWBhpVEL3VbLRJaprjXh40lZJrCP3MglCF152p4bOhgb
 ZrjWuTAgE0N+rBhdeUJlzy3iLzl0G9dzfA+sn1XMcW+/HSPstJcjAG6wD7ROeZzL
 XCxzE+NY6Y6mYbB52DaS8Hv7g7WccaTV+KeRjokhMPt+u7/KItJ4hQb/RXtAL31S
 n4thCeVllaPBuc7m2CmKwJ9jzOg7/0qpAIUGx1Z+Khy/3YfRhG1nT93GxP8hLmad
 SH9kGps09WXF5f8FbjYglOmq7ioDbIUz3oXPQRZYPymV8A0EU+b+/8IsRog1ySd1
 BVk=
 =qKWS
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe fixes from Masami Hiramatsu:

 - probe-events: add NULL check for some BTF API calls which can return
   error code and NULL.

 - ftrace selftests: check fprobe and kprobe event correctly. This fixes
   a miss condition of the test command.

 - kprobes: do not allow probing functions that start with "__cfi_" or
   "__pfx_" since those are auto generated for kernel CFI and not
   executed.

* tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Prohibit probing on CFI preamble symbol
  selftests/ftrace: Fix to check fprobe event eneblement
  tracing/probes: Fix to add NULL check for BTF APIs
2023-07-30 11:27:22 -07:00
Linus Torvalds c959e90094 - Fix a rtmutex race condition resulting from sharing of the sort key
between the lock waiters and the PI chain tree (->pi_waiters) of
   a task by giving each tree their own sort key
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTGMGAACgkQEsHwGGHe
 VUrqNw/9F7p+/eDXP9TURup3Ms87lOlqCJg2CClTRiPF/FZcVzVbmltWewLDOWC6
 GBEa7l++lpKwIGCuzg99tDScutdlCw8mQe4knrFvrJc04zzPV3O4N0Tiw1eD8wIw
 /i6mzRaEuB5R8nxd9YK24gEsijerxhgVwKFvsCjbTV71LrY1geCJRqaqnooVnNm7
 kvy+UIuckFJL8GDsPBJNsZdTyBrs7lC+m4su5QHr7FkQmF+4iEbDkwlYqPzafhVL
 e4/FRjWHC5HRKd69j8fqyYZP8ryUlrmHFxkeAE08CxiqwdOELNHrp+i/x2GN9FTl
 wL8psVl9oVt2KT79Py0e40ZOoUWHDOugx2IGzWl13C0XFKY+QH2HR6C+gJdedPMQ
 tBtNyf8Ivukp+SSjyUXu8q9dGLJfdxtIKmbNHOuZDU121Fqg5+b04vQkzdgkMwJh
 OoYj0fDUMBa6Zfv1PQkkSRU2wMQlgGTjMiCEvUjAs3vGBFQcu+flAm2UOr8NDLaF
 0r3wNRVzk53k85R0L49oS8aU2FOcaHlpYXOevh0NYAt68EbmZctns85SJRfrwNGE
 QuykN8yDID/zjGsjSSF6L1IaQSjM1ieozPVgqP4RLIxqaM7n/BJfZqiLTn0gBrbc
 tEaoSIay6/zybKPCDeVgMsTSlX8EOoRKD/1Daog+zfYkUt19U/w=
 =97eo
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Fix a rtmutex race condition resulting from sharing of the sort key
   between the lock waiters and the PI chain tree (->pi_waiters) of a
   task by giving each tree their own sort key

* tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Fix task->pi_waiters integrity
2023-07-30 11:12:32 -07:00
Linus Torvalds b88e123cc0 Tracing fixes for 6.5:
- Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.
   If a resize shrinks the buffer it clears the read count to notify
   readers that they need to reset. But the read count is also used for
   accounting and this causes the numbers to be off. Instead, create a
   separate variable to use to notify readers to reset.
 
 - Fix the ref counts of the "soft disable" mode. The wrong value was
   used for testing if soft disable mode should be enabled or disable,
   but instead, just change the logic to do the enable and disable
   in place when the SOFT_MODE is set or cleared.
 
 - Several kernel-doc fixes
 
 - Removal of unused external declarations
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZMVd3xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgzwAQCBHA3gf30GChf0EZUdIVueA31/1n2Z
 2ZW1VeKUHQufpAEAuzbkeTdaj6bbpsT5T1Pf3zUIvpHs7kOYJWQq+75GBAI=
 =9o02
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.

   If a resize shrinks the buffer it clears the read count to notify
   readers that they need to reset. But the read count is also used for
   accounting and this causes the numbers to be off. Instead, create a
   separate variable to use to notify readers to reset.

 - Fix the ref counts of the "soft disable" mode. The wrong value was
   used for testing if soft disable mode should be enabled or disable,
   but instead, just change the logic to do the enable and disable in
   place when the SOFT_MODE is set or cleared.

 - Several kernel-doc fixes

 - Removal of unused external declarations

* tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix warning in trace_buffered_event_disable()
  ftrace: Remove unused extern declarations
  tracing: Fix kernel-doc warnings in trace_seq.c
  tracing: Fix kernel-doc warnings in trace_events_trigger.c
  tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c
  ring-buffer: Fix kernel-doc warnings in ring_buffer.c
  ring-buffer: Fix wrong stat of cpu_buffer->read
2023-07-29 20:40:43 -07:00
Masami Hiramatsu (Google) de02f2ac5d kprobes: Prohibit probing on CFI preamble symbol
Do not allow to probe on "__cfi_" or "__pfx_" started symbol, because those
are used for CFI and not executed. Probing it will break the CFI.

Link: https://lore.kernel.org/all/168904024679.116016.18089228029322008512.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-29 23:32:26 +09:00
Zheng Yejian dea499781a tracing: Fix warning in trace_buffered_event_disable()
Warning happened in trace_buffered_event_disable() at
  WARN_ON_ONCE(!trace_buffered_event_ref)

  Call Trace:
   ? __warn+0xa5/0x1b0
   ? trace_buffered_event_disable+0x189/0x1b0
   __ftrace_event_enable_disable+0x19e/0x3e0
   free_probe_data+0x3b/0xa0
   unregister_ftrace_function_probe_func+0x6b8/0x800
   event_enable_func+0x2f0/0x3d0
   ftrace_process_regex.isra.0+0x12d/0x1b0
   ftrace_filter_write+0xe6/0x140
   vfs_write+0x1c9/0x6f0
   [...]

The cause of the warning is in __ftrace_event_enable_disable(),
trace_buffered_event_enable() was called once while
trace_buffered_event_disable() was called twice.
Reproduction script show as below, for analysis, see the comments:
 ```
 #!/bin/bash

 cd /sys/kernel/tracing/

 # 1. Register a 'disable_event' command, then:
 #    1) SOFT_DISABLED_BIT was set;
 #    2) trace_buffered_event_enable() was called first time;
 echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \
     set_ftrace_filter

 # 2. Enable the event registered, then:
 #    1) SOFT_DISABLED_BIT was cleared;
 #    2) trace_buffered_event_disable() was called first time;
 echo 1 > events/initcall/initcall_finish/enable

 # 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was
 #    set again!!!
 cat /proc/cmdline

 # 4. Unregister the 'disable_event' command, then:
 #    1) SOFT_DISABLED_BIT was cleared again;
 #    2) trace_buffered_event_disable() was called second time!!!
 echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \
     set_ftrace_filter
 ```

To fix it, IIUC, we can change to call trace_buffered_event_enable() at
fist time soft-mode enabled, and call trace_buffered_event_disable() at
last time soft-mode disabled.

Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 20:29:51 -04:00
Gaosheng Cui 6c95d71bad tracing: Fix kernel-doc warnings in trace_seq.c
Fix kernel-doc warning:

kernel/trace/trace_seq.c:142: warning: Function parameter or member
'args' not described in 'trace_seq_vprintf'

Link: https://lkml.kernel.org/r/20230724140827.1023266-5-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:04 -04:00
Gaosheng Cui bd7217f30c tracing: Fix kernel-doc warnings in trace_events_trigger.c
Fix kernel-doc warnings:

kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'buffer' not described in 'event_triggers_call'
kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'event' not described in 'event_triggers_call'

Link: https://lkml.kernel.org/r/20230724140827.1023266-4-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Gaosheng Cui b32c789f7d tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c
Fix kernel-doc warning:

kernel/trace/trace_events_synth.c:1257: warning: Function parameter
or member 'mod' not described in 'synth_event_gen_cmd_array_start'

Link: https://lkml.kernel.org/r/20230724140827.1023266-3-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Gaosheng Cui 151e34d1c6 ring-buffer: Fix kernel-doc warnings in ring_buffer.c
Fix kernel-doc warnings:

kernel/trace/ring_buffer.c:954: warning: Function parameter or
member 'cpu' not described in 'ring_buffer_wake_waiters'
kernel/trace/ring_buffer.c:3383: warning: Excess function parameter
'event' description in 'ring_buffer_unlock_commit'
kernel/trace/ring_buffer.c:5359: warning: Excess function parameter
'cpu' description in 'ring_buffer_reset_online_cpus'

Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Zheng Yejian 2d093282b0 ring-buffer: Fix wrong stat of cpu_buffer->read
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set
to 0 in order to make sure any read iterators reset themselves. However,
this will mess 'entries' stating, see following steps:

  # cd /sys/kernel/tracing/
  # 1. Enlarge ring buffer prepare for later reducing:
  # echo 20 > per_cpu/cpu0/buffer_size_kb
  # 2. Write a log into ring buffer of cpu0:
  # taskset -c 0 echo "hello1" > trace_marker
  # 3. Read the log:
  # cat per_cpu/cpu0/trace_pipe
       <...>-332     [000] .....    62.406844: tracing_mark_write: hello1
  # 4. Stop reading and see the stats, now 0 entries, and 1 event readed:
  # cat per_cpu/cpu0/stats
   entries: 0
   [...]
   read events: 1
  # 5. Reduce the ring buffer
  # echo 7 > per_cpu/cpu0/buffer_size_kb
  # 6. Now entries became unexpected 1 because actually no entries!!!
  # cat per_cpu/cpu0/stats
   entries: 1
   [...]
   read events: 0

To fix it, introduce 'page_removed' field to count total removed pages
since last reset, then use it to let read iterators reset themselves
instead of changing the 'read' pointer.

Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:16:23 -04:00
YiFei Zhu d1a02358d4 bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like:
  test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
    Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.
where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:
  [   91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  [   91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G                 N 6.1.38-smp-DEV #7
  [   91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
  [   91.304721] Call Trace:
  [   91.304724]  <TASK>
  [   91.304730]  [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
  [   91.304737]  [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
  [   91.304738]  [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
  [   91.304741]  [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
  [   91.304743]  [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
  [   91.304746]  [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
  [   91.304747]  [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
  [   91.304750]  [<ffffffffa751d6e0>] map_create+0x140/0x2e0
  [   91.304752]  [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
  [   91.304753]  [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
  [   91.304754]  [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
  [   91.304756]  [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

This makes sense, because in atomic context, percpu allocation would
not create new chunks; it would only create in non-atomic contexts.
And if during prefill all precpu chunks are full, -ENOMEM would
happen immediately upon next unit_alloc.

Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM.

GFP_NOWAIT has to be used in unit_alloc when bpf program runs
in atomic context. Even if bpf program runs in non-atomic context,
in most cases, rcu read lock is enabled for the program so
GFP_NOWAIT is still needed. This is often also the case for
BPF_MAP_UPDATE_ELEM syscalls.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 09:41:10 -07:00
Yonghong Song 09fedc7318 bpf: Fix compilation warning with -Wparentheses
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:

  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5901:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s64_max >= 0 == s64_min >= 0) {
        ~~~~~~~~^~~~
  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5965:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s32_min >= 0 == s32_max >= 0) {
        ~~~~~~~~^~~~

To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.

I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 08:54:04 -07:00
Michael Ellerman 7f48405c3c cpu/SMT: Allow enabling partial SMT states via sysfs
Add support to the /sys/devices/system/cpu/smt/control interface for
enabling a specified number of SMT threads per core, including partial
SMT states where not all threads are brought online.

The current interface accepts "on" and "off", to enable either 1 or all
SMT threads per core.

This commit allows writing an integer, between 1 and the number of SMT
threads supported by the machine. Writing 1 is a synonym for "off", 2 or
more enables SMT with the specified number of threads.

When reading the file, if all threads are online "on" is returned, to
avoid changing behaviour for existing users. If some other number of
threads is online then the integer value is returned.

Architectures like x86 only supporting 1 thread or all threads, should not
define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architecture supporting partial SMT
states, like PowerPC, should define it.

[ ldufour: Slightly reword the commit's description ]
[ ldufour: Remove switch() in __store_smt_control() ]
[ ldufour: Rix build issue in control_show() ]

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-8-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman 38253464bc cpu/SMT: Create topology_smt_thread_allowed()
Some architectures allows partial SMT states, i.e. when not all SMT threads
are brought online.

To support that, add an architecture helper which checks whether a given
CPU is allowed to be brought online depending on how many SMT threads are
currently enabled. Since this is only applicable to architecture supporting
partial SMT, only these architectures should select the new configuration
variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, not
supporting the partial SMT states, there is no need to define
topology_cpu_smt_allowed(), the generic code assumed that all the threads
are allowed or only the primary ones.

Call the helper from cpu_smt_enable(), and cpu_smt_allowed() when SMT is
enabled, to check if the particular thread should be onlined. Notably,
also call it from cpu_smt_disable() if CPU_SMT_ENABLED, to allow
offlining some threads to move from a higher to lower number of threads
online.

[ ldufour: Slightly reword the commit's description ]
[ ldufour: Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-7-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Laurent Dufour 91b4a7dbfe cpu/SMT: Remove topology_smt_supported()
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(),
checking that value is enough to know whether SMT is supported.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman 447ae4ac41 cpu/SMT: Store the current/max number of threads
Some architectures allow partial SMT states at boot time, ie. when not all
SMT threads are brought online.

To support that the SMT code needs to know the maximum number of SMT
threads, and also the currently configured number.

The architecture code knows the max number of threads, so have the
architecture code pass that value to cpu_smt_set_num_threads(). Note that
although topology_max_smt_threads() exists, it is not configured early
enough to be used here. As architecture, like PowerPC, allows the threads
number to be set through the kernel command line, also pass that value.

[ ldufour: Slightly reword the commit message ]
[ ldufour: Rename cpu_smt_check_topology and add a num_threads argument ]

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-5-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman c53361ce7d cpu/SMT: Move smt/control simple exit cases earlier
Move the simple exit cases, i.e. those which don't depend on the value
written, earlier in the function. That makes it clearer that regardless of
the input those states cannot be transitioned out of.

That does have a user-visible effect, in that the error returned will
now always be EPERM/ENODEV for those states, regardless of the value
written. Previously writing an invalid value would return EINVAL even
when in those states.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-4-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Michael Ellerman 3f9169196b cpu/SMT: Move SMT prototypes into cpu_smt.h
In order to export the cpuhp_smt_control enum as part of the interface
between generic and architecture code, the architecture code needs to
include asm/topology.h.

But that leads to circular header dependencies. So split the enum and
related declarations into a separate header.

[ ldufour: Reworded the commit's description ]

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-3-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Laurent Dufour 7a4dcb4a5d cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
The commit 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to
CPUHP_BP_KICK_AP_STATE") introduce a dependancy against a global variable
cpu_primary_thread_mask exported by the X86 code. This variable is only
used when CONFIG_HOTPLUG_PARALLEL is set.

Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used
when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set.

No functional change.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-2-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Yonghong Song f835bb6222 bpf: Add kernel/bpftool asm support for new instructions
Add asm support for new instructions so kernel verifier and bpftool
xlated insn dumps can have proper asm syntax for new instructions.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:02 -07:00
Yonghong Song 4cd58e9af8 bpf: Support new 32bit offset jmp instruction
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 7058e3a31e bpf: Fix jit blinding with new sdiv/smov insns
Handle new insns properly in bpf_jit_blind_insn() function.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011225.3715812-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song ec0e2da95f bpf: Support new signed div/mod instructions.
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 0845c3db7b bpf: Support new unconditional bswap instruction
The existing 'be' and 'le' insns will do conditional bswap
depends on host endianness. This patch implements
unconditional bswap insns.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 1f1e864b65 bpf: Handle sign-extenstin ctx member accesses
Currently, if user accesses a ctx member with signed types,
the compiler will generate an unsigned load followed by
necessary left and right shifts.

With the introduction of sign-extension load, compiler may
just emit a ldsx insn instead. Let us do a final movsx sign
extension to the final unsigned ctx load result to
satisfy original sign extension requirement.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 8100928c88 bpf: Support new sign-extension mov insns
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 1f9a1ea821 bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Jakub Kicinski 014acf2668 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 15:22:46 -07:00
Azeem Shaikh c9732f1461 perf: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703165817.2840457-1-azeemshaikh38@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-27 08:51:25 -07:00
Rae Moar a547c4ce10 kunit: time: Mark test as slow using test attributes
Mark the time KUnit test, time64_to_tm_test_date_range, as slow using test
attributes.

This test ran relatively much slower than most other KUnit tests.

By marking this test as slow, the test can now be filtered using the KUnit
test attribute filtering feature. Example: --filter "speed>slow". This will
run only the tests that have speeds faster than slow. The slow attribute
will also be outputted in KTAP.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Rae Moar <rmoar@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-07-26 13:29:35 -06:00
Linus Torvalds 5f0bc0b042 mm: suppress mm fault logging if fatal signal already pending
Commit eda0047296 ("mm: make the page fault mmap locking killable")
intentionally made it much easier to trigger the "page fault fails
because a fatal signal is pending" situation, by having the mmap locking
fail early in that case.

We have long aborted page faults in other fatal cases when the actual IO
for a page is interrupted by SIGKILL - which is particularly useful for
the traditional case of NFS hanging due to network issues, but local
filesystems could cause it too if you happened to get the SIGKILL while
waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()).

So aborting the page fault wasn't a new condition - but it now triggers
earlier, before we even get to 'handle_mm_fault()'.  And as a result the
error doesn't go through our 'fault_signal_pending()' logic, and doesn't
get filtered away there.

Normally you'd never even notice, because if a fatal signal is pending,
the new SIGSEGV we send ends up being ignored anyway.

But it turns out that there is one very noticeable exception: if you
enable 'show_unhandled_signals', the aborted page fault will be logged
in the kernel messages, and you'll get a scary line looking something
like this in your logs:

  pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)

which is rather misleading.  It's not really a segfault at all, it's
just "the thread was killed before the page fault completed, so we
aborted the page fault".

Fix this by just making it clear that a pending fatal signal means that
any new signal coming in after that is implicitly handled.  This will
avoid the misleading logging, since now the signal isn't 'unhandled' any
more.

Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: eda0047296 ("mm: make the page fault mmap locking killable")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-26 10:51:59 -07:00
Chen Yu 4efcc8bc7e sched/topology: Align group flags when removing degenerate domain
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.

The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6b ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]

Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
2023-07-26 12:28:51 +02:00
Vincent Guittot c2e164ac33 sched/fair: remove util_est boosting
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.

cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.

In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util

Fixes: 7d0583cf9e ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
2023-07-26 12:28:50 +02:00
Masami Hiramatsu (Google) 1f9f4f4777 tracing/probes: Fix to add NULL check for BTF APIs
Since find_btf_func_param() abd btf_type_by_id() can return NULL,
the caller must check the return value correctly.

Link: https://lore.kernel.org/all/169024903951.395371.11361556840733470934.stgit@devnote2/

Fixes: b576e09701 ("tracing/probes: Support function parameters if BTF is available")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-26 12:53:38 +09:00
Arnd Bergmann 63e2da3b7f bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we
actually pass an uninitialized variable into another function call
if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT
is disabled:

kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
  200 |         dec_active(c, flags);

Avoid this by passing the flags by reference, so they either get
initialized and dereferenced through a pointer, or the pointer never
gets accessed at all.

Fixes: 18e027b1c7 ("bpf: Factor out inc/dec of active flag into helpers.")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:14:18 -07:00
Jiri Olsa d62cc390c2 bpf: Disable preemption in bpf_event_output
We received report [1] of kernel crash, which is caused by
using nesting protection without disabled preemption.

The bpf_event_output can be called by programs executed by
bpf_prog_run_array_cg function that disabled migration but
keeps preemption enabled.

This can cause task to be preempted by another one inside the
nesting protection and lead eventually to two tasks using same
perf_sample_data buffer and cause crashes like:

  BUG: kernel NULL pointer dereference, address: 0000000000000001
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  ...
  ? perf_output_sample+0x12a/0x9a0
  ? finish_task_switch.isra.0+0x81/0x280
  ? perf_event_output+0x66/0xa0
  ? bpf_event_output+0x13a/0x190
  ? bpf_event_output_data+0x22/0x40
  ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
  ? xa_load+0x87/0xe0
  ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
  ? release_sock+0x3e/0x90
  ? sk_setsockopt+0x1a1/0x12f0
  ? udp_pre_connect+0x36/0x50
  ? inet_dgram_connect+0x93/0xa0
  ? __sys_connect+0xb4/0xe0
  ? udp_setsockopt+0x27/0x40
  ? __pfx_udp_push_pending_frames+0x10/0x10
  ? __sys_setsockopt+0xdf/0x1a0
  ? __x64_sys_connect+0xf/0x20
  ? do_syscall_64+0x3a/0x90
  ? entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fixing this by disabling preemption in bpf_event_output.

[1] https://github.com/cilium/cilium/issues/26756
Cc: stable@vger.kernel.org
Reported-by: Oleg "livelace" Popov <o.popov@livelace.ru>
Closes: https://github.com/cilium/cilium/issues/26756
Fixes: 2a916f2f54 ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:06:37 -07:00
Jiri Olsa f2c67a3e60 bpf: Disable preemption in bpf_perf_event_output
The nesting protection in bpf_perf_event_output relies on disabled
preemption, which is guaranteed for kprobes and tracepoints.

However bpf_perf_event_output can be also called from uprobes context
through bpf_prog_run_array_sleepable function which disables migration,
but keeps preemption enabled.

This can cause task to be preempted by another one inside the nesting
protection and lead eventually to two tasks using same perf_sample_data
buffer and cause crashes like:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  BUG: unable to handle page fault for address: ffffffff82be3eea
  ...
  Call Trace:
   ? __die+0x1f/0x70
   ? page_fault_oops+0x176/0x4d0
   ? exc_page_fault+0x132/0x230
   ? asm_exc_page_fault+0x22/0x30
   ? perf_output_sample+0x12b/0x910
   ? perf_event_output+0xd0/0x1d0
   ? bpf_perf_event_output+0x162/0x1d0
   ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
   ? __uprobe_perf_func+0x12b/0x540
   ? uprobe_dispatcher+0x2c4/0x430
   ? uprobe_notify_resume+0x2da/0xce0
   ? atomic_notifier_call_chain+0x7b/0x110
   ? exit_to_user_mode_prepare+0x13e/0x290
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_exc_int3+0x35/0x40

Fixing this by disabling preemption in bpf_perf_event_output.

Cc: stable@vger.kernel.org
Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:05:53 -07:00
Tejun Heo aa6fde93f3 workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms which is long enough to do fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguable spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
apart by multiple levels of magnitude doesn't make whole lot of sense.

This patch scales up wq_cpu_intensive_thresh_us upto 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are sufficient number of
fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systemtically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2023-07-25 11:49:57 -10:00
Jiri Slaby bcb48185ed tty: sysrq: switch sysrq handlers from int to u8
The passed parameter to sysrq handlers is a key (a character). So change
the type from 'int' to 'u8'. Let it specifically be 'u8' for two
reasons:
* unsigned: unsigned values come from the upper layers (devices) and the
  tty layer assumes unsigned on most places, and
* 8-bit: as that what's supposed to be one day in all the layers built
  on the top of tty. (Currently, we use mostly 'unsigned char' and
  somewhere still only 'char'. (But that also translates to the former
  thanks to -funsigned-char.))

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de> # DRM
Acked-by: WANG Xuerui <git@xen0n.name> # loongarch
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20230712081811.29004-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-25 19:21:03 +02:00
Palmer Dabbelt ff09f6fd29 modpost, kallsyms: Treat add '$'-prefixed symbols as mapping symbols
Trying to restrict the '$'-prefix change to RISC-V caused some fallout,
so let's just treat all those symbols as special.

Fixes: c05780ef3c ("module: Ignore RISC-V mapping symbols too")
Link: https://lore.kernel.org/all/20230712015747.77263-1-wangkefeng.wang@huawei.com/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-07-24 12:09:47 -07:00
Jeff Layton 417d2b6b11 bpf: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-84-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:07 +02:00
Brian Geffon 005e8dddd4 PM: hibernate: don't store zero pages in the image file
On ChromeOS we've observed a considerable number of in-use pages filled with
zeros. Today with hibernate it's entirely possible that saveable pages are just
zero filled. Since we're already copying pages word-by-word in do_copy_page it
becomes almost free to determine if a page was completely filled with zeros.

This change introduces a new bitmap which will track these zero pages. If a page
is zero it will not be included in the saved image, instead to track these zero
pages in the image file we will introduce a new flag which we will set on the
packed PFN list. When reading back in the image file we will detect these zero
page PFNs and rebuild the zero page bitmap.

When the image is being loaded through calls to write_next_page if we encounter
a zero page we will silently memset it to 0 and then continue on to the next
page. Given the implementation in snapshot_read_next/snapshot_write_next this
change  will be transparent to non-compressed/compressed and swsusp modes of
operation.

To provide some concrete numbers from simple ad-hoc testing, on a device which
was lightly in use we saw that:

PM: hibernation: Image created (964408 pages copied, 548304 zero pages)

Of the approximately 6.2GB of saveable pages 2.2GB (36%) were just zero filled
and could be tracked entirely within the packed PFN list. The savings would
obviously be much lower for lzo compressed images, but even in the case of
compression not copying pages across to the compression threads will still
speed things up. It's also possible that we would see better overall compression
ratios as larger regions of "real data" would improve the compressibility.

Finally, such an approach could dramatically improve swsusp performance
as each one of those zero pages requires a write syscall to reload, by
handling it as part of the packed PFN list we're able to fully avoid
that.

Signed-off-by: Brian Geffon <bgeffon@google.com>
[ rjw: Whitespace adjustments, removal of redundant parentheses ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-24 09:51:59 +02:00
Linus Torvalds 3b4e48b800 Tracing fixes for 6.5-rc2:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
   can crash if the ring buffer is being resized. Disable swapping
   when this happens. The missed swap will be reported to the tracer.
 
 - Report error if the histogram fails to be created due to an error in
   adding a histogram variable, in event_hist_trigger_parse().
 
 - Remove unused declaration of tracing_map_set_field_descr().
 
 Chen Lin (1):
       ring-buffer: Do not swap cpu_buffer during resize process
 
 Mohamed Khalfella (1):
       tracing/histograms: Return an error if we fail to add histogram to hist_vars list
 
 YueHaibing (1):
       tracing: Remove unused extern declaration tracing_map_set_field_descr()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZL2IixQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsHAAQCS/VLpMOA5AS9JWvwuEnGAVymyJcGS
 jmnWkuMmf5fPpQD/di/xY1clLNhz6P7PAZvR3N6qw3AsNjPW/ZapDkrRWQA=
 =RoHL
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Swapping the ring buffer for snapshotting (for things like irqsoff)
   can crash if the ring buffer is being resized. Disable swapping when
   this happens. The missed swap will be reported to the tracer

 - Report error if the histogram fails to be created due to an error in
   adding a histogram variable, in event_hist_trigger_parse()

 - Remove unused declaration of tracing_map_set_field_descr()

* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/histograms: Return an error if we fail to add histogram to hist_vars list
  ring-buffer: Do not swap cpu_buffer during resize process
  tracing: Remove unused extern declaration tracing_map_set_field_descr()
2023-07-23 15:19:14 -07:00
Mohamed Khalfella 4b8b390516 tracing/histograms: Return an error if we fail to add histogram to hist_vars list
Commit 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.

Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com

Cc: stable@vger.kernel.org
Fixes: 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:18:52 -04:00
Chen Lin 8a96c0288d ring-buffer: Do not swap cpu_buffer during resize process
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.

This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
         echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
         echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
        echo irqsoff > /sys/kernel/debug/tracing/current_tracer
        sleep 1
        echo nop > /sys/kernel/debug/tracing/current_tracer
        sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &

A typical oops log is as follows, sometimes with other different oops logs.

[  231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[  231.713375] Modules linked in:
[  231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  231.716750] Hardware name: linux,dummy-virt (DT)
[  231.718152] Workqueue: events update_pages_handler
[  231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  231.721171] pc : rb_update_pages+0x378/0x3f8
[  231.722212] lr : rb_update_pages+0x25c/0x3f8
[  231.723248] sp : ffff800082b9bd50
[  231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[  231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[  231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[  231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[  231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[  231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[  231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[  231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[  231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[  231.744196] Call trace:
[  231.744892]  rb_update_pages+0x378/0x3f8
[  231.745893]  update_pages_handler+0x1c/0x38
[  231.746893]  process_one_work+0x1f0/0x468
[  231.747852]  worker_thread+0x54/0x410
[  231.748737]  kthread+0x124/0x138
[  231.749549]  ret_from_fork+0x10/0x20
[  231.750434] ---[ end trace 0000000000000000 ]---
[  233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  233.721696] Mem abort info:
[  233.721935]   ESR = 0x0000000096000004
[  233.722283]   EC = 0x25: DABT (current EL), IL = 32 bits
[  233.722596]   SET = 0, FnV = 0
[  233.722805]   EA = 0, S1PTW = 0
[  233.723026]   FSC = 0x04: level 0 translation fault
[  233.723458] Data abort info:
[  233.723734]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  233.724176]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  233.724589]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[  233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  233.726720] Modules linked in:
[  233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  233.727777] Hardware name: linux,dummy-virt (DT)
[  233.728225] Workqueue: events update_pages_handler
[  233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  233.729054] pc : rb_update_pages+0x1a8/0x3f8
[  233.729334] lr : rb_update_pages+0x154/0x3f8
[  233.729592] sp : ffff800082b9bd50
[  233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[  233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[  233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[  233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[  233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[  233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[  233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[  233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[  233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[  233.734418] Call trace:
[  233.734593]  rb_update_pages+0x1a8/0x3f8
[  233.734853]  update_pages_handler+0x1c/0x38
[  233.735148]  process_one_work+0x1f0/0x468
[  233.735525]  worker_thread+0x54/0x410
[  233.735852]  kthread+0x124/0x138
[  233.736064]  ret_from_fork+0x10/0x20
[  233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[  233.736959] ---[ end trace 0000000000000000 ]---

After analysis, the seq of the error is as follows [1-5]:

int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
			int cpu_id)
{
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//1. get cpu_buffer, aka cpu_buffer(A)
		...
		...
		schedule_work_on(cpu,
		 &cpu_buffer->update_pages_work);
		//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
		// update_pages_handler, do the update process, set 'update_done' in
		// complete(&cpu_buffer->update_done) and to wakeup resize process.
	//---->
		//3. Just at this moment, ring_buffer_swap_cpu is triggered,
		//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
		//ring_buffer_swap_cpu is called as the 'Call trace' below.

		Call trace:
		 dump_backtrace+0x0/0x2f8
		 show_stack+0x18/0x28
		 dump_stack+0x12c/0x188
		 ring_buffer_swap_cpu+0x2f8/0x328
		 update_max_tr_single+0x180/0x210
		 check_critical_timing+0x2b4/0x2c8
		 tracer_hardirqs_on+0x1c0/0x200
		 trace_hardirqs_on+0xec/0x378
		 el0_svc_common+0x64/0x260
		 do_el0_svc+0x90/0xf8
		 el0_svc+0x20/0x30
		 el0_sync_handler+0xb0/0xb8
		 el0_sync+0x180/0x1c0
	//<----

	/* wait for all the updates to complete */
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
		//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
		//for example, cpu_buffer(A)->update_done will leave be set 1, and will
		//not 'wait_for_completion' at the next resize round.
		  if (!cpu_buffer->nr_pages_to_update)
			continue;

		if (cpu_online(cpu))
			wait_for_completion(&cpu_buffer->update_done);
		cpu_buffer->nr_pages_to_update = 0;
	}
	...
}
	//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
	//Continuing to run in the wrong state, then oops occurs.

Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn

Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:09:25 -04:00
YueHaibing 1faf7e4a0b tracing: Remove unused extern declaration tracing_map_set_field_descr()
Since commit 08d43a5fa0 ("tracing: Add lock-free tracing_map"),
this is never used, so can be removed.

Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:08:14 -04:00
Miaohe Lin a3fdeeb3f1 cgroup: fix obsolete comment above cgroup_create()
Since commit 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID"),
cgrp is associated with its kernfs_node. Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21 09:54:48 -10:00
Haitao Huang 714e08cc3e cgroup/misc: Store atomic64_t reads to u64
Change 'new_usage' type to u64 so it can be compared with unsigned 'max'
and 'capacity' properly even if the value crosses the signed boundary.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21 08:10:06 -10:00
Xiu Jianfeng bf98354280 audit: correct audit_filter_inodes() definition
After changes in commit 0590b9335a ("fixing audit rule ordering mess,
part 1"), audit_filter_inodes() returns void, so if CONFIG_AUDITSYSCALL
not defined, it should be do {} while(0).

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-07-21 12:17:25 -04:00
Jakub Kicinski 59be3baa8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20 15:52:55 -07:00
Linus Torvalds 57f1f9dd3a Including fixes from BPF, netfilter, bluetooth and CAN.
Current release - regressions:
 
  - eth: r8169: multiple fixes for PCIe ASPM-related problems
 
  - vrf: fix RCU lockdep splat in output path
 
 Previous releases - regressions:
 
  - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
 
  - dsa: mv88e6xxx: do a final check before timing out when polling
 
  - nf_tables: fix sleep in atomic in nft_chain_validate
 
 Previous releases - always broken:
 
  - sched: fix undoing tcf_bind_filter() in multiple classifiers
 
  - bpf, arm64: fix BTI type used for freplace attached functions
 
  - can: gs_usb: fix time stamp counter initialization
 
  - nft_set_pipapo: fix improper element removal (leading to UAF)
 
 Misc:
 
  - net: support STP on bridge in non-root netns, STP prevents
    packet loops so not supporting it results in freezing systems
    of unsuspecting users, and in turn very upset noises being made
 
  - fix kdoc warnings
 
  - annotate various bits of TCP state to prevent data races
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmS5pp0ACgkQMUZtbf5S
 IrtudA/9Ep+URprI3tpv+VHOQMWtMd7lzz+wwEUDQSo2T6xdMcYbd1E4ZWWOPw/y
 jTIIVF3qde4nuI/MZtzGhvCD8v4bzhw10uRm4f4vhC2i+CzXr/UdOQSMqeZmJZgN
 vndixvRjHJKYxogOa+DjXgOiuQTQfuSfSnaai0kvw3zZzi4tev/Bdj6KZmFW+UK+
 Q7uQZ5n8tdE4UvUdj8Jek23SZ4kL+HtQOIdAAqyduQnYnax5L5sbep0TjuCjjkpK
 26rvmwYFJmEab4mC2T3Y7VDaXYM9M2f/EuFBMBVEohE3KPTTdT12WzLfJv7TTKTl
 hymfXgfmCXiZElzoQTJ69bFGbhqFaCJwhCUHFwYqkqj0bW9cXYJD2achpi3nVgnn
 CV8vfqJtkzdgh2bV2faG+1wmAm1wzHSURmT5NlnFaX6a6BYypaN7CERn7BnIdLM/
 YA2wud39bL0EJsic5e3gtlyJdfhtx7iqCMzE7S5FiUZvgOmUhBZ4IWkMs6Aq5PpL
 FLLgBSHGEIAdLVQGvXLjfQ/LeSrW8JsiSy6deztzR+ZflvvaBIP5y8sC3+KdxAvN
 3ybMsMEE5OK3i808aV3l6/8DLeAJ+DWuMc96Ix7Yyt2LXFnnV79DX49zJAEUWrc7
 54FnNzkgAO/Q9aEFmmQoFt5qZmoFHuNwcHBOmXARAatQqNCwDqk=
 =Xifr
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF, netfilter, bluetooth and CAN.

  Current release - regressions:

   - eth: r8169: multiple fixes for PCIe ASPM-related problems

   - vrf: fix RCU lockdep splat in output path

  Previous releases - regressions:

   - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set

   - dsa: mv88e6xxx: do a final check before timing out when polling

   - nf_tables: fix sleep in atomic in nft_chain_validate

  Previous releases - always broken:

   - sched: fix undoing tcf_bind_filter() in multiple classifiers

   - bpf, arm64: fix BTI type used for freplace attached functions

   - can: gs_usb: fix time stamp counter initialization

   - nft_set_pipapo: fix improper element removal (leading to UAF)

  Misc:

   - net: support STP on bridge in non-root netns, STP prevents packet
     loops so not supporting it results in freezing systems of
     unsuspecting users, and in turn very upset noises being made

   - fix kdoc warnings

   - annotate various bits of TCP state to prevent data races"

* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: phy: prevent stale pointer dereference in phy_init()
  tcp: annotate data-races around fastopenq.max_qlen
  tcp: annotate data-races around icsk->icsk_user_timeout
  tcp: annotate data-races around tp->notsent_lowat
  tcp: annotate data-races around rskq_defer_accept
  tcp: annotate data-races around tp->linger2
  tcp: annotate data-races around icsk->icsk_syn_retries
  tcp: annotate data-races around tp->keepalive_probes
  tcp: annotate data-races around tp->keepalive_intvl
  tcp: annotate data-races around tp->keepalive_time
  tcp: annotate data-races around tp->tsoffset
  tcp: annotate data-races around tp->tcp_tx_delay
  Bluetooth: MGMT: Use correct address for memcpy()
  Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
  Bluetooth: SCO: fix sco_conn related locking and validity issues
  Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
  Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
  Bluetooth: coredump: fix building with coredump disabled
  Bluetooth: ISO: fix iso_conn related locking and validity issues
  Bluetooth: hci_event: call disconnect callback before deleting conn
  ...
2023-07-20 14:46:39 -07:00
Xiu Jianfeng be4187faa8 audit: include security.h unconditionally
The ifdef-else logic is already in the header file, so include it
unconditionally, no functional changes here.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
[PM: fixed misspelling in the subject]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-07-20 15:06:58 -04:00
John Ogness 132a90d152 printk: Rename abandon_console_lock_in_panic() to other_cpu_in_panic()
Currently abandon_console_lock_in_panic() is only used to determine if
the current CPU should immediately release the console lock because
another CPU is in panic. However, later this function will be used by
the CPU to immediately release other resources in this situation.

Rename the function to other_cpu_in_panic(), which is a better
description and does not assume it is related to the console lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-8-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 9e70a5e109 printk: Add per-console suspended state
Currently the global @console_suspended is used to determine if
consoles are in a suspended state. Its primary purpose is to allow
usage of the console_lock when suspended without causing console
printing. It is synchronized by the console_lock.

Rather than relying on the console_lock to determine suspended
state, make it an official per-console state that is set within
console->flags. This allows the state to be queried via SRCU.

Remove @console_suspended. Console printing will still be avoided
when suspended because console_is_usable() returns false when
the new suspended flag is set for that console.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-7-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 696ffaf50e printk: Consolidate console deferred printing
Printing to consoles can be deferred for several reasons:

- explicitly with printk_deferred()
- printk() in NMI context
- recursive printk() calls

The current implementation is not consistent. For printk_deferred(),
irq work is scheduled twice. For NMI und recursive, panic CPU
suppression and caller delays are not properly enforced.

Correct these inconsistencies by consolidating the deferred printing
code so that vprintk_deferred() is the top-level function for
deferred printing and vprintk_emit() will perform whichever irq_work
queueing is appropriate.

Also add kerneldoc for wake_up_klogd() and defer_console_output() to
clarify their differences and appropriate usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-6-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness eacb04ff3c printk: Do not take console lock for console_flush_on_panic()
Currently console_flush_on_panic() will attempt to acquire the
console lock when flushing the buffer on panic. If it fails to
acquire the lock, it continues anyway because this is the last
chance to get any pending records printed.

The reason why the console lock was attempted at all was to
prevent any other CPUs from acquiring the console lock for
printing while the panic CPU was printing. But as of the
previous commit, non-panic CPUs will no longer attempt to
acquire the console lock in a panic situation. Therefore it is
no longer strictly necessary for a panic CPU to acquire the
console lock.

Avoiding taking the console lock when flushing in panic has
the additional benefit of avoiding possible deadlocks due to
semaphore usage in NMI context (semaphores are not NMI-safe)
and avoiding possible deadlocks if another CPU accesses the
semaphore and is stopped while holding one of the semaphore's
internal spinlocks.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-5-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 51a1d258e5 printk: Keep non-panic-CPUs out of console lock
When in a panic situation, non-panic CPUs should avoid holding the
console lock so as not to contend with the panic CPU. This is already
implemented with abandon_console_lock_in_panic(), which is checked
after each printed line. However, non-panic CPUs should also avoid
trying to acquire the console lock during a panic.

Modify console_trylock() to fail and console_lock() to block() when
called from a non-panic CPU during a panic.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-4-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 7b23a66db5 printk: Reduce console_unblank() usage in unsafe scenarios
A semaphore is not NMI-safe, even when using down_trylock(). Both
down_trylock() and up() are using internal spinlocks and up()
might even call wake_up_process().

In the panic() code path it gets even worse because the internal
spinlocks of the semaphore may have been taken by a CPU that has
been stopped.

To reduce the risk of deadlocks caused by the console semaphore in
the panic path, make the following changes:

- First check if any consoles have implemented the unblank()
  callback. If not, then there is no reason to take the console
  semaphore anyway. (This check is also useful for the non-panic
  path since the locking/unlocking of the console lock can be
  quite expensive due to console printing.)

- If the panic path is in NMI context, bail out without attempting
  to take the console semaphore or calling any unblank() callbacks.
  Bailing out is acceptable because console_unblank() would already
  bail out if the console semaphore is contended. The alternative of
  ignoring the console semaphore and calling the unblank() callbacks
  anyway is a bad idea because these callbacks are also not NMI-safe.

If consoles with unblank() callbacks exist and console_unblank() is
called from a non-NMI panic context, it will still attempt a
down_trylock(). This could still result in a deadlock if one of the
stopped CPUs is holding the semaphore internal spinlock. But this
is a risk that the kernel has been (and continues to be) willing
to take.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-3-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 6d3e0d8cc6 kdb: Do not assume write() callback available
It is allowed for consoles to not provide a write() callback. For
example ttynull does this.

Check if a write() callback is available before using it.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-2-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
Paul E. McKenney c924bf5a43 rcu: Clarify rcu_is_watching() kernel-doc comment
Make it clear that this function always returns either true or false
without other planned failure modes.

Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-07-19 13:21:28 -07:00
Alexei Starovoitov 6f5a630d7c bpf, net: Introduce skb_pointer_if_linear().
Network drivers always call skb_header_pointer() with non-null buffer.
Remove !buffer check to prevent accidental misuse of skb_header_pointer().
Introduce skb_pointer_if_linear() instead.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230718234021.43640-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:27:33 -07:00
Daniel Borkmann e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Daniel Borkmann 053c8e1f23 bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.

The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.

Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:

  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
        space application does not need CAP_SYS_ADMIN to retrieve foreign fds
        via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
      own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching
  - Internal revision counter and optionally being able to pass expected_revision
  - User space application can query current state with revision, and pass it
    along for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users

The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.

The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.

The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.

An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.

Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.

  [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Anton Protopopov 72829b1c1f bpf: allow any program to use the bpf_map_sum_elem_count kfunc
Register the bpf_map_sum_elem_count func for all programs, and update the
map_ptr subtest of the test_progs test to test the new functionality.

The usage is allowed as long as the pointer to the map is trusted (when
using tracing programs) or is a const pointer to map, as in the following
example:

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            ...
    } hash SEC(".maps");

    ...

    static inline int some_bpf_prog(void)
    {
            struct bpf_map *map = (struct bpf_map *)&hash;
            __s64 count;

            count = bpf_map_sum_elem_count(map);

            ...
    }

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:53 -07:00
Anton Protopopov 9c29804961 bpf: make an argument const in the bpf_map_sum_elem_count kfunc
We use the map pointer only to read the counter values, no locking
involved, so mark the argument as const.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov 5ba190c29c bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_map
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the
values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs.
This, in turn, allows users to execute trusted kfuncs which accept `struct
bpf_map *` arguments from non-tracing programs.

While exporting the btf_bpf_map_id variable, save some bytes by defining
it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST
(which is u32[64]).

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov 831deb2976 bpf: consider types listed in reg2btf_ids as trusted
The reg2btf_ids array contains a list of types for which we can (and need)
to find a corresponding static BTF id. All the types in the list can be
considered as trusted for purposes of kfuncs.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Peter Zijlstra d07f09a1f9 sched/fair: Propagate enqueue flags into place_entity()
This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra e4ec3318a1 sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra 5e963f2bd4 sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e8f331bcc2 sched/smp: Use lag to simplify cross-runqueue placement
Using lag is both more correct and simpler when moving between
runqueues.

Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 76cae9dbe1 sched/fair: Commit to lag based placement
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.

One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 147f3efaa2 sched/fair: Implement an EEVDF-like scheduling policy
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.

Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 86bfbb7ce4 sched/fair: Add lag based placement
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e0c2ff903c sched/fair: Remove sched_feat(START_DEBIT)
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra af4cf40470 sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
2023-07-19 09:43:58 +02:00
Ingo Molnar 752182b24b Linux 6.5-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmS0at0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG6m8H/RZCd2DWeM94CgK5
 DIxNLxu90PrEcrOnqeHFJtQoSiUQTHeseh9E4BH0JdPDxtlU89VwYRAevseWiVkp
 JyyJPB40UR4i2DO0P1+oWBBsGEG+bo8lZ1M+uxU5k6lgC0fAi96/O48mwwmI0Mtm
 P6BkWd3IkSXc7l4AGIrKa5P+pYEnm0Z6YAZILfdMVBcLXZWv1OAwEEkZNdUuhE3d
 5DxlQ8MLzlQIbe+LasiwdL9r606acFaJG9Hewl9x5DBlsObZ3d2rX4vEt1NEVh89
 shV29xm2AjCpLh4kstJFdTDCkDw/4H7TcFB/NMxXyzEXp3Bx8YXq+mjVd2mpq1FI
 hHtCsOA=
 =KiaU
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc2' into sched/core, to pick up fixes

Sync with upstream fixes before applying EEVDF.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-19 09:43:25 +02:00
Dave Marchevsky c3c510ce43 bpf: Add 'owner' field to bpf_{list,rb}_node
As described by Kumar in [0], in shared ownership scenarios it is
necessary to do runtime tracking of {rb,list} node ownership - and
synchronize updates using this ownership information - in order to
prevent races. This patch adds an 'owner' field to struct bpf_list_node
and bpf_rb_node to implement such runtime tracking.

The owner field is a void * that describes the ownership state of a
node. It can have the following values:

  NULL           - the node is not owned by any data structure
  BPF_PTR_POISON - the node is in the process of being added to a data
                   structure
  ptr_to_root    - the pointee is a data structure 'root'
                   (bpf_rb_root / bpf_list_head) which owns this node

The field is initially NULL (set by bpf_obj_init_field default behavior)
and transitions states in the following sequence:

  Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root
  Removal:   ptr_to_root -> NULL

Before a node has been successfully inserted, it is not protected by any
root's lock, and therefore two programs can attempt to add the same node
to different roots simultaneously. For this reason the intermediate
BPF_PTR_POISON state is necessary. For removal, the node is protected
by some root's lock so this intermediate hop isn't necessary.

Note that bpf_list_pop_{front,back} helpers don't need to check owner
before removing as the node-to-be-removed is not passed in as input and
is instead taken directly from the list. Do the check anyways and
WARN_ON_ONCE in this unexpected scenario.

Selftest changes in this patch are entirely mechanical: some BTF
tests have hardcoded struct sizes for structs that contain
bpf_{list,rb}_node fields, those were adjusted to account for the new
sizes. Selftest additions to validate the owner field are added in a
further patch in the series.

  [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Dave Marchevsky 0a1f7bfe35 bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node
Structs bpf_rb_node and bpf_list_node are opaquely defined in
uapi/linux/bpf.h, as BPF program writers are not expected to touch their
fields - nor does the verifier allow them to do so.

Currently these structs are simple wrappers around structs rb_node and
list_head and linked_list / rbtree implementation just casts and passes
to library functions for those data structures. Later patches in this
series, though, will add an "owner" field to bpf_{rb,list}_node, such
that they're not just wrapping an underlying node type. Moreover, the
bpf linked_list and rbtree implementations will deal with these owner
pointers directly in a few different places.

To avoid having to do

  void *owner = (void*)bpf_list_node + sizeof(struct list_head)

with opaque UAPI node types, add bpf_{list,rb}_node_kern struct
definitions to internal headers and modify linked_list and rbtree to use
the internal types where appropriate.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Kumar Kartikeya Dwivedi b5e9ad522c bpf: Repeat check_max_stack_depth for async callbacks
While the check_max_stack_depth function explores call chains emanating
from the main prog, which is typically enough to cover all possible call
chains, it doesn't explore those rooted at async callbacks unless the
async callback will have been directly called, since unlike non-async
callbacks it skips their instruction exploration as they don't
contribute to stack depth.

It could be the case that the async callback leads to a callchain which
exceeds the stack depth, but this is never reachable while only
exploring the entry point from main subprog. Hence, repeat the check for
the main subprog *and* all async callbacks marked by the symbolic
execution pass of the verifier, as execution of the program may begin at
any of them.

Consider functions with following stack depths:
main: 256
async: 256
foo: 256

main:
    rX = async
    bpf_timer_set_callback(...)

async:
    foo()

Here, async is not descended as it does not contribute to stack depth of
main (since it is referenced using bpf_pseudo_func and not
bpf_pseudo_call). However, when async is invoked asynchronously, it will
end up breaching the MAX_BPF_STACK limit by calling foo.

Hence, in addition to main, we also need to explore call chains
beginning at all async callback subprogs in a program.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Kumar Kartikeya Dwivedi ba7b3e7d5f bpf: Fix subprog idx logic in check_max_stack_depth
The assignment to idx in check_max_stack_depth happens once we see a
bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of
the code performs a few checks and then pushes the frame to the frame
stack, except the case of async callbacks. If the async callback case
causes the loop iteration to be skipped, the idx assignment will be
incorrect on the next iteration of the loop. The value stored in the
frame stack (as the subprogno of the current subprog) will be incorrect.

This leads to incorrect checks and incorrect tail_call_reachable
marking. Save the target subprog in a new variable and only assign to
idx once we are done with the is_async_cb check which may skip pushing
of frame to the frame stack and subsequent stack depth checks and tail
call markings.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Haitao Huang 32bf85c60c cgroup/misc: Change counters to be explicit 64bit types
So the variables can account for resources of huge quantities even on
32-bit machines.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-18 12:10:00 -10:00
Linus Torvalds 4806364acf Seven hotfixes, six of which are cc:stable and one of which addresses a
post-6.5 issue.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZLboHQAKCRDdBJ7gKXxA
 jqwtAP4m3MQNcYzQk8qbV+EQat/csTnrefytyD0ogFRoxcMAFAD/XT784sZzn4SU
 s/mL1HLk1BsubT/yQmY3lISXHDPuPAo=
 =5W3V
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Seven hotfixes, six of which are cc:stable and one of which addresses
  a post-6.5 issue"

* tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  maple_tree: fix node allocation testing on 32 bit
  maple_tree: fix 32 bit mas_next testing
  selftests/mm: mkdirty: fix incorrect position of #endif
  maple_tree: set the node limit when creating a new root node
  mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
  prctl: move PR_GET_AUXV out of PR_MCE_KILL
  selftests/mm: give scripts execute permission
2023-07-18 14:19:42 -07:00
Andrei Vagin 48a1084a8b seccomp: add the synchronous mode for seccomp_unotify
seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.

There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Andrei Vagin 6f63904c8f sched: add a few helpers to wake up tasks on the current cpu
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.

These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.

One example of such workloads is seccomp user notifies. This mechanism allows
the  supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.

On-CPU context switches are much faster than regular ones.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Peter Oskolkov ab83f455f0 sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Andrei Vagin 4943b66df1 seccomp: don't use semaphore and wait_queue together
The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:

* if we use two different ways, we always need to call them both. This
  patch fixes seccomp_notify_recv where we forgot to call wake_up_poll
  in the error path.

* If we use one primitive, we can control how many waiters are woken up
  for each request. Our goal is to wake up just one that will handle a
  request. Right now, wake_up_poll can wake up one waiter and
  up(&match->notif->request) can wake up one more.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-2-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Miguel Ojeda 636e348353 prctl: move PR_GET_AUXV out of PR_MCE_KILL
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was
applied [1].

Thus move it out of the switch, to the place the patch added it.

In the recently released v6.4 kernel some user could, in principle, be
already using this feature by mapping the right page and passing the
PR_GET_AUXV constant as a pointer:

    prctl(PR_MCE_KILL, PR_GET_AUXV, ...)

So this does change the behavior for users.  We could keep the bug since
the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET)
do not overlap.

However, v6.4 may be recent enough (2 weeks old) that moving the lines
(rather than just adding a new case) does not break anybody?  Moreover,
the documentation in man-pages was just committed today [2].

Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org
Fixes: ddc65971bb ("prctl: add PR_GET_AUXV to copy auxv to userspace")
Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1]
Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17 12:53:21 -07:00
Kamalesh Babulal c25ff4b911 cgroup: remove cgrp->kn check in css_populate_dir()
cgroup_create() creates cgrp and assigns the kernfs_node to cgrp->kn,
then cgroup_mkdir() populates base and csses cft file by calling
css_populate_dir() and cgroup_apply_control_enable() with a valid
cgrp->kn. Check for NULL cgrp->kn, will always be false, remove it.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:44:56 -10:00
Miaohe Lin 6f71780e7f cgroup: fix obsolete function name
cgroup_taskset_migrate() has been renamed to cgroup_migrate_execute() since
commit e595cd7069 ("cgroup: track migration context in cgroup_mgctx").
Update the corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:38:58 -10:00
Miaohe Lin fcbb485d9f cgroup: use cached local variable parent in for loop
Use local variable parent to initialize iter tcgrp in for loop so the size
of cgroup.o can be reduced by 64 bytes. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:34:41 -10:00
Peter Zijlstra f7853c3424 locking/rtmutex: Fix task->pi_waiters integrity
Henry reported that rt_mutex_adjust_prio_check() has an ordering
problem and puts the lie to the comment in [7]. Sharing the sort key
between lock->waiters and owner->pi_waiters *does* create problems,
since unlike what the comment claims, holding [L] is insufficient.

Notably, consider:

	A
      /   \
     M1   M2
     |     |
     B     C

That is, task A owns both M1 and M2, B and C block on them. In this
case a concurrent chain walk (B & C) will modify their resp. sort keys
in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L]
is meaningless, they're different Ls.

This then gives rise to a race condition between [7] and [11], where
the requeue of pi_waiters will observe an inconsistent tree order.

	B				C

  (holds M1->wait_lock,		(holds M2->wait_lock,
   holds B->pi_lock)		 holds A->pi_lock)

  [7]
  waiter_update_prio();
  ...
  [8]
  raw_spin_unlock(B->pi_lock);
  ...
  [10]
  raw_spin_lock(A->pi_lock);

				[11]
				rt_mutex_enqueue_pi();
				// observes inconsistent A->pi_waiters
				// tree order

Fixing this means either extending the range of the owner lock from
[10-13] to [6-13], with the immediate problem that this means [6-8]
hold both blocked and owner locks, or duplicating the sort key.

Since the locking in chain walk is horrible enough without having to
consider pi_lock nesting rules, duplicate the sort key instead.

By giving each tree their own sort key, the above race becomes
harmless, if C sees B at the old location, then B will correct things
(if they need correcting) when it walks up the chain and reaches A.

Fixes: fb00aca474 ("rtmutex: Turn the plist into an rb-tree")
Reported-by: Henry Wu <triangletrap12@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Henry Wu <triangletrap12@gmail.com>
Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
2023-07-17 13:59:10 +02:00
Linus Torvalds f61a89ca11 - Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmS0N8AACgkQEsHwGGHe
 VUq4qhAAhLlRz7V0txbvj65xrQAuF7RiqWQlIm6NYX+eWb5wpqasy8mTpVr37Lhg
 JJIdSlI0sw3vIgnjgmuLU6e6MKO2pDWCqppv7CB4YjTT6/ifyIWvFotHTtfOBDjy
 eTV/wEpfDKOKHJDdWHRdUjiDgktZs1gJVV8e65/ViuWoCEV3ARy7tS4liORiwNpr
 RTuG6C3lDAfw6RKWCvBxDV15XhMDNYYQzN1bwedTAryP8jAFFUKce2de6HK6qyUF
 pNqzX0eGkN+6TG13uKLJIsAfdydWHWvkXWqFzVZsS7Upu4lyAyjslHK5UEF/MgCy
 4c4/xaqcDTKn8dtQycQ3FMwujbamHdnct9or4aDsEa9PIIAcab/gKxojNPlBSkCt
 0FpBjJUP2X/DBGpMaNX02fwQhhXwj0dN/4OWI5kXrL2sctqzueMSfTUKpjsLZMAx
 VTngoXVcVlLxqB2LqIzjhsWG+awlw/IKEKotjhX4vyra9L50CRbMLJo7hJdHD16J
 dmGQmsJRlGyQYEOgoyI7BU/Bqs/+iE1r5fAcVgjWWUPotl7XKDgcwTpELTPj5xzO
 b7ALkUXRcHf2NdfaLDZeyJzT/Z1mq1znPTsDqi94f1ASbmAd0iXLFNHtyJxXJabZ
 /aTLUj/GmmaCw2R02S4l67uC/BHChToloNVCGsq2KfBYcUunod0=
 =bSNb
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Remove a cgroup from under a polling process properly

 - Fix the idle sibling selection

* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: use kernfs polling functions for PSI trigger polling
  sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-16 13:22:08 -07:00
Linus Torvalds 6eede0686f hardening fixes for v6.5-rc2
- Remove LTO-only suffixes from promoted global function symbols (Yonghong Song)
 
 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)
 
 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)
 
 - Claim maintainership of string routines
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSzO7IWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJim9EACS3Jor73jTPAKKzbjiPdJXCZpW
 EesWC+a4nJ/hV4+pqIMUqpF/TRaJd1PZNJPMblI/wSGO4VcI6AywVLrX67a1wFyO
 pJUp3O8lXj7g5uA3cFmCMM1KKtQqZb/qZuAaP7BfnFaATjRNjejr0CHzcqqY45Jm
 fMo/it1k0zDUAHKx112T7v+lMFxGxzbysPYbXzU62zf74gPPmilk+ZnfEvUV3Qe7
 sRjV1Zk0VXwqSx0ED4tPIpJEHCCy0uZpQdo/vhghjYxFecIkM9T8XaKyEY2rqv+a
 u4YaMc/31TXi33ajfCdi1WxbNrRB6WU2tCjPrMbzdoRvTi1zaxc8GSl1mcftmdJk
 /f5Pxq1Jcf7ZRC7Ujbag9pfncITHtMXWRb84boV0JdncxmdMUmyZGuUUhU+Ma8m4
 yqdHMstaMvdtcn9FigVep2r0Fa+Eh8H+jNZuixdaReZHEYbjmrPM1P8zLtDYOSfk
 TZnnt2KrxeudgkqpNAFibWNkv9waKfx/oUU4ljwRbjjzQdUofvqyV15s9f5mcBLU
 98EDHTZ/XbURrzbgLAui/oFHZNuuBqdjZCaHD/1SQoemPF6zmkY3Hv+8YYQitk7a
 i+VaSoukE/S1vIBCasCi+2qgOOCxH4orZ2Ewll9iw2VVO4x8o8UVHIPYI5JIT6F9
 Xx1CeYrKfWDdDUXKVA==
 =HUoI
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

 - Remove LTO-only suffixes from promoted global function symbols
   (Yonghong Song)

 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)

 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)

 - Claim maintainership of string routines

* tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  sparc: mark __arch_xchg() as __always_inline
  MAINTAINERS: Foolishly claim maintainership of string routines
  kallsyms: strip LTO-only suffixes from promoted global functions
  vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount
2023-07-16 12:18:18 -07:00
Linus Torvalds 4b4eef57e6 Probe fixes for 6.5-rc1, the 2nd set:
- fprobe: Add a comment why fprobe will be skipped if another kprobe is
    running in fprobe_kprobe_handler().
 
  - probe-events: Fix some issues related to fetch-argument
   . Fix double counting of the string length for user-string and symstr.
     This will require longer buffer in the array case.
   . Fix not to count error code (minus value) for the total used length
     in array argument. This makes the total used length shorter.
   . Fix to update dynamic used data size counter only if fetcharg uses
     the dynamic size data. This may mis-count the used dynamic data
     size and corrupt data.
   . Revert "tracing: Add "(fault)" name injection to kernel probes"
     because that did not work correctly with a bug, and we agreed the
     current '(fault)' output (instead of '"(fault)"' like a string)
     explains what happened more clearly.
   . Fix to record 0-length (means fault access) data_loc data in fetch
     function itself, instead of store_trace_args(). If we record an
     array of string, this will fix to save fault access data on each
     entry of the array correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmSxSlYACgkQ2/sHvwUr
 PxupyAgApFDi9YGsmrVbXmIN5y+yGMyio2H6xR7XkX+L02nvDY6uVqL/jgT8pHfI
 AeGZEA+EqwxIfWpYBfztsFej+Gl3Elfvu14OSxwaafUlW3mgZFQqw1ZR0HvzXoKJ
 8Iw6WOXjhLe3/QLy43UY8JQGOKI07i3gh71wa0W0huOyiwwHuuVwPSY9QJJ2ulSg
 OWFSuMFO8IxYimp0BpFu/vrfa8CdgWLc24tgJ5EpZtzu6L0A2I/FMZjnBukxnP9s
 rjAXv0uRuSFvvF7/RGCqrLza12525qyHx7d5IWUq5shd3bCnaUOnAieF//MoJaR3
 q8McDJK//EPbUvCWgESuuyPS05smyQ==
 =iumA
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe fixes from Masami Hiramatsu:

 - fprobe: Add a comment why fprobe will be skipped if another kprobe is
   running in fprobe_kprobe_handler().

 - probe-events: Fix some issues related to fetch-arguments:

    - Fix double counting of the string length for user-string and
      symstr. This will require longer buffer in the array case.

    - Fix not to count error code (minus value) for the total used
      length in array argument. This makes the total used length
      shorter.

    - Fix to update dynamic used data size counter only if fetcharg uses
      the dynamic size data. This may mis-count the used dynamic data
      size and corrupt data.

    - Revert "tracing: Add "(fault)" name injection to kernel probes"
      because that did not work correctly with a bug, and we agreed the
      current '(fault)' output (instead of '"(fault)"' like a string)
      explains what happened more clearly.

    - Fix to record 0-length (means fault access) data_loc data in fetch
      function itself, instead of store_trace_args(). If we record an
      array of string, this will fix to save fault access data on each
      entry of the array correctly.

* tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
  Revert "tracing: Add "(fault)" name injection to kernel probes"
  tracing/probes: Fix to update dynamic data counter if fetcharg uses it
  tracing/probes: Fix not to count error code to total length
  tracing/probes: Fix to avoid double count of the string length on the array
  fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
2023-07-16 12:13:51 -07:00
Paul E. McKenney e40806e9bc clocksource: Handle negative skews in "skew is too large" messages
The nanosecond-to-millisecond skew computation uses unsigned arithmetic,
which produces user-unfriendly large positive numbers for negative skews.
Therefore, use signed arithmetic for this computation in order to preserve
the negativity.

Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Fixes: dd02926994 ("clocksource: Improve "skew is too large" messages")
Reviewed-by: Feng Tang <feng.tang@intel.com>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:17:09 -07:00
Connor O'Brien e2a0b786c5 torture: Support randomized shuffling for proxy exec testing
Currently shuffling sets the same cpu affinities for all tasks,
which makes us less likely to hit paths involving migrating
blocked tasks onto a cpu where they can't run.

This patch adds an element of randomness to allow affinities of
different writer tasks to diverge.

This has helped uncover issues in testing with Proxy Execution

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:04:09 -07:00
Paul E. McKenney 9cafe974cf rcutorture: Dump grace-period state upon rtort_pipe_count incidents
The rtort_pipe_count WARN() indicates that grace periods were unable
to invoke all callbacks during a stutter_wait() interval.  But it is
sometimes helpful to have a bit more information as to why.  This commit
therefore invokes show_rcu_gp_kthreads() immediately before that WARN()
in order to dump out some relevant information.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:04:09 -07:00
Paul E. McKenney 4a71be9387 scftorture: Pause testing after memory-allocation failure
The scftorture test can quickly execute a large number of calls to no-wait
smp_call_function(), each of which holds a block of memory until the
corresponding handler is invoked.  Especially when the longwait module
parameter is specified, this can chew up an arbitrarily large amount
of memory.  This commit therefore blocks after each memory-allocation
failure, with the duration a function of longwait.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:02:57 -07:00
Paul E. McKenney 013608cd08 scftorture: Forgive memory-allocation failure if KASAN
Kernels built with CONFIG_KASAN=y quarantine newly freed memory in order
to better detect use-after-free errors.  However, this can exhaust memory
more quickly in allocator-heavy tests, which can result in spurious
scftorture failure.  This commit therefore forgives memory-allocation
failure in kernels built with CONFIG_KASAN=y, but continues counting
the errors for use in detailed test-result analyses.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:02:57 -07:00
Zqiang e60c122a16 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
The rcuscale.holdoff module parameter can be used to delay the start
of rcu_scale_writer() kthread.  However, the hung-task timeout will
trigger when the timeout specified by rcuscale.holdoff is greater than
hung_task_timeout_secs:

runqemu kvm nographic slirp qemuparams="-smp 4 -m 2048M"
bootparams="rcuscale.shutdown=0 rcuscale.holdoff=300"

[  247.071753] INFO: task rcu_scale_write:59 blocked for more than 122 seconds.
[  247.072529]       Not tainted 6.4.0-rc1-00134-gb9ed6de8d4ff #7
[  247.073400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.074331] task:rcu_scale_write state:D stack:30144 pid:59    ppid:2      flags:0x00004000
[  247.075346] Call Trace:
[  247.075660]  <TASK>
[  247.075965]  __schedule+0x635/0x1280
[  247.076448]  ? __pfx___schedule+0x10/0x10
[  247.076967]  ? schedule_timeout+0x2dc/0x4d0
[  247.077471]  ? __pfx_lock_release+0x10/0x10
[  247.078018]  ? enqueue_timer+0xe2/0x220
[  247.078522]  schedule+0x84/0x120
[  247.078957]  schedule_timeout+0x2e1/0x4d0
[  247.079447]  ? __pfx_schedule_timeout+0x10/0x10
[  247.080032]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.080591]  ? __pfx_process_timeout+0x10/0x10
[  247.081163]  ? __pfx_sched_set_fifo_low+0x10/0x10
[  247.081760]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.082287]  rcu_scale_writer+0x6b1/0x7f0
[  247.082773]  ? mark_held_locks+0x29/0xa0
[  247.083252]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.083865]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.084412]  kthread+0x179/0x1c0
[  247.084759]  ? __pfx_kthread+0x10/0x10
[  247.085098]  ret_from_fork+0x2c/0x50
[  247.085433]  </TASK>

This commit therefore replaces schedule_timeout_uninterruptible() with
schedule_timeout_idle().

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Arnd Bergmann e0a34641eb rcuscale: fix building with RCU_TINY
Both the CONFIG_TASKS_RCU and CONFIG_TASKS_RUDE_RCU options
are broken when RCU_TINY is enabled as well, as some functions
are missing a declaration.

In file included from kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1271:21: error: no previous prototype for 'get_rcu_tasks_rude_gp_kthread' [-Werror=missing-prototypes]
 1271 | struct task_struct *get_rcu_tasks_rude_gp_kthread(void)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/rcu/rcuscale.c:330:27: error: 'get_rcu_tasks_rude_gp_kthread' undeclared here (not in a function); did you mean 'get_rcu_tasks_trace_gp_kthread'?
  330 |         .rso_gp_kthread = get_rcu_tasks_rude_gp_kthread,
      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                           get_rcu_tasks_trace_gp_kthread

In file included from /home/arnd/arm-soc/kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1113:21: error: no previous prototype for 'get_rcu_tasks_gp_kthread' [-Werror=missing-prototypes]
 1113 | struct task_struct *get_rcu_tasks_gp_kthread(void)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~

Also, building with CONFIG_TASKS_RUDE_RCU but not CONFIG_TASKS_RCU is
broken because of some missing stub functions:

kernel/rcu/rcuscale.c:322:27: error: 'tasks_scale_read_lock' undeclared here (not in a function); did you mean 'srcu_scale_read_lock'?
  322 |         .readlock       = tasks_scale_read_lock,
      |                           ^~~~~~~~~~~~~~~~~~~~~
      |                           srcu_scale_read_lock
kernel/rcu/rcuscale.c:323:27: error: 'tasks_scale_read_unlock' undeclared here (not in a function); did you mean 'srcu_scale_read_unlock'?
  323 |         .readunlock     = tasks_scale_read_unlock,
      |                           ^~~~~~~~~~~~~~~~~~~~~~~
      |                           srcu_scale_read_unlock

Move the declarations outside of the RCU_TINY #ifdef and duplicate the
shared stub functions to address all of the above.

Fixes: 88d7ff818f0ce ("rcuscale: Add RCU Tasks Rude testing")
Fixes: 755f1c5eb416b ("rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney a15ec57cfc rcuscale: Add RCU Tasks Rude testing
Add a "tasks-rude" option to the rcuscale.scale_type module parameter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 271a8467a5 rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time
This commit causes RCU Tasks Trace to output the CPU time consumed by
its grace-period kthread.  The CPU time is whatever is in the designated
task's current->stime field, and thus is controlled by whatever CPU-time
accounting scheme is in effect.

This output appears in microseconds as follows on the console:

rcu_scale: Grace-period kthread CPU time: 42367.037

[ paulmck: Apply Willy Tarreau feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 5f8e320269 rcuscale: Measure grace-period kthread CPU time
This commit adds the ability to output the CPU time consumed by the
grace-period kthread for the RCU variant under test.  The CPU time is
whatever is in the designated task's current->stime field, and thus is
controlled by whatever CPU-time accounting scheme is in effect.

This output appears in microseconds as follows on the console:

rcu_scale: Grace-period kthread CPU time: 42367.037

[ paulmck: Apply feedback from Stephen Rothwell and kernel test robot. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
2023-07-14 15:01:49 -07:00
Paul E. McKenney bb7bad3dae rcuscale: Print out full set of kfree_rcu parameters
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
2023-07-14 15:01:49 -07:00
Paul E. McKenney c68465dfaa rcuscale: Print out full set of module parameters
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 7221f493c5 rcuscale: Add minruntime module parameter
By default, rcuscale collects only 100 points of data per writer, but
arranging for all kthreads to be actively collecting (if not recording)
data during the time that any kthread might be recording.  This works
well, but does not allow much time to bring external performance tools
to bear.  This commit therefore adds a minruntime module parameter
that specifies a minimum data-collection interval in seconds.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney ee7516a163 rcuscale: Fix gp_async_max typo: s/reader/writer/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 2226f3dc05 rcuscale: Permit blocking delays between writers
Some workloads do isolated RCU work, and this can affect operation
latencies.  This commit therefore adds a writer_holdoff_jiffies module
parameter that causes writers to block for the specified number of
jiffies between each pair of consecutive write-side operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:48 -07:00
Paul E. McKenney b5a2801fc0 refscale: Add a "jiffies" test
This commit adds a "jiffies" test to refscale, allowing use of jiffies
to be compared to ktime_get_real_fast_ns().  On my x86 laptop, jiffies
is more than 20x faster.  (Though for many uses, the tens-of-nanoseconds
overhead of ktime_get_real_fast_ns() will be just fine.)

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:04 -07:00
Waiman Long f5063e8948 refscale: Fix uninitalized use of wait_queue_head_t
Running the refscale test occasionally crashes the kernel with the
following error:

[ 8569.952896] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 8569.952900] #PF: supervisor read access in kernel mode
[ 8569.952902] #PF: error_code(0x0000) - not-present page
[ 8569.952904] PGD c4b048067 P4D c4b049067 PUD c4b04b067 PMD 0
[ 8569.952910] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 8569.952916] Hardware name: Dell Inc. PowerEdge R750/0WMWCR, BIOS 1.2.4 05/28/2021
[ 8569.952917] RIP: 0010:prepare_to_wait_event+0x101/0x190
  :
[ 8569.952940] Call Trace:
[ 8569.952941]  <TASK>
[ 8569.952944]  ref_scale_reader+0x380/0x4a0 [refscale]
[ 8569.952959]  kthread+0x10e/0x130
[ 8569.952966]  ret_from_fork+0x1f/0x30
[ 8569.952973]  </TASK>

The likely cause is that init_waitqueue_head() is called after the call to
the torture_create_kthread() function that creates the ref_scale_reader
kthread.  Although this init_waitqueue_head() call will very likely
complete before this kthread is created and starts running, it is
possible that the calling kthread will be delayed between the calls to
torture_create_kthread() and init_waitqueue_head().  In this case, the
new kthread will use the waitqueue head before it is properly initialized,
which is not good for the kernel's health and well-being.

The above crash happened here:

	static inline void __add_wait_queue(...)
	{
		:
		if (!(wq->flags & WQ_FLAG_PRIORITY)) <=== Crash here

The offset of flags from list_head entry in wait_queue_entry is
-0x18. If reader_tasks[i].wq.head.next is NULL as allocated reader_task
structure is zero initialized, the instruction will try to access address
0xffffffffffffffe8, which is exactly the fault address listed above.

This commit therefore invokes init_waitqueue_head() before creating
the kthread.

Fixes: 653ed64b01 ("refperf: Add a test to measure performance of read-side synchronization")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:04 -07:00
Paul E. McKenney db13710a03 rcu-tasks: Cancel callback laziness if too many callbacks
The various RCU Tasks flavors now do lazy grace periods when there are
only asynchronous grace period requests.  By default, the system will let
250 milliseconds elapse after the first call_rcu_tasks*() callbacki is
queued before starting a grace period.  In contrast, synchronous grace
period requests such as synchronize_rcu_tasks*() will start a grace
period immediately.

However, invoking one of the call_rcu_tasks*() functions in a too-tight
loop can result in a callback flood, which in turn can exhaust memory
if grace periods are delayed for too long.

This commit therefore sets a limit so that the grace-period kthread
will be awakened when any CPU's callback list expands to contain
rcupdate.rcu_task_lazy_lim callbacks elements (defaulting to 32, set to -1
to disable), the grace-period kthread will be awakened, thus cancelling
any ongoing laziness and getting out in front of the potential callback
flood.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney 450d461aa6 rcu-tasks: Add kernel boot parameters for callback laziness
This commit adds kernel boot parameters for callback laziness, allowing
the RCU Tasks flavors to be individually adjusted.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney 5ae769c611 rcu-tasks: Remove redundant #ifdef CONFIG_TASKS_RCU
The kernel/rcu/tasks.h file has a #endif immediately followed by an

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney d119357d07 rcu-tasks: Treat only synchronous grace periods urgently
The performance requirements on RCU Tasks, and in particular on RCU
Tasks Trace, have evolved over time as the workloads have evolved.
The current implementation is designed to provide low grace-period
latencies, and also to accommodate short-duration floods of callbacks.

However, current workloads can also provide a constant background
callback-queuing rate of a few hundred call_rcu_tasks_trace() invocations
per second.  This results in continuous back-to-back RCU Tasks Trace
grace periods, which in turn can consume the better part of 10% of a CPU.
One could take the attitude that there are several tens of other CPUs on
the systems running such workloads, but energy efficiency is a thing.
On these systems, although asynchronous grace-period requests happen
every few milliseconds, synchronous grace-period requests are quite rare.

This commit therefore arrranges for grace periods to be initiated
immediately in response to calls to synchronize_rcu_tasks*() and
also to calls to synchronize_rcu_mult() that are passed one of the
call_rcu_tasks*() functions.  These are recognized by the tell-tale
wakeme_after_rcu callback function.

In other cases, callbacks are gathered up for up to about 250 milliseconds
before a grace period is initiated.  This results in more than an order of
magnitude reduction in RCU Tasks Trace grace periods, with corresponding
reduction in consumption of CPU time.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:11 -07:00
Randy Dunlap 67b3f564cb time: add kernel-doc in time.c
Add kernel-doc for all APIs that do not already have it.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Acked-by: John Stultz <jstultz@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230704052405.5089-3-rdunlap@infradead.org
2023-07-14 13:47:07 -06:00
Linus Torvalds bde7f15027 Power management fixes for 6.5-rc2
- Unbreak the /sys/power/resume interface after recent changes (Azat
    Khuzhin).
 
  - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
    Yang).
 
  - Remove __init from cpufreq callbacks in the sparc driver, because
    they may be called after initialization too (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmSxhIcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxfdQP/1+MxGLh2XVVKvB9QTI0/ofYPxPfoTuf
 5Sf2lOa7rNeauc2xnQFM6EMOeme4ckTUbrO7AkZVACYVqbKJ92IUBJfo3R2Ar+1Z
 9TogwG+YOX3OjR8QtiGHLwA5fvOgbt89JaDn2ZCWcS+gHARZ3VMgdbDt+C3MUldV
 UVr/5kAkWefv39PIYHCwAJU7Xhn97B5nW58KgpkHuxOcHDKe0VfdxLcKBsyoc6QE
 IGMVV2WtnoyEdM1aNfZ37+3NksiIdZMA6OvM5C/7HOs6IqJaFxVUxm4333sM5AW9
 y5LPYSBbedVxICdLkUUq8W5MDRNCTPXgC3gexEu0XtWdAV9AG+9aNeZytT7KGrLO
 xe4vbl6s1LnxC6YBw2bB+U/DbLtxQrAP1nYZj6yxhbHVsnTTZg7Qvevz7nAdPlOL
 3FsutIT+9OQprWXxYBRv3AumF+hpG1bm8Zutyaqb2vwRwMbbXWTpzRry4ydp6bTj
 VB2YWeQOxCKl+dC5jXM1wfPjbsqWQvvGZVh1VIzlcZgzqALWf+F5fTMUKuY963kd
 V6fR1YmKg+Xxb+BU9mEjaUMfH5Yr8Mv9Gpf7D87MTsNTluFjAmRWk5a0ahVAXcwe
 n1jFxBNUsuJuvz2KwWrVZExT2xqJ8kVfnOxNcevtaXK+uk8+jE94lwjJn0P9v8w8
 e+0QABkgUYY2
 =b/om
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix hibernation (after recent changes), frequency QoS and the
  sparc cpufreq driver.

  Specifics:

   - Unbreak the /sys/power/resume interface after recent changes (Azat
     Khuzhin).

   - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
     Yang).

   - Remove __init from cpufreq callbacks in the sparc driver, because
     they may be called after initialization too (Viresh Kumar)"

* tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: sparc: Don't mark cpufreq callbacks with __init
  PM: QoS: Restore support for default value on frequency QoS
  PM: hibernate: Fix writing maj:min to /sys/power/resume
2023-07-14 11:07:04 -07:00
Rafael J. Wysocki d121758da6 Merge branches 'pm-sleep' and 'pm-qos'
Merge a PM QoS fix and a hibernation fix for 6.5-rc2.

 - Unbreak the /sys/power/resume interface after recent changes (Azat
   Khuzhin).

 - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
   Yang).

* pm-sleep:
  PM: hibernate: Fix writing maj:min to /sys/power/resume

* pm-qos:
  PM: QoS: Restore support for default value on frequency QoS
2023-07-14 19:13:21 +02:00
Masami Hiramatsu (Google) 797311bce5 tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails
to get the string data.
Currently those expect that the data_loc is updated by store_trace_args() if
it returns the error code. However, that does not work correctly if the
argument is an array of strings. In that case, store_trace_args() only clears
the first entry of the array (which may have no error) and leaves other
entries. So it should be cleared by fetch_store_string*() itself.
Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated
only if it is used (ret > 0 and argument is a dynamic data.)

Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/

Fixes: 40b53b7718 ("tracing: probeevent: Add array type support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-14 17:04:58 +09:00
Jakub Kicinski d2afa89f66 for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSwqwoACgkQ6rmadz2v
 bTqOHRAAn+fzTLqUqsveFQcxOkie5MPHxKoOTjG4+yFR7rzPkU6Mn5RX3w5yFzSn
 RqutwykF9OgipAzC3QXv4pRJuq6Gia5nvwUSDP4CX273ljyeF54DK7HfopE1+YrK
 HXyBWZvVvMZP6q7qQyQ3qtbHZSjs5XP/M6YBlJ5zo/BTLFCyvbSDP14YKEqcBkWG
 ld72ElXFxlnr/zEfRjzBCfMlbmgeHLO0SiHS/9827zEmNP1AAH5/ETA7/rJ7yCJs
 QNQUIoJWob8xm5FMJ6CU/+sOqXR1CY053meGJFFBX5pvVD/CLRhrwHn0IMCyQqmh
 wKR5waeXhpl/CKNeFuxXVMNFiXbqBb/0LYJaJtrMysjMLTsQ9X7NkrDBa/+kYGyZ
 +ghGlaMQvPqUGg0rLH2nl9JNB8Ne/8prLMsAKUWnPuOo+Q03j054gnqhGeNtDd5b
 gpSk+7x93PlhGcegBV1Wk8dkiGC5V9nTVNxg40XQUCs4k9L/8Vjc35Tjqx7nBTNH
 DiFD24DDKUZacw9L6nEqvLF/N2fiRjtUZnVPC0yn/annyBcfX1s+ZH2Tu1F6Qk38
 QMfBCnt12exmsiDoxdzzGJtjHnS/k5fsaKjlR21mOyMrIH7ipltr5UHHrdr1hBP6
 24uSeTImvQQKDi+9IuXN127jZDOupKqVS6csrA0ZXrlKWh2HR+U=
 =GVUB
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-07-13

We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).

The main changes are:

1) Fix bpftool build in presence of stale vmlinux.h,
   from Alexander Lobakin.

2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
   from Alexei Starovoitov.

3) Teach verifier actual bounds of bpf_get_smp_processor_id()
   and fix perf+libbpf issue related to custom section handling,
   from Andrii Nakryiko.

4) Introduce bpf map element count, from Anton Protopopov.

5) Check skb ownership against full socket, from Kui-Feng Lee.

6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.

7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.

8) Fix BTF walking of unions, from Yafang Shao.

9) Extend link_info for kprobe_multi and perf_event links,
   from Yafang Shao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
  selftests/bpf: Add selftest for PTR_UNTRUSTED
  bpf: Fix an error in verifying a field in a union
  selftests/bpf: Add selftests for nested_trust
  bpf: Fix an error around PTR_UNTRUSTED
  selftests/bpf: add testcase for TRACING with 6+ arguments
  bpf, x86: allow function arguments up to 12 for TRACING
  bpf, x86: save/restore regs with BPF_DW size
  bpftool: Use "fallthrough;" keyword instead of comments
  bpf: Add object leak check.
  bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  selftests/bpf: Improve test coverage of bpf_mem_alloc.
  rcu: Export rcu_request_urgent_qs_task()
  bpf: Allow reuse from waiting_for_gp_ttrace list.
  bpf: Add a hint to allocated objects.
  bpf: Change bpf_mem_cache draining process.
  bpf: Further refactor alloc_bulk().
  bpf: Factor out inc/dec of active flag into helpers.
  bpf: Refactor alloc_bulk().
  bpf: Let free_all() return the number of freed elements.
  ...
====================

Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13 19:13:24 -07:00
Yafang Shao 33937607ef bpf: Fix an error in verifying a field in a union
We are utilizing BPF LSM to monitor BPF operations within our container
environment. When we add support for raw_tracepoint, it hits below
error.

; (const void *)attr->raw_tracepoint.name);
27: (79) r3 = *(u64 *)(r2 +0)
access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8

It can be reproduced with below BPF prog.

SEC("lsm/bpf")
int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
{
	switch (cmd) {
	case BPF_RAW_TRACEPOINT_OPEN:
		bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
		break;
	default:
		break;
	}
	return 0;
}

The reason is that when accessing a field in a union, such as bpf_attr,
if the field is located within a nested struct that is not the first
member of the union, it can result in incorrect field verification.

  union bpf_attr {
      struct {
          __u32 map_type; <<<< Actually it will find that field.
          __u32 key_size;
          __u32 value_size;
         ...
      };
      ...
      struct {
          __u64 name;    <<<< We want to verify this field.
          __u32 prog_fd;
      } raw_tracepoint;
  };

Considering the potential deep nesting levels, finding a perfect
solution to address this issue has proven challenging. Therefore, I
propose a solution where we simply skip the verification process if the
field in question is located within a union.

Fixes: 7e3617a72d ("bpf: Add array support to btf_struct_access")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Yafang Shao 7ce4dc3e4a bpf: Fix an error around PTR_UNTRUSTED
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been
cleared when we start to walk a new struct, because the struct in
question may be a struct nested in a union. We should also check and set
this flag before we walk its each member, in case itself is a union.
We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL.

Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Linus Torvalds b1983d427a Networking fixes for 6.5-rc2, including fixes from netfilter,
wireless and ebpf
 
 Current release - regressions:
 
   - netfilter: conntrack: gre: don't set assured flag for clash entries
 
   - wifi: iwlwifi: remove 'use_tfh' config to fix crash
 
 Previous releases - regressions:
 
   - ipv6: fix a potential refcount underflow for idev
 
   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev()
 
   - bpf: fix max stack depth check for async callbacks
 
   - eth: mlx5e:
     - check for NOT_READY flag state after locking
     - fix page_pool page fragment tracking for XDP
 
   - eth: igc:
     - fix tx hang issue when QBV gate is closed
     - fix corner cases for TSN offload
 
   - eth: octeontx2-af: Move validation of ptp pointer before its usage
 
   - eth: ena: fix shift-out-of-bounds in exponential backoff
 
 Previous releases - always broken:
 
   - core: prevent skb corruption on frag list segmentation
 
   - sched:
     - cls_fw: fix improper refcount update leads to use-after-free
     - sch_qfq: account for stab overhead in qfq_enqueue
 
   - netfilter:
     - report use refcount overflow
     - prevent OOB access in nft_byteorder_eval
 
   - wifi: mt7921e: fix init command fail with enabled device
 
   - eth: ocelot: fix oversize frame dropping for preemptible TCs
 
   - eth: fec: recycle pages for transmitted XDP frames
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmSv1YISHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkpQgP/1msj0MlIWJnMgzPiMonDSe746JGTg/j
 YengEjqcy3ozC4COBEeyBO6ilt6I+Wrb5H5jimn9h2djB+D7htWNaejQaqJrBxph
 F4lUC6OJqd2ncI3tXAG2BSX1duzDr6B7yL7d5InFIczw8vNh+chsyX0sjlzU12bt
 ppjcSb+Ffc796DB0ItJkBqluxcpjyXE15ZWTTV4GEHK6RoRdxNIGjd7NgvD8podB
 Q/464bHs1jJYkAavuobiOXV2fuxWLTs77E0Vloizoo+42UiRFMLJk+RX98PhSIMa
 eejkxfm+H6+6Qi2omYepvf7vDN3GtLjxbr5C3mTdWPuL4QbNY8agVJ7sS4XnL5/v
 B7EAjyGQK9SmD36zTu7QL/Ul6fSnRq8jz20B0mDa0imAWzi58A+jqbQAMoVOMSS+
 Uv4yKJpIUyx7mUI77+EX3U9r1wytw5eniatTDU+GAsQb2CJ43CqDmn/7RcmGacBo
 P1q+il9JW4kzUQrisUSxmQDfpBvQi5wiygiEdUNI5FEhq6/iKe/lrJnmJZpaLkd5
 P3oEKjapamAmcyrEr/7VD1Mb4jrRfpB7zVn/5OyvywbcLQxA+531iPpy4r4W6cWH
 1MRLBVVHKyb3jfm8J3T4lpDEzd03+MiPS8JiKMUYYNUYkY8tYp92muwC7z2sGI4M
 6eR2MeKD4vds
 =cELX
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter, wireless and ebpf.

  Current release - regressions:

   - netfilter: conntrack: gre: don't set assured flag for clash entries

   - wifi: iwlwifi: remove 'use_tfh' config to fix crash

  Previous releases - regressions:

   - ipv6: fix a potential refcount underflow for idev

   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in
     icmp6_dev()

   - bpf: fix max stack depth check for async callbacks

   - eth: mlx5e:
      - check for NOT_READY flag state after locking
      - fix page_pool page fragment tracking for XDP

   - eth: igc:
      - fix tx hang issue when QBV gate is closed
      - fix corner cases for TSN offload

   - eth: octeontx2-af: Move validation of ptp pointer before its usage

   - eth: ena: fix shift-out-of-bounds in exponential backoff

  Previous releases - always broken:

   - core: prevent skb corruption on frag list segmentation

   - sched:
      - cls_fw: fix improper refcount update leads to use-after-free
      - sch_qfq: account for stab overhead in qfq_enqueue

   - netfilter:
      - report use refcount overflow
      - prevent OOB access in nft_byteorder_eval

   - wifi: mt7921e: fix init command fail with enabled device

   - eth: ocelot: fix oversize frame dropping for preemptible TCs

   - eth: fec: recycle pages for transmitted XDP frames"

* tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  selftests: tc-testing: add test for qfq with stab overhead
  net/sched: sch_qfq: account for stab overhead in qfq_enqueue
  selftests: tc-testing: add tests for qfq mtu sanity check
  net/sched: sch_qfq: reintroduce lmax bound check for MTU
  wifi: cfg80211: fix receiving mesh packets without RFC1042 header
  wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()
  net: txgbe: fix eeprom calculation error
  net/sched: make psched_mtu() RTNL-less safe
  net: ena: fix shift-out-of-bounds in exponential backoff
  netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()
  net/sched: flower: Ensure both minimum and maximum ports are specified
  MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER
  docs: netdev: update the URL of the status page
  wifi: iwlwifi: remove 'use_tfh' config to fix crash
  xdp: use trusted arguments in XDP hints kfuncs
  bpf: cpumap: Fix memory leak in cpu_map_update_elem
  wifi: airo: avoid uninitialized warning in airo_get_rate()
  octeontx2-pf: Add additional check for MCAM rules
  net: dsa: Removed unneeded of_node_put in felix_parse_ports_node
  net: fec: use netdev_err_once() instead of netdev_err()
  ...
2023-07-13 14:21:22 -07:00
Linus Torvalds ebc27aacee Tracing fixes and clean ups:
- Fix some missing-prototype warnings
 
 - Fix user events struct args (did not include size of struct)
   When creating a user event, the "struct" keyword is to denote
   that the size of the field will be passed in. But the parsing
   failed to handle this case.
 
 - Add selftest to struct sizes for user events
 
 - Fix sample code for direct trampolines.
   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code
   was not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.
 
 - Remove unused ftrace_regs_caller_ret() prototype.
 
 - Quiet false positive of FORTIFY_SOURCE
   Due to backward compatibility, the structure used to save stack traces
   in the kernel had a fixed size of 8. This structure is exported to
   user space via the tracing format file. A change was made to allow
   more than 8 functions to be recorded, and user space now uses the
   size field to know how many functions are actually in the stack.
   But the structure still has size of 8 (even though it points into
   the ring buffer that has the required amount allocated to hold a
   full stack. This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it. Hide this by using a pointer
   to the stack location on the ring buffer instead of using the address
   of the entry structure caller field.
 
 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to
   read the data, but the read code uses rb_num_of_entries() that
   returned zero, and causing a infinite "retry".
 
 - Fix a warning caused by not using all pages allocated to store
   ftrace functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed
   to be off.
 
 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it.
 
 - Fix memory leak in freeing iter->temp in tracing_release_pipe()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZLBF6hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkswAP4mhdoFFfNosM7+Sh/R4t31IxKZApm9
 M2Hf9jgvJ7b65AD/VV1XfO6skw2+5Yn9S4UyNE2MQaYxPwWpONcNFUzZ3Q8=
 =Nb+7
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix some missing-prototype warnings

 - Fix user events struct args (did not include size of struct)

   When creating a user event, the "struct" keyword is to denote that
   the size of the field will be passed in. But the parsing failed to
   handle this case.

 - Add selftest to struct sizes for user events

 - Fix sample code for direct trampolines.

   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code was
   not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.

 - Remove unused ftrace_regs_caller_ret() prototype.

 - Quiet false positive of FORTIFY_SOURCE

   Due to backward compatibility, the structure used to save stack
   traces in the kernel had a fixed size of 8. This structure is
   exported to user space via the tracing format file. A change was made
   to allow more than 8 functions to be recorded, and user space now
   uses the size field to know how many functions are actually in the
   stack.

   But the structure still has size of 8 (even though it points into the
   ring buffer that has the required amount allocated to hold a full
   stack.

   This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it.

   Hide this by using a pointer to the stack location on the ring buffer
   instead of using the address of the entry structure caller field.

 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to read
   the data, but the read code uses rb_num_of_entries() that returned
   zero, and causing a infinite "retry".

 - Fix a warning caused by not using all pages allocated to store ftrace
   functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed to be
   off.

 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it

 - Fix memory leak in freeing iter->temp in tracing_release_pipe()

* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix memory leak of iter->temp when reading trace_pipe
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tracing: arm64: Avoid missing-prototype warnings
  selftests/user_events: Test struct size match cases
  tracing/user_events: Fix struct arg size match check
  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
  arm64: ftrace: Add direct call trampoline samples support
  samples: ftrace: Save required argument registers in sample trampolines
2023-07-13 13:44:28 -07:00
Masami Hiramatsu (Google) 4ed8f337de Revert "tracing: Add "(fault)" name injection to kernel probes"
This reverts commit 2e9906f84f.

It was turned out that commit 2e9906f84f ("tracing: Add "(fault)"
name injection to kernel probes") did not work correctly and probe
events still show just '(fault)' (instead of '"(fault)"'). Also,
current '(fault)' is more explicit that it faulted.

This also moves FAULT_STRING macro to trace.h so that synthetic
event can keep using it, and uses it in trace_probe.c too.

Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/
Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/

Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:37:43 +09:00
Masami Hiramatsu (Google) e38e2c6a9e tracing/probes: Fix to update dynamic data counter if fetcharg uses it
Fix to update dynamic data counter ('dyndata') and max length ('maxlen')
only if the fetcharg uses the dynamic data. Also get out arg->dynamic
from unlikely(). This makes dynamic data address wrong if
process_fetch_insn() returns error on !arg->dynamic case.

Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/
Fixes: 9178412ddf ("tracing: probeevent: Return consumed bytes of dynamic area")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:37:00 +09:00
Masami Hiramatsu (Google) b41326b5e0 tracing/probes: Fix not to count error code to total length
Fix not to count the error code (which is minus value) to the total
used length of array, because it can mess up the return code of
process_fetch_insn_bottom(). Also clear the 'ret' value because it
will be used for calculating next data_loc entry.

Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 9b960a3883 ("tracing: probeevent: Unify fetch_insn processing common part")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:36:28 +09:00
Masami Hiramatsu (Google) 66bcf65d6c tracing/probes: Fix to avoid double count of the string length on the array
If an array is specified with the ustring or symstr, the length of the
strings are accumlated on both of 'ret' and 'total', which means the
length is double counted.
Just set the length to the 'ret' value for avoiding double counting.

Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 88903c4643 ("tracing/probe: Add ustring type for user-space string")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:35:53 +09:00
Masami Hiramatsu (Google) d5f28bb1ce fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
Add a comment the reason why fprobe_kprobe_handler() exits if any other
kprobe is running.

Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:24:00 +09:00
Zheng Yejian d5a8218963 tracing: Fix memory leak of iter->temp when reading trace_pipe
kmemleak reports:
  unreferenced object 0xffff88814d14e200 (size 256):
    comm "cat", pid 336, jiffies 4294871818 (age 779.490s)
    hex dump (first 32 bytes):
      04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00  ................
      0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff  .........Z......
    backtrace:
      [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140
      [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0
      [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0
      [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950
      [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0
      [<ffffffff9bf03a43>] vfs_read+0x143/0x520
      [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160
      [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90
      [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

when reading file 'trace_pipe', 'iter->temp' is allocated or relocated
in trace_find_next_entry() but not freed before 'trace_pipe' is closed.

To fix it, free 'iter->temp' in tracing_release_pipe().

Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: ff895103a8 ("tracing: Save off entry when peeking at next entry")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13 10:48:36 -04:00
Vincent Guittot 7ee7642c91 sched/fair: Stabilize asym cpu capacity system idle cpu selection
select_idle_capacity() not only looks for an idle cpu that fits for the
waking task but also for cpu with highest bandwidth when no cpu fits.
Start the loop with target cpu so it will be selected 1st when no cpu fits
but several cpus shared the same bandwidth. Starting with target cpu
prevents the task to migrate between cpus with same bandwidth at every
wakeup when no cpu fits.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230711081359.868862-1-vincent.guittot@linaro.org
2023-07-13 15:21:53 +02:00
Peter Zijlstra ed74cc4995 sched/debug: Dump domains' sched group flags
There have been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain were not set and propagated properly when a degenerate
domain is removed.

Add dump of domain sched group flags of a CPU to make debug easier
in the future.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:53 +02:00
Ricardo Neri b1bfeab9b0 sched/fair: Consider the idle state of the whole core for load balance
should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
cpus) starting from lower numbered CPUs looking for the first idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
non-SMT cores:

	[0, 1] [2, 3] [4, 5] 6 7 8 9
         b  i   b  i   b  i  b i i i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to idle
load balance as it has more CPU capacity than a CPU on an idle SMT
CPU with busy sibling.  So for the figure above, if we are running
should_we_balance() to CPU 1, we should return false to let CPU 7 on
idle core to have a chance first to idle load balance.

A partially busy (i.e., of type group_has_spare) local group with SMT 
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered, SMT cores should not
be considered when finding the first idle CPU. 

However, in should_we_balance(), when we encounter idle SMT first in partially
busy core, we prematurely break the search for the first idle CPU.

Higher-numbered, non-SMT cores is not given the chance to have
idle balance done on their behalf. Those CPUs will only be considered
for idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:52 +02:00
Tim C Chen 7ff1693236 sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the tasks to be moved is simply the difference in
number of tasks between the two groups (i.e. imbalance) divided by two.

However, we may have different number of cores between the cluster groups,
say when we take CPU offline or we have hybrid groups.  In that case,
we should balance between the two groups such that #tasks/#cores ratio
is the same between the same between both groups.  Hence the imbalance
computed will need to reflect this.

Adjust the sibling imbalance computation to take into account of the
above considerations.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:52 +02:00
Tim C Chen d24cb0d911 sched/topology: Record number of cores in sched group
When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the
number of cores in each domain. In preparation of implementing such a
policy, record the number of cores in a scheduling group.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:51 +02:00
Tim C Chen fee1759e4f sched/fair: Determine active load balance for SMT sched groups
On hybrid CPUs with scheduling cluster enabled, we will need to
consider balancing between SMT CPU cluster, and Atom core cluster.

Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
Each scheduling cluster span a L2 cache.

          --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
          [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
          Big    Big    Big    Big    Atom       Atom
          core   core   core   core   Module     Module

If the busiest group is a big core with both SMT CPUs busy, we should
active load balance if destination group has idle CPU cores.  Such
condition is considered by asym_active_balance() in load balancing but not
considered when looking for busiest group and computing load imbalance.
Add this consideration in find_busiest_group() and calculate_imbalance().

In addition, update the logic determining the busier group when one group
is SMT and the other group is non SMT but both groups are partially busy
with idle CPU. The busier group should be the group with idle cores rather
than the group with one busy SMT CPU.  We do not want to make the SMT group
the busiest one to pull the only task off SMT CPU and causing the whole core to
go empty.

Otherwise suppose in the search for the busiest group, we first encounter
an SMT group with 1 task and set it as the busiest.  The destination
group is an atom cluster with 1 task and we next encounter an atom
cluster group with 3 tasks, we will not pick this atom cluster over the
SMT group, even though we should.  As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local atom cluster
(with 1 task).  And it doesn't make sense to pick the 1 task SMT group
as the busier group as we also should not pull task off the SMT towards
the 1 task atom cluster and make the SMT core completely empty.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:51 +02:00
Miaohe Lin 35cd21f629 sched/psi: make psi_cgroups_enabled static
The static key psi_cgroups_enabled is only used inside file psi.c.
Make it static.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20230525103428.49712-1-linmiaohe@huawei.com
2023-07-13 15:21:50 +02:00
Cruz Zhao 548796e2e7 sched/core: introduce sched_core_idle_cpu()
As core scheduling introduced, a new state of idle is defined as
force idle, running idle task but nr_running greater than zero.

If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judge the RCU boost kthread is starving.

But this will cause error in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, results
that idle_cpu() returns 0. In function tick_irq_exit(), if idle_cpu()
is 0, tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1, which became 0 in tick_nohz_irq_enter().
ts->idle_sleeptime won't update in function update_ts_time_stats(), if
ts->idle_active is 0, which should be 1. And this bug will result that
ts->idle_sleeptime is less than the actual value, and finally will
result that the idle time in /proc/stat is less than the actual value.

To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audit all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().

v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
2023-07-13 15:21:50 +02:00