Commit Graph

43026 Commits

Author SHA1 Message Date
Yonghong Song db2baf82b0 bpf: Fix an incorrect verification success with movsx insn
syzbot reports a verifier bug which triggers a runtime panic.
The test bpf program is:
   0: (62) *(u32 *)(r10 -8) = 553656332
   1: (bf) r1 = (s16)r10
   2: (07) r1 += -8
   3: (b7) r2 = 3
   4: (bd) if r2 <= r1 goto pc+0
   5: (85) call bpf_trace_printk#-138320
   6: (b7) r0 = 0
   7: (95) exit

At insn 1, the current implementation keeps 'r1' as a frame pointer,
which caused later bpf_trace_printk helper call crash since frame
pointer address is not valid any more. Note that at insn 4,
the 'pointer vs. scalar' comparison is allowed for privileged
prog run.

To fix the problem with above insn 1, the fix in the patch adopts
similar pattern to existing 'R1 = (u32) R2' handling. For unprivileged
prog run, verification will fail with 'R<num> sign-extension part of pointer'.
For privileged prog run, the dst_reg 'r1' will be marked as
an unknown scalar, so later 'bpf_trace_pointk' helper will complain
since it expected certain pointers.

Reported-by: syzbot+d61b595e9205573133b3@syzkaller.appspotmail.com
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230807175721.671696-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07 16:23:35 -07:00
Linus Torvalds 14f9643dc9 workqueue: Fixes for v6.5-rc5
Two commits:
 
 * The recently added cpu_intensive auto detection and warning mechanism was
   spuriously triggered on slow CPUs. While not causing serious issues, it's
   still a nuisance and can cause unintended concurrency management
   behaviors. Relax the threshold on machines with lower BogoMIPS. While
   BogoMIPS is not an accurate measure of performance by most measures, we
   don't have to be accurate and it has rough but strong enough correlation.
 
 * A correction in Kconfig help text.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZNFMTQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGb+4AQCniWx3rwWWmLgviPR0AfYWbcQ8/P/qGh++fmsR
 tEF3sQD/bLdeWcVa1pSzXjhGtRVGsTis6oOhk81A0zIZlx0v2Qg=
 =sThu
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - The recently added cpu_intensive auto detection and warning mechanism
   was spuriously triggered on slow CPUs.

   While not causing serious issues, it's still a nuisance and can cause
   unintended concurrency management behaviors.

   Relax the threshold on machines with lower BogoMIPS. While BogoMIPS
   is not an accurate measure of performance by most measures, we don't
   have to be accurate and it has rough but strong enough correlation.

 - A correction in Kconfig help text

* tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
  workqueue: Fix cpu_intensive_thresh_us name in help text
2023-08-07 13:07:12 -07:00
Hao Jia 0437719c1a cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
The member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.

Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.

So we add the subtree_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF/drgn, etc.
And we are still trying to determine how to expose it in the
cgroupfs interface.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:41:25 -10:00
Yang Yingliang 9680540c0c workqueue: use LIST_HEAD to initialize cull_list
Use LIST_HEAD() to initialize cull_list instead of open-coding it.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:36:51 -10:00
Miaohe Lin e7e64a1bff cgroup: clean up if condition in cgroup_pidlist_start()
There's no need to use '<=' when knowing 'l->list[mid] != pid' already.
No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07 08:30:06 -10:00
Wenyu Liu 55e2b69649 kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments
kexec_mutex is replaced by an atomic variable
in 05c6257433 (panic, kexec: make __crash_kexec() NMI safe).

But there are still two comments that referenced kexec_mutex,
replace them by kexec_lock.

Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2023-08-07 09:55:42 -04:00
Vlastimil Babka df2f7cde73 PM: hibernate: fix resume_store() return value when hibernation not available
On a laptop with hibernation set up but not actively used, and with
secure boot and lockdown enabled kernel, 6.5-rc1 gets stuck on boot with
the following repeated messages:

  A start job is running for Resume from hibernation using device /dev/system/swap (24s / no limit)
  lockdown_is_locked_down: 25311154 callbacks suppressed
  Lockdown: systemd-hiberna: hibernation is restricted; see man kernel_lockdown.7
  ...

Checking the resume code leads to commit cc89c63e2f ("PM: hibernate:
move finding the resume device out of software_resume") which
inadvertently changed the return value from resume_store() to 0 when
!hibernation_available(). This apparently translates to userspace
write() returning 0 as in number of bytes written, and userspace looping
indefinitely in the attempt to write the intended value.

Fix this by returning the full number of bytes that were to be written,
as that's what was done before the commit.

Fixes: cc89c63e2f ("PM: hibernate: move finding the resume device out of software_resume")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-07 11:41:11 +02:00
Yang Yingliang 1e8e2efb34 bpf: change bpf_alu_sign_string and bpf_movsx_string to static
The bpf_alu_sign_string and bpf_movsx_string introduced in commit
f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
are only used in disasm.c now, change them to static.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308050615.wxAn1v2J-lkp@intel.com/
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803023128.3753323-1-yangyingliang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 16:15:50 -07:00
Kui-Feng Lee 5426700e68 bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
Verify if the pointer obtained from bpf_xdp_pointer() is either an error or
NULL before returning it.

The function bpf_dynptr_slice() mistakenly returned an ERR_PTR. Instead of
solely checking for NULL, it should also verify if the pointer returned by
bpf_xdp_pointer() is an error or NULL.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/d1360219-85c3-4a03-9449-253ea905f9d1@moroto.mountain/
Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803231206.1060485-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 14:53:15 -07:00
Daniel Borkmann d210f9735e bpf: Fix mprog detachment for empty mprog entry
syzbot reported an UBSAN array-index-out-of-bounds access in bpf_mprog_read()
upon bpf_mprog_detach(). While it did not have a reproducer, I was able to
manually reproduce through an empty mprog entry which just has miniq present.

The latter is important given otherwise we get an ENOENT error as tcx detaches
the whole mprog entry. The index 4294967295 was triggered via NULL dtuple.prog
which then attempts to detach from the back. bpf_mprog_fetch() in this case
did hit the idx == total and therefore tried to grab the entry at idx -1.

Fix it by adding an explicit bpf_mprog_total() check in bpf_mprog_detach() and
bail out early with ENOENT.

Fixes: 053c8e1f23 ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+0c06ba0f831fe07a8f27@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230804131112.11012-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 09:35:39 -07:00
Li kunyu 5964d1e459 bpf: bpf_struct_ops: Remove unnecessary initial values of variables
err and tlinks is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230804175929.2867-1-kunyu@nfschina.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 17:54:35 -07:00
Miaohe Lin 7f828eacc4 cgroup: fix obsolete function name in cgroup_destroy_locked()
Since commit e76ecaeef6 ("cgroup: use cgroup_kn_lock_live() in other
cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs
methods. Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-03 14:13:33 -10:00
Jakub Kicinski d07b7b32da pull-request: bpf-next 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
 Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
 =/kVE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Jakub Kicinski 35b1b1fd96 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/dsa/port.c
  9945c1fb03 ("net: dsa: fix older DSA drivers using phylink")
  a88dd75384 ("net: dsa: remove legacy_pre_march2020 detection")
https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/

net/xdp/xsk.c
  3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
  b7f72a30e9 ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  37b61cda9c ("bnxt: don't handle XDP in netpoll")
  2b56b3d992 ("eth: bnxt: handle invalid Tx completions more gracefully")
https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c
  62da08331f ("net/mlx5e: Set proper IPsec source port in L4 selector")
  fbd517549c ("net/mlx5e: Add function to get IPsec offload namespace")

drivers/net/ethernet/sfc/selftest.c
  55c1528f9b ("sfc: fix field-spanning memcpy in selftest")
  ae9d445cd4 ("sfc: Miscellaneous comment removals")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 14:34:37 -07:00
Linus Torvalds 999f663186 Including fixes from bpf and wireless.
Nothing scary here. Feels like the first wave of regressions
 from v6.5 is addressed - one outstanding fix still to come
 in TLS for the sendpage rework.
 
 Current release - regressions:
 
  - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
 
  - dsa: fix older DSA drivers using phylink
 
 Previous releases - regressions:
 
  - gro: fix misuse of CB in udp socket lookup
 
  - mlx5: unregister devlink params in case interface is down
 
  - Revert "wifi: ath11k: Enable threaded NAPI"
 
 Previous releases - always broken:
 
  - sched: cls_u32: fix match key mis-addressing
 
  - sched: bind logic fixes for cls_fw, cls_u32 and cls_route
 
  - add bound checks to a number of places which hand-parse netlink
 
  - bpf: disable preemption in perf_event_output helpers code
 
  - qed: fix scheduling in a tasklet while getting stats
 
  - avoid using APIs which are not hardirq-safe in couple of drivers,
    when we may be in a hard IRQ (netconsole)
 
  - wifi: cfg80211: fix return value in scan logic, avoid page
    allocator warning
 
  - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY
    of MT7615D (DBDC)
 
 Misc:
 
  - drop handful of inactive maintainers, put some new in place
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S
 Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje
 8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu
 gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH
 ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6
 ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH
 TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+
 cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK
 QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ
 apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC
 5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip
 ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y=
 =HYkl
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and wireless.

  Nothing scary here. Feels like the first wave of regressions from v6.5
  is addressed - one outstanding fix still to come in TLS for the
  sendpage rework.

  Current release - regressions:

   - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES

   - dsa: fix older DSA drivers using phylink

  Previous releases - regressions:

   - gro: fix misuse of CB in udp socket lookup

   - mlx5: unregister devlink params in case interface is down

   - Revert "wifi: ath11k: Enable threaded NAPI"

  Previous releases - always broken:

   - sched: cls_u32: fix match key mis-addressing

   - sched: bind logic fixes for cls_fw, cls_u32 and cls_route

   - add bound checks to a number of places which hand-parse netlink

   - bpf: disable preemption in perf_event_output helpers code

   - qed: fix scheduling in a tasklet while getting stats

   - avoid using APIs which are not hardirq-safe in couple of drivers,
     when we may be in a hard IRQ (netconsole)

   - wifi: cfg80211: fix return value in scan logic, avoid page
     allocator warning

   - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
     (DBDC)

  Misc:

   - drop handful of inactive maintainers, put some new in place"

* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
  MAINTAINERS: update TUN/TAP maintainers
  test/vsock: remove vsock_perf executable on `make clean`
  tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
  tcp_metrics: annotate data-races around tm->tcpm_net
  tcp_metrics: annotate data-races around tm->tcpm_vals[]
  tcp_metrics: annotate data-races around tm->tcpm_lock
  tcp_metrics: annotate data-races around tm->tcpm_stamp
  tcp_metrics: fix addr_same() helper
  prestera: fix fallback to previous version on same major version
  udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
  net/mlx5e: Set proper IPsec source port in L4 selector
  net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
  net/mlx5: fs_core: Make find_closest_ft more generic
  wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
  vxlan: Fix nexthop hash size
  ip6mr: Fix skb_under_panic in ip6mr_cache_report()
  s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
  net: tap_open(): set sk_uid from current_fsuid()
  net: tun_chr_open(): set sk_uid from current_fsuid()
  net: dcb: choose correct policy to parse DCB_ATTR_BCN
  ...
2023-08-03 14:00:02 -07:00
James Morse 2abcc4b5a6 module: Expose module_init_layout_section()
module_init_layout_section() choses whether the core module loader
considers a section as init or not. This affects the placement of the
exit section when module unloading is disabled. This code will never run,
so it can be free()d once the module has been initialised.

arm and arm64 need to count the number of PLTs they need before applying
relocations based on the section name. The init PLTs are stored separately
so they can be free()d. arm and arm64 both use within_module_init() to
decide which list of PLTs to use when applying the relocation.

Because within_module_init()'s behaviour changes when module unloading
is disabled, both architecture would need to take this into account when
counting the PLTs.

Today neither architecture does this, meaning when module unloading is
disabled there are insufficient PLTs in the init section to load some
modules, resulting in warnings:
| WARNING: CPU: 2 PID: 51 at arch/arm64/kernel/module-plts.c:99 module_emit_plt_entry+0x184/0x1cc
| Modules linked in: crct10dif_common
| CPU: 2 PID: 51 Comm: modprobe Not tainted 6.5.0-rc4-yocto-standard-dirty #15208
| Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
| pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : module_emit_plt_entry+0x184/0x1cc
| lr : module_emit_plt_entry+0x94/0x1cc
| sp : ffffffc0803bba60
[...]
| Call trace:
|  module_emit_plt_entry+0x184/0x1cc
|  apply_relocate_add+0x2bc/0x8e4
|  load_module+0xe34/0x1bd4
|  init_module_from_file+0x84/0xc0
|  __arm64_sys_finit_module+0x1b8/0x27c
|  invoke_syscall.constprop.0+0x5c/0x104
|  do_el0_svc+0x58/0x160
|  el0_svc+0x38/0x110
|  el0t_64_sync_handler+0xc0/0xc4
|  el0t_64_sync+0x190/0x194

Instead of duplicating module_init_layout_section()s logic, expose it.

Reported-by: Adam Johnston <adam.johnston@arm.com>
Fixes: 055f23b74b ("module: check for exit sections in layout_sections() instead of module_init_section()")
Cc: stable@vger.kernel.org
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-03 13:42:02 -07:00
Jakub Kicinski 3932f22723 pull-request: bpf 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW
 48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf
 61d6hoiXj/sIibgMQT/ihODgeJ4wfQE=
 =u7qn
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Martin KaFai Lau says:

====================
pull-request: bpf 2023-08-03

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).

The main changes are:

1) Disable preemption in perf_event_output helpers code,
   from Jiri Olsa

2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
   from Lin Ma

3) Multiple warning splat fixes in cpumap from Hou Tao

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, cpumap: Handle skb as well when clean up ptr_ring
  bpf, cpumap: Make sure kthread is running before map update returns
  bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
  bpf: Disable preemption in bpf_event_output
  bpf: Disable preemption in bpf_perf_event_output
====================

Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 11:22:53 -07:00
Jakub Kicinski 680ee0456a net: invert the netdevice.h vs xdp.h dependency
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig builds size when
xdp.h is touched from 5947 to 662 objects.

Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.

The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since its embedded in struct netdevice.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Arnd Bergmann 8874a414f8 x86/qspinlock-paravirt: Fix missing-prototype warning
__pv_queued_spin_unlock_slowpath() is defined in a header file as
a global function, and designed to be called from inline asm, but
there is no prototype visible in the definition:

  kernel/locking/qspinlock_paravirt.h:493:1: error: no previous \
    prototype for '__pv_queued_spin_unlock_slowpath' [-Werror=missing-prototypes]

Add this to the x86 header that contains the inline asm calling it,
and ensure this gets included before the definition, rather than
after it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230803082619.1369127-8-arnd@kernel.org
2023-08-03 17:15:05 +02:00
Rick Edgecombe c35559f94e x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
setup its own shadow stacks using the WRSS instruction. Towards this
provide a flag so that stacks can be optionally setup securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Arnd Bergmann 6a5a148aaf bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.

Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:

kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]

On 32-bit architectures, the local prototype

u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)

passes arguments in other registers as the one in bpf_trace.c

BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
            const void *, unsafe_ptr)

which uses 64-bit arguments in pairs of registers.

As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-02 14:14:17 -07:00
Miaohe Lin a2c15fece4 cgroup: fix obsolete function name above css_free_rwork_fn()
Since commit 8f36aaec9c ("cgroup: Use rcu_work instead of explicit rcu
and work item"), css_free_work_fn has been renamed to css_free_rwork_fn.
Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:37:59 -10:00
Cai Xinchen 05f76ae95e cgroup/cpuset: fix kernel-doc
Add kernel-doc of param @rotor to fix warnings:

kernel/cgroup/cpuset.c:4162: warning: Function parameter or member
'rotor' not described in 'cpuset_spread_node'
kernel/cgroup/cpuset.c:3771: warning: Function parameter or member
'work' not described in 'cpuset_hotplug_workfn'

Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:37:03 -10:00
Kamalesh Babulal 55a5956a55 cgroup: clean up printk()
Convert the only printk() to use pr_*() helper. No functional change.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02 09:36:24 -10:00
Christoph Hellwig 9011e49d54 modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
It has recently come to my attention that nvidia is circumventing the
protection added in 262e6ae708 ("modules: inherit
TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
modules into an allegedly GPL licensed module and then rexporting them.

Given that symbol_get was only ever intended for tightly cooperating
modules using very internal symbols it is logical to restrict it to
being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
Circumvention of Access Controls law suites.

All symbols except for four used through symbol_get were already exported
as EXPORT_SYMBOL_GPL, and the remaining four ones were switched over in
the preparation patches.

Fixes: 262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-02 11:18:22 -07:00
Phil Auld 88c56cfeae sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together.  Tasks
can easily run well past their quotas before a remote tick does
accounting.  This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.

Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.

Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.

Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Phil Auld c98c18270b sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Yauheni Kaliuta d3c4db86c7 tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf tracepoint program uses struct trace_event_raw_sys_enter as
argument where trace_entry is the first field. Use the same instead
of unsigned long long since if it's amended (for example by RT
patch) it accesses data with wrong offset.

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801075222.7717-1-ykaliuta@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-01 10:53:28 -07:00
Petr Tesarik 1395706a14 swiotlb: search the software IO TLB only if the device makes use of it
Skip searching the software IO TLB if a device has never used it, making
sure these devices are not affected by the introduction of multiple IO TLB
memory pools.

Additional memory barrier is required to ensure that the new value of the
flag is visible to other CPUs after mapping a new bounce buffer. For
efficiency, the flag check should be inlined, and then the memory barrier
must be moved to is_swiotlb_buffer(). However, it can replace the existing
barrier in swiotlb_find_pool(), because all callers use is_swiotlb_buffer()
first to verify that the buffer address belongs to the software IO TLB.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:32 +02:00
Petr Tesarik 1aaa736815 swiotlb: allocate a new memory pool when existing pools are full
When swiotlb_find_slots() cannot find suitable slots, schedule the
allocation of a new memory pool. It is not possible to allocate the pool
immediately, because this code may run in interrupt context, which is not
suitable for large memory allocations. This means that the memory pool will
be available too late for the currently requested mapping, but the stress
on the software IO TLB allocator is likely to continue, and subsequent
allocations will benefit from the additional pool eventually.

Keep all memory pools for an allocator in an RCU list to avoid locking on
the read side. For modifications, add a new spinlock to struct io_tlb_mem.

The spinlock also protects updates to the total number of slabs (nslabs in
struct io_tlb_mem), but not reads of the value. Readers may therefore
encounter a stale value, but this is not an issue:

- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
  value. This is ensured by the existence of the default memory pool,
  allocated at boot.

- The exact value is used only for non-critical purposes (debugfs, kernel
  messages).

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:27 +02:00
Petr Tesarik ad96ce3252 swiotlb: determine potential physical address limit
The value returned by default_swiotlb_limit() should be constant, because
it is used to decide whether DMA can be used. To allow allocating memory
pools on the fly, use the maximum possible physical address rather than the
highest address used by the default pool.

For swiotlb_init_remap(), this is either an arch-specific limit used by
memblock_alloc_low(), or the highest directly mapped physical address if
the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the
highest address is determined by the GFP flags.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:24 +02:00
Petr Tesarik 79636caad3 swiotlb: if swiotlb is full, fall back to a transient memory pool
Try to allocate a transient memory pool if no suitable slots can be found
and the respective SWIOTLB is allowed to grow. The transient pool is just
enough big for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.

Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.

Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after a RCU quiescent state.

Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_alloc_pool() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.

The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.

Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:20 +02:00
Petr Tesarik 62708b2ba4 swiotlb: add a flag whether SWIOTLB is allowed to grow
Add a config option (CONFIG_SWIOTLB_DYNAMIC) to enable or disable dynamic
allocation of additional bounce buffers.

If this option is set, mark the default SWIOTLB as able to grow and
restricted DMA pools as unable.

However, if the address of the default memory pool is explicitly queried,
make the default SWIOTLB also unable to grow. This is currently used to set
up PCI BAR movable regions on some Octeon MIPS boards which may not be able
to use a SWIOTLB pool elsewhere in physical memory. See octeon_pci_setup()
for more details.

If a remap function is specified, it must be also called on any dynamically
allocated pools, but there are some issues:

- The remap function may block, so it should not be called from an atomic
  context.
- There is no corresponding unremap() function if the memory pool is
  freed.
- The only in-tree implementation (xen_swiotlb_fixup) requires that the
  number of slots in the memory pool is a multiple of SWIOTLB_SEGSIZE.

Keep it simple for now and disable growing the SWIOTLB if a remap function
was specified.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:17 +02:00
Petr Tesarik 158dbe9c9a swiotlb: separate memory pool data from other allocator data
Carve out memory pool specific fields from struct io_tlb_mem. The original
struct now contains shared data for the whole allocator, while the new
struct io_tlb_pool contains data that is specific to one memory pool of
(potentially) many.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:14 +02:00
Petr Tesarik fea18777a7 swiotlb: add documentation and rename swiotlb_do_find_slots()
Add some kernel-doc comments and move the existing documentation of struct
io_tlb_slot to its correct location. The latter was forgotten in commit
942a8186eb ("swiotlb: move struct io_tlb_slot to swiotlb.c").

Use the opportunity to give swiotlb_do_find_slots() a more descriptive name
and make it clear how it differs from swiotlb_find_slots().

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:12 +02:00
Petr Tesarik 05ee774122 swiotlb: make io_tlb_default_mem local to swiotlb.c
SWIOTLB implementation details should not be exposed to the rest of the
kernel. This will allow to make changes to the implementation without
modifying non-swiotlb code.

To avoid breaking existing users, provide helper functions for the few
required fields.

As a bonus, using a helper function to initialize struct device allows to
get rid of an #ifdef in driver core.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:09 +02:00
Petr Tesarik 0c6874a6ac swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated
If swiotlb is allocated, immediately return 0, so callers do not have to
check io_tlb_default_mem.nslabs explicitly.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01 18:02:02 +02:00
Hou Tao 1ea66e89f6 bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(),
so just remove dtab from bpf_dtab_netdev.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Hou Tao 2d20bfc315 bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
Since commit cdfafe98ca ("xdp: Make cpumap flush_list common for all
map instances"), cmap is no longer used, so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Yonghong Song e99688eba2 bpf: Fix an array-index-out-of-bounds issue in disasm.c
syzbot reported an array-index-out-of-bounds when printing out bpf
insns. Further investigation shows the insn is illegal but
is printed out due to log level 1 or 2 before actual insn verification
in do_check().

This particular illegal insn is a MOVSX insn with offset value 2.
The legal offset value for MOVSX should be 8, 16 and 32.
The disasm sign-extension-size array index is calculated as
 (insn->off / 8) - 1
and offset value 2 gives an out-of-bound index -1.

Tighten the checking for MOVSX insn in disasm.c to avoid
array-index-out-of-bounds issue.

Reported-by: syzbot+3758842a6c01012aa73b@syzkaller.appspotmail.com
Fixes: f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230731204534.1975311-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31 17:35:02 -07:00
Paul E. McKenney cb88f7f51b rcu-tasks: Permit use of debug-objects with RCU Tasks flavors
Currently, cblist_init_generic() holds a raw spinlock when invoking
INIT_WORK().  This fails in kernels built with CONFIG_DEBUG_OBJECTS=y
due to memory allocation being forbidden while holding a raw spinlock.
But the only reason for holding the raw spinlock is to synchronize
with early boot calls to call_rcu_tasks(), call_rcu_tasks_rude, and,
last but not least, call_rcu_tasks_trace().  These calls also invoke
cblist_init_generic() in order to support early boot queueing of
callbacks.

Except that there are no early boot calls to either of these three
functions, and the BPF guys confirm that they have no plans to add any
such calls.

This commit therefore removes the synchronization and adds a
WARN_ON_ONCE() to catch the case of now-prohibited early boot RCU Tasks
callback queueing.

If early boot queueing is needed, an "initialized" flag may be added to
the rcu_tasks structure.  Then queueing a callback before this flag is set
would initialize the callback list (if needed) and queue the callback.
The decision as to where to queue the callback given the possibility of
non-zero boot CPUs is left as an exercise for the reader.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-31 16:06:57 -07:00
Hou Tao 7c62b75cd1 bpf, cpumap: Handle skb as well when clean up ptr_ring
The following warning was reported when running xdp_redirect_cpu with
both skb-mode and stress-mode enabled:

  ------------[ cut here ]------------
  Incorrect XDP memory type (-2128176192) usage
  WARNING: CPU: 7 PID: 1442 at net/core/xdp.c:405
  Modules linked in:
  CPU: 7 PID: 1442 Comm: kworker/7:0 Tainted: G  6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events __cpu_map_entry_free
  RIP: 0010:__xdp_return+0x1e4/0x4a0
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ? __xdp_return+0x1e4/0x4a0
   ......
   xdp_return_frame+0x4d/0x150
   __cpu_map_entry_free+0xf9/0x230
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The reason for the warning is twofold. One is due to the kthread
cpu_map_kthread_run() is stopped prematurely. Another one is
__cpu_map_ring_cleanup() doesn't handle skb mode and treats skbs in
ptr_ring as XDP frames.

Prematurely-stopped kthread will be fixed by the preceding patch and
ptr_ring will be empty when __cpu_map_ring_cleanup() is called. But
as the comments in __cpu_map_ring_cleanup() said, handling and freeing
skbs in ptr_ring as well to "catch any broken behaviour gracefully".

Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Hou Tao 640a604585 bpf, cpumap: Make sure kthread is running before map update returns
The following warning was reported when running stress-mode enabled
xdp_redirect_cpu with some RT threads:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 65 at kernel/bpf/cpumap.c:135
  CPU: 4 PID: 65 Comm: kworker/4:1 Not tainted 6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events cpu_map_kthread_stop
  RIP: 0010:put_cpu_map_entry+0xda/0x220
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ......
   ? put_cpu_map_entry+0xda/0x220
   cpu_map_kthread_stop+0x41/0x60
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The root cause is the same as commit 4369016497 ("bpf: cpumap: Fix memory
leak in cpu_map_update_elem"). The kthread is stopped prematurely by
kthread_stop() in cpu_map_kthread_stop(), and kthread() doesn't call
cpu_map_kthread_run() at all but XDP program has already queued some
frames or skbs into ptr_ring. So when __cpu_map_ring_cleanup() checks
the ptr_ring, it will find it was not emptied and report a warning.

An alternative fix is to use __cpu_map_ring_cleanup() to drop these
pending frames or skbs when kthread_stop() returns -EINTR, but it may
confuse the user, because these frames or skbs have been handled
correctly by XDP program. So instead of dropping these frames or skbs,
just make sure the per-cpu kthread is running before
__cpu_map_entry_alloc() returns.

After apply the fix, the error handle for kthread_stop() will be
unnecessary because it will always return 0, so just remove it.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Martin KaFai Lau 079082c60a tcx: Fix splat during dev unregister
During unregister_netdevice_many_notify(), the ordering of our concerned
function calls is like this:

  unregister_netdevice_many_notify
    dev_shutdown
	qdisc_put
            clsact_destroy
    tcx_uninstall

The syzbot reproducer triggered a case that the qdisc refcnt is not
zero during dev_shutdown().

tcx_uninstall() will then WARN_ON_ONCE(tcx_entry(entry)->miniq_active)
because the miniq is still active and the entry should not be freed.
The latter assumed that qdisc destruction happens before tcx teardown.

This fix is to avoid tcx_uninstall() doing tcx_entry_free() when the
miniq is still alive and let the clsact_destroy() do the free later, so
that we do not assume any specific ordering for either of them.

If still active, tcx_uninstall() does clear the entry when flushing out
the prog/link. clsact_destroy() will then notice the "!tcx_entry_is_active()"
and then does the tcx_entry_free() eventually.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/222255fe07cb58f15ee662e7ee78328af5b438e4.1690549248.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 14:44:02 -07:00
Ajay Kaher 27152bceea eventfs: Move tracing/events to eventfs
Up until now, /sys/kernel/tracing/events was no different than any other
part of tracefs. The files and directories within the events directory was
created when the tracefs was mounted, and also created for the instances in
/sys/kernel/tracing/instances/<instance>/events. Most of these files and
directories will never be referenced. Since there are thousands of these
files and directories they spend their time wasting precious memory
resources.

Move the "events" directory to the new eventfs. The eventfs will take the
meta data of the events that they represent and store that. When the files
in the events directory are referenced, the dentry and inodes to represent
them are then created. When the files are no longer referenced, they are
freed. This saves the precious memory resources that were wasted on these
seldom referenced dentries and inodes.

Running the following:

 ~# cat /proc/meminfo /proc/slabinfo  > before.out
 ~# mkdir /sys/kernel/tracing/instances/foo
 ~# cat /proc/meminfo /proc/slabinfo  > after.out

to test the changes produces the following deltas:

Before this change:

 Before after deltas for meminfo:

   MemFree: -32260
   MemAvailable: -21496
   KReclaimable: 21528
   Slab: 22440
   SReclaimable: 21528
   SUnreclaim: 912
   VmallocUsed: 16

 Before after deltas for slabinfo:

   <slab>:		<objects>	[ * <size> = <total>]

   tracefs_inode_cache:	14472		[* 1184 = 17134848]
   buffer_head:		24		[* 168 = 4032]
   hmem_inode_cache:	28		[* 1480 = 41440]
   dentry:		14450		[* 312 = 4508400]
   lsm_inode_cache:	14453		[* 32 = 462496]
   vma_lock:		11		[* 152 = 1672]
   vm_area_struct:	2		[* 184 = 368]
   trace_event_file:	1748		[* 88 = 153824]
   kmalloc-256:		1072		[* 256 = 274432]
   kmalloc-64:		2842		[* 64 = 181888]

 Total slab additions in size: 22,763,400 bytes

With this change:

 Before after deltas for meminfo:

   MemFree: -12600
   MemAvailable: -12580
   Cached: 24
   Active: 12
   Inactive: 68
   Inactive(anon): 48
   Active(file): 12
   Inactive(file): 20
   Dirty: -4
   AnonPages: 68
   KReclaimable: 12
   Slab: 1856
   SReclaimable: 12
   SUnreclaim: 1844
   KernelStack: 16
   PageTables: 36
   VmallocUsed: 16

 Before after deltas for slabinfo:

   <slab>:		<objects>	[ * <size> = <total>]

   tracefs_inode_cache:	108		[* 1184 = 127872]
   buffer_head:		24		[* 168 = 4032]
   hmem_inode_cache:	18		[* 1480 = 26640]
   dentry:		127		[* 312 = 39624]
   lsm_inode_cache:	152		[* 32 = 4864]
   vma_lock:		67		[* 152 = 10184]
   vm_area_struct:	-12		[* 184 = -2208]
   trace_event_file: 	1764		[* 96 = 169344]
   kmalloc-96:		14322		[* 96 = 1374912]
   kmalloc-64:		2814		[* 64 = 180096]
   kmalloc-32:		1103		[* 32 = 35296]
   kmalloc-16:		2308		[* 16 = 36928]
   kmalloc-8:		12800		[* 8 = 102400]

 Total slab additions in size: 2,109,984 bytes

Which is a savings of 20,653,416 bytes (20 MB) per tracing instance.

Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-31 11:55:55 -04:00
Binglei Wang 3fa6456ebe dma-contiguous: check for memory region overlap
In the process of parsing the DTS, check whether the memory region
specified by the DTS CMA node area overlaps with the kernel text
memory space reserved by memblock before calling
early_init_fdt_scan_reserved_mem.

Signed-off-by: Binglei Wang <l3b2w1@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:56 +02:00
Yajun Deng bf29bfaa54 dma-contiguous: support numa CMA for specified node
The kernel parameter 'cma_pernuma=' only supports reserving the same
size of CMA area for each node. We need to reserve different sizes of
CMA area for specified nodes if these devices belong to different nodes.

Adding another kernel parameter 'numa_cma=' to reserve CMA area for
the specified node. If we want to use one of these parameters, we need to
enable DMA_NUMA_CMA.

At the same time, print the node id in cma_declare_contiguous_nid() if
CONFIG_NUMA is enabled.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:29 +02:00
Yajun Deng 22e4a348f8 dma-contiguous: support per-numa CMA for all architectures
In the commit b7176c261c ("dma-contiguous: provide the ability to
reserve per-numa CMA"), Barry adds DMA_PERNUMA_CMA for ARM64.

But this feature is architecture independent, so support per-numa CMA
for all architectures, and enable it by default if NUMA.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:28 +02:00
Arnd Bergmann 3d6f126b15 dma-mapping: move arch_dma_set_mask() declaration to header
This function has a __weak definition and an override that is only used on
freescale powerpc chips. The powerpc definition however does not see the
declaration that is in a .c file:

arch/powerpc/kernel/dma-mask.c:7:6: error: no previous prototype for 'arch_dma_set_mask' [-Werror=missing-prototypes]

Move it into the linux/dma-map-ops.h header where the other arch_dma_* functions
are declared.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31 17:54:28 +02:00
Christoph Hellwig 42e584a985 swiotlb: unexport is_swiotlb_active
Drivers have no business looking at dma-mapping or swiotlb internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
2023-07-31 17:54:27 +02:00
Zhang Rui 52b38b7ad5 cpu/SMT: Fix cpu_smt_possible() comment
Commit e1572f1d08 ("cpu/SMT: create and export cpu_smt_possible()")
introduces cpu_smt_possible() to represent if SMT is theoretically
possible. It returns true when SMT is supported and not forcefully
disabled ('nosmt=force'). But the comment of it says "Returns true if
SMT is not supported of forcefully (irreversibly) disabled", which is
wrong. Fix that comment accordingly.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230728155313.44170-1-rui.zhang@intel.com
2023-07-31 17:32:44 +02:00
YueHaibing 51a5acce71 genirq: Remove unused extern declaration
commit 3795de236d ("genirq: Distangle kernel/irq/handle.c")
left behind this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230720143625.29176-1-yuehaibing@huawei.com
2023-07-31 17:27:16 +02:00
Vincent Whitchurch e2c12739cc genirq: Prevent nested thread vs synchronize_hardirq() deadlock
There is a possibility of deadlock if synchronize_hardirq() is called
when the nested threaded interrupt is active.  The following scenario
was observed on a uniprocessor PREEMPT_NONE system:

 Thread 1                      Thread 2

 handle_nested_thread()
  Set INPROGRESS
  Call ->thread_fn()
   thread_fn goes to sleep

                              free_irq()
                               __synchronize_hardirq()
                                Busy-loop forever waiting for INPROGRESS
                                to be cleared

The INPROGRESS flag is only supposed to be used for hard interrupt
handlers.  Remove the incorrect usage in the nested threaded interrupt
case and instead re-use the threads_active / wait_for_threads mechanism
to wait for nested threaded interrupts to complete.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613-genirq-nested-v3-1-ae58221143eb@axis.com
2023-07-31 17:24:22 +02:00
Peter Zijlstra 22dc02f81c Revert "sched/fair: Move unused stub functions to header"
Revert commit 7aa55f2a59 ("sched/fair: Move unused stub functions to
header"), for while it has the right Changelog, the actual patch
content a revert of the previous 4 patches:

  f7df852ad6 ("sched: Make task_vruntime_update() prototype visible")
  c0bdfd72fb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
  378be384e0 ("sched: Add schedule_user() declaration")
  d55ebae3f3 ("sched: Hide unused sched_update_scaling()")

So in effect this is a revert of a revert and re-applies those
patches.

Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-31 11:47:08 +02:00
Greg Kroah-Hartman fe3015748a Merge 6.5-rc4 into tty-next
We need the serial/tty fixes in here as well for testing and future
development.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-31 09:39:56 +02:00
Steven Rostedt (Google) ee41106a12 tracing: Require all trace events to have a TRACE_SYSTEM
The creation of the trace event directory requires that a TRACE_SYSTEM is
defined that the trace event directory is added within the system it was
defined in.

The code handled the case where a TRACE_SYSTEM was not added, and would
then add the event at the events directory. But nothing should be doing
this. This code also prevents the implementation of creating dynamic
dentrys for the eventfs system.

As this path has never been hit on correct code, remove it. If it does get
hit, issues a WARN_ON_ONCE() and return ENODEV.

Link: https://lkml.kernel.org/r/1690568452-46553-2-git-send-email-akaher@vmware.com

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:13:33 -04:00
Zheng Yejian 6d98a0f2ac tracing: Set actual size after ring buffer resize
Currently we can resize trace ringbuffer by writing a value into file
'buffer_size_kb', then by reading the file, we get the value that is
usually what we wrote. However, this value may be not actual size of
trace ring buffer because of the round up when doing resize in kernel,
and the actual size would be more useful.

Link: https://lore.kernel.org/linux-trace-kernel/20230705002705.576633-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) 6bba92881d tracing: Add free_trace_iter_content() helper function
As the trace iterator is created and used by various interfaces, the clean
up of it needs to be consistent. Create a free_trace_iter_content() helper
function that frees the content of the iterator and use that to clean it
up in all places that it is used.

Link: https://lkml.kernel.org/r/20230715141348.341887497@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) 9182b519b8 tracing: Remove unnecessary copying of tr->current_trace
The iterator allocated a descriptor to copy the current_trace. This was done
with the assumption that the function pointers might change. But this was a
false assuption, as it does not change. There's no reason to make a copy of the
current_trace and just use the pointer it points to. This removes needing to
manage freeing the descriptor. Worse yet, there's locations that the iterator
is used but does make a copy and just uses the pointer. This could cause the
actual pointer to the trace descriptor to be freed and not the allocated copy.

This is more of a clean up than a fix.

Link: https://lkml.kernel.org/r/20230715141348.135792275@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Uros Bizjak 00a8478f8f ring_buffer: Use try_cmpxchg instead of cmpxchg
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Steven Rostedt (Google) e7186af7fb tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure
For backward compatibility, older tooling expects to see the kernel_stack
event with a "caller" field that is a fixed size array of 8 addresses. The
code now supports more than 8 with an added "size" field that states the
real number of entries. But the "caller" field still just looks like a
fixed size to user space.

Since the tracing macros that create the user space format files also
creates the structures that those files represent, the kernel_stack event
structure had its "caller" field a fixed size of 8, but in reality, when
it is allocated on the ring buffer, it can hold more if the stack trace is
bigger that 8 functions. The copying of these entries was simply done with
a memcpy():

  size = nr_entries * sizeof(unsigned long);
  memcpy(entry->caller, fstack->calls, size);

The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was
larger than 8, that the memcpy() was writing more than what the structure
stated it can hold and it complained about it. This is because the
FORTIFY_SOURCE code is unaware that the amount allocated is actually
enough to hold the size. It does not expect that a fixed size field will
hold more than the fixed size.

This was originally solved by hiding the caller assignment with some
pointer arithmetic.

  ptr = ring_buffer_data();
  entry = ptr;

  ptr += offsetof(typeof(*entry), caller);
  memcpy(ptr, fstack->calls, size);

But it is considered bad form to hide from kernel hardening. Instead, make
it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro
that is specific for this one special use case. The macro will take 4
arguments: type, item, len, field (whereas the __array() macro takes just
the first three). This macro will act just like the __array() macro when
creating the code to deal with the format file that is exposed to user
space. But for the kernel, it will turn the caller field into:

  type item[] __counted_by(field);

or for this instance:

  unsigned long caller[] __counted_by(size);

Now the kernel code can expose the assignment of the caller to the
FORTIFY_SOURCE and everyone is happy!

Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-30 18:11:44 -04:00
Linus Torvalds b0b9850e7d Probe fixes for 6.5-rc3:
- probe-events: Fix to add NULL check for some BTF API calls which can
   return error code and NULL.
 
 - ftrace selftests: Fix to check fprobe and kprobe event correctly. This
   fixes a miss condition of the test command.
 
 - kprobes: Prohibit probing on the function which starts from "__cfi_"
   and "__pfx_" since those are auto generated for kernel CFI and not
   executed.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTGdH4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bmMAH/0qTHII0KYQDvrNJ40tT
 SDM8+4zOJEtnjVYq87+4EWBhpVEL3VbLRJaprjXh40lZJrCP3MglCF152p4bOhgb
 ZrjWuTAgE0N+rBhdeUJlzy3iLzl0G9dzfA+sn1XMcW+/HSPstJcjAG6wD7ROeZzL
 XCxzE+NY6Y6mYbB52DaS8Hv7g7WccaTV+KeRjokhMPt+u7/KItJ4hQb/RXtAL31S
 n4thCeVllaPBuc7m2CmKwJ9jzOg7/0qpAIUGx1Z+Khy/3YfRhG1nT93GxP8hLmad
 SH9kGps09WXF5f8FbjYglOmq7ioDbIUz3oXPQRZYPymV8A0EU+b+/8IsRog1ySd1
 BVk=
 =qKWS
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe fixes from Masami Hiramatsu:

 - probe-events: add NULL check for some BTF API calls which can return
   error code and NULL.

 - ftrace selftests: check fprobe and kprobe event correctly. This fixes
   a miss condition of the test command.

 - kprobes: do not allow probing functions that start with "__cfi_" or
   "__pfx_" since those are auto generated for kernel CFI and not
   executed.

* tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Prohibit probing on CFI preamble symbol
  selftests/ftrace: Fix to check fprobe event eneblement
  tracing/probes: Fix to add NULL check for BTF APIs
2023-07-30 11:27:22 -07:00
Linus Torvalds c959e90094 - Fix a rtmutex race condition resulting from sharing of the sort key
between the lock waiters and the PI chain tree (->pi_waiters) of
   a task by giving each tree their own sort key
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTGMGAACgkQEsHwGGHe
 VUrqNw/9F7p+/eDXP9TURup3Ms87lOlqCJg2CClTRiPF/FZcVzVbmltWewLDOWC6
 GBEa7l++lpKwIGCuzg99tDScutdlCw8mQe4knrFvrJc04zzPV3O4N0Tiw1eD8wIw
 /i6mzRaEuB5R8nxd9YK24gEsijerxhgVwKFvsCjbTV71LrY1geCJRqaqnooVnNm7
 kvy+UIuckFJL8GDsPBJNsZdTyBrs7lC+m4su5QHr7FkQmF+4iEbDkwlYqPzafhVL
 e4/FRjWHC5HRKd69j8fqyYZP8ryUlrmHFxkeAE08CxiqwdOELNHrp+i/x2GN9FTl
 wL8psVl9oVt2KT79Py0e40ZOoUWHDOugx2IGzWl13C0XFKY+QH2HR6C+gJdedPMQ
 tBtNyf8Ivukp+SSjyUXu8q9dGLJfdxtIKmbNHOuZDU121Fqg5+b04vQkzdgkMwJh
 OoYj0fDUMBa6Zfv1PQkkSRU2wMQlgGTjMiCEvUjAs3vGBFQcu+flAm2UOr8NDLaF
 0r3wNRVzk53k85R0L49oS8aU2FOcaHlpYXOevh0NYAt68EbmZctns85SJRfrwNGE
 QuykN8yDID/zjGsjSSF6L1IaQSjM1ieozPVgqP4RLIxqaM7n/BJfZqiLTn0gBrbc
 tEaoSIay6/zybKPCDeVgMsTSlX8EOoRKD/1Daog+zfYkUt19U/w=
 =97eo
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Fix a rtmutex race condition resulting from sharing of the sort key
   between the lock waiters and the PI chain tree (->pi_waiters) of a
   task by giving each tree their own sort key

* tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Fix task->pi_waiters integrity
2023-07-30 11:12:32 -07:00
Linus Torvalds b88e123cc0 Tracing fixes for 6.5:
- Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.
   If a resize shrinks the buffer it clears the read count to notify
   readers that they need to reset. But the read count is also used for
   accounting and this causes the numbers to be off. Instead, create a
   separate variable to use to notify readers to reset.
 
 - Fix the ref counts of the "soft disable" mode. The wrong value was
   used for testing if soft disable mode should be enabled or disable,
   but instead, just change the logic to do the enable and disable
   in place when the SOFT_MODE is set or cleared.
 
 - Several kernel-doc fixes
 
 - Removal of unused external declarations
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZMVd3xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgzwAQCBHA3gf30GChf0EZUdIVueA31/1n2Z
 2ZW1VeKUHQufpAEAuzbkeTdaj6bbpsT5T1Pf3zUIvpHs7kOYJWQq+75GBAI=
 =9o02
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries.

   If a resize shrinks the buffer it clears the read count to notify
   readers that they need to reset. But the read count is also used for
   accounting and this causes the numbers to be off. Instead, create a
   separate variable to use to notify readers to reset.

 - Fix the ref counts of the "soft disable" mode. The wrong value was
   used for testing if soft disable mode should be enabled or disable,
   but instead, just change the logic to do the enable and disable in
   place when the SOFT_MODE is set or cleared.

 - Several kernel-doc fixes

 - Removal of unused external declarations

* tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix warning in trace_buffered_event_disable()
  ftrace: Remove unused extern declarations
  tracing: Fix kernel-doc warnings in trace_seq.c
  tracing: Fix kernel-doc warnings in trace_events_trigger.c
  tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c
  ring-buffer: Fix kernel-doc warnings in ring_buffer.c
  ring-buffer: Fix wrong stat of cpu_buffer->read
2023-07-29 20:40:43 -07:00
Masami Hiramatsu (Google) de02f2ac5d kprobes: Prohibit probing on CFI preamble symbol
Do not allow to probe on "__cfi_" or "__pfx_" started symbol, because those
are used for CFI and not executed. Probing it will break the CFI.

Link: https://lore.kernel.org/all/168904024679.116016.18089228029322008512.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-29 23:32:26 +09:00
Zheng Yejian dea499781a tracing: Fix warning in trace_buffered_event_disable()
Warning happened in trace_buffered_event_disable() at
  WARN_ON_ONCE(!trace_buffered_event_ref)

  Call Trace:
   ? __warn+0xa5/0x1b0
   ? trace_buffered_event_disable+0x189/0x1b0
   __ftrace_event_enable_disable+0x19e/0x3e0
   free_probe_data+0x3b/0xa0
   unregister_ftrace_function_probe_func+0x6b8/0x800
   event_enable_func+0x2f0/0x3d0
   ftrace_process_regex.isra.0+0x12d/0x1b0
   ftrace_filter_write+0xe6/0x140
   vfs_write+0x1c9/0x6f0
   [...]

The cause of the warning is in __ftrace_event_enable_disable(),
trace_buffered_event_enable() was called once while
trace_buffered_event_disable() was called twice.
Reproduction script show as below, for analysis, see the comments:
 ```
 #!/bin/bash

 cd /sys/kernel/tracing/

 # 1. Register a 'disable_event' command, then:
 #    1) SOFT_DISABLED_BIT was set;
 #    2) trace_buffered_event_enable() was called first time;
 echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \
     set_ftrace_filter

 # 2. Enable the event registered, then:
 #    1) SOFT_DISABLED_BIT was cleared;
 #    2) trace_buffered_event_disable() was called first time;
 echo 1 > events/initcall/initcall_finish/enable

 # 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was
 #    set again!!!
 cat /proc/cmdline

 # 4. Unregister the 'disable_event' command, then:
 #    1) SOFT_DISABLED_BIT was cleared again;
 #    2) trace_buffered_event_disable() was called second time!!!
 echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \
     set_ftrace_filter
 ```

To fix it, IIUC, we can change to call trace_buffered_event_enable() at
fist time soft-mode enabled, and call trace_buffered_event_disable() at
last time soft-mode disabled.

Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 20:29:51 -04:00
Gaosheng Cui 6c95d71bad tracing: Fix kernel-doc warnings in trace_seq.c
Fix kernel-doc warning:

kernel/trace/trace_seq.c:142: warning: Function parameter or member
'args' not described in 'trace_seq_vprintf'

Link: https://lkml.kernel.org/r/20230724140827.1023266-5-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:04 -04:00
Gaosheng Cui bd7217f30c tracing: Fix kernel-doc warnings in trace_events_trigger.c
Fix kernel-doc warnings:

kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'buffer' not described in 'event_triggers_call'
kernel/trace/trace_events_trigger.c:59: warning: Function parameter
or member 'event' not described in 'event_triggers_call'

Link: https://lkml.kernel.org/r/20230724140827.1023266-4-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Gaosheng Cui b32c789f7d tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c
Fix kernel-doc warning:

kernel/trace/trace_events_synth.c:1257: warning: Function parameter
or member 'mod' not described in 'synth_event_gen_cmd_array_start'

Link: https://lkml.kernel.org/r/20230724140827.1023266-3-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Gaosheng Cui 151e34d1c6 ring-buffer: Fix kernel-doc warnings in ring_buffer.c
Fix kernel-doc warnings:

kernel/trace/ring_buffer.c:954: warning: Function parameter or
member 'cpu' not described in 'ring_buffer_wake_waiters'
kernel/trace/ring_buffer.c:3383: warning: Excess function parameter
'event' description in 'ring_buffer_unlock_commit'
kernel/trace/ring_buffer.c:5359: warning: Excess function parameter
'cpu' description in 'ring_buffer_reset_online_cpus'

Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Zheng Yejian 2d093282b0 ring-buffer: Fix wrong stat of cpu_buffer->read
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set
to 0 in order to make sure any read iterators reset themselves. However,
this will mess 'entries' stating, see following steps:

  # cd /sys/kernel/tracing/
  # 1. Enlarge ring buffer prepare for later reducing:
  # echo 20 > per_cpu/cpu0/buffer_size_kb
  # 2. Write a log into ring buffer of cpu0:
  # taskset -c 0 echo "hello1" > trace_marker
  # 3. Read the log:
  # cat per_cpu/cpu0/trace_pipe
       <...>-332     [000] .....    62.406844: tracing_mark_write: hello1
  # 4. Stop reading and see the stats, now 0 entries, and 1 event readed:
  # cat per_cpu/cpu0/stats
   entries: 0
   [...]
   read events: 1
  # 5. Reduce the ring buffer
  # echo 7 > per_cpu/cpu0/buffer_size_kb
  # 6. Now entries became unexpected 1 because actually no entries!!!
  # cat per_cpu/cpu0/stats
   entries: 1
   [...]
   read events: 0

To fix it, introduce 'page_removed' field to count total removed pages
since last reset, then use it to let read iterators reset themselves
instead of changing the 'read' pointer.

Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:16:23 -04:00
YiFei Zhu d1a02358d4 bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like:
  test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
    Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.
where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:
  [   91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  [   91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G                 N 6.1.38-smp-DEV #7
  [   91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
  [   91.304721] Call Trace:
  [   91.304724]  <TASK>
  [   91.304730]  [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
  [   91.304737]  [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
  [   91.304738]  [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
  [   91.304741]  [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
  [   91.304743]  [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
  [   91.304746]  [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
  [   91.304747]  [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
  [   91.304750]  [<ffffffffa751d6e0>] map_create+0x140/0x2e0
  [   91.304752]  [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
  [   91.304753]  [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
  [   91.304754]  [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
  [   91.304756]  [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

This makes sense, because in atomic context, percpu allocation would
not create new chunks; it would only create in non-atomic contexts.
And if during prefill all precpu chunks are full, -ENOMEM would
happen immediately upon next unit_alloc.

Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM.

GFP_NOWAIT has to be used in unit_alloc when bpf program runs
in atomic context. Even if bpf program runs in non-atomic context,
in most cases, rcu read lock is enabled for the program so
GFP_NOWAIT is still needed. This is often also the case for
BPF_MAP_UPDATE_ELEM syscalls.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 09:41:10 -07:00
Yonghong Song 09fedc7318 bpf: Fix compilation warning with -Wparentheses
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:

  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5901:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s64_max >= 0 == s64_min >= 0) {
        ~~~~~~~~^~~~
  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5965:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s32_min >= 0 == s32_max >= 0) {
        ~~~~~~~~^~~~

To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.

I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 08:54:04 -07:00
Michael Ellerman 7f48405c3c cpu/SMT: Allow enabling partial SMT states via sysfs
Add support to the /sys/devices/system/cpu/smt/control interface for
enabling a specified number of SMT threads per core, including partial
SMT states where not all threads are brought online.

The current interface accepts "on" and "off", to enable either 1 or all
SMT threads per core.

This commit allows writing an integer, between 1 and the number of SMT
threads supported by the machine. Writing 1 is a synonym for "off", 2 or
more enables SMT with the specified number of threads.

When reading the file, if all threads are online "on" is returned, to
avoid changing behaviour for existing users. If some other number of
threads is online then the integer value is returned.

Architectures like x86 only supporting 1 thread or all threads, should not
define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architecture supporting partial SMT
states, like PowerPC, should define it.

[ ldufour: Slightly reword the commit's description ]
[ ldufour: Remove switch() in __store_smt_control() ]
[ ldufour: Rix build issue in control_show() ]

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-8-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman 38253464bc cpu/SMT: Create topology_smt_thread_allowed()
Some architectures allows partial SMT states, i.e. when not all SMT threads
are brought online.

To support that, add an architecture helper which checks whether a given
CPU is allowed to be brought online depending on how many SMT threads are
currently enabled. Since this is only applicable to architecture supporting
partial SMT, only these architectures should select the new configuration
variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, not
supporting the partial SMT states, there is no need to define
topology_cpu_smt_allowed(), the generic code assumed that all the threads
are allowed or only the primary ones.

Call the helper from cpu_smt_enable(), and cpu_smt_allowed() when SMT is
enabled, to check if the particular thread should be onlined. Notably,
also call it from cpu_smt_disable() if CPU_SMT_ENABLED, to allow
offlining some threads to move from a higher to lower number of threads
online.

[ ldufour: Slightly reword the commit's description ]
[ ldufour: Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-7-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Laurent Dufour 91b4a7dbfe cpu/SMT: Remove topology_smt_supported()
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(),
checking that value is enough to know whether SMT is supported.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman 447ae4ac41 cpu/SMT: Store the current/max number of threads
Some architectures allow partial SMT states at boot time, ie. when not all
SMT threads are brought online.

To support that the SMT code needs to know the maximum number of SMT
threads, and also the currently configured number.

The architecture code knows the max number of threads, so have the
architecture code pass that value to cpu_smt_set_num_threads(). Note that
although topology_max_smt_threads() exists, it is not configured early
enough to be used here. As architecture, like PowerPC, allows the threads
number to be set through the kernel command line, also pass that value.

[ ldufour: Slightly reword the commit message ]
[ ldufour: Rename cpu_smt_check_topology and add a num_threads argument ]

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-5-ldufour@linux.ibm.com
2023-07-28 09:53:37 +02:00
Michael Ellerman c53361ce7d cpu/SMT: Move smt/control simple exit cases earlier
Move the simple exit cases, i.e. those which don't depend on the value
written, earlier in the function. That makes it clearer that regardless of
the input those states cannot be transitioned out of.

That does have a user-visible effect, in that the error returned will
now always be EPERM/ENODEV for those states, regardless of the value
written. Previously writing an invalid value would return EINVAL even
when in those states.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-4-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Michael Ellerman 3f9169196b cpu/SMT: Move SMT prototypes into cpu_smt.h
In order to export the cpuhp_smt_control enum as part of the interface
between generic and architecture code, the architecture code needs to
include asm/topology.h.

But that leads to circular header dependencies. So split the enum and
related declarations into a separate header.

[ ldufour: Reworded the commit's description ]

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-3-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Laurent Dufour 7a4dcb4a5d cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
The commit 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to
CPUHP_BP_KICK_AP_STATE") introduce a dependancy against a global variable
cpu_primary_thread_mask exported by the X86 code. This variable is only
used when CONFIG_HOTPLUG_PARALLEL is set.

Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used
when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set.

No functional change.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230705145143.40545-2-ldufour@linux.ibm.com
2023-07-28 09:53:36 +02:00
Yonghong Song f835bb6222 bpf: Add kernel/bpftool asm support for new instructions
Add asm support for new instructions so kernel verifier and bpftool
xlated insn dumps can have proper asm syntax for new instructions.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:02 -07:00
Yonghong Song 4cd58e9af8 bpf: Support new 32bit offset jmp instruction
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 7058e3a31e bpf: Fix jit blinding with new sdiv/smov insns
Handle new insns properly in bpf_jit_blind_insn() function.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011225.3715812-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song ec0e2da95f bpf: Support new signed div/mod instructions.
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 0845c3db7b bpf: Support new unconditional bswap instruction
The existing 'be' and 'le' insns will do conditional bswap
depends on host endianness. This patch implements
unconditional bswap insns.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 1f1e864b65 bpf: Handle sign-extenstin ctx member accesses
Currently, if user accesses a ctx member with signed types,
the compiler will generate an unsigned load followed by
necessary left and right shifts.

With the introduction of sign-extension load, compiler may
just emit a ldsx insn instead. Let us do a final movsx sign
extension to the final unsigned ctx load result to
satisfy original sign extension requirement.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 8100928c88 bpf: Support new sign-extension mov insns
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song 1f9a1ea821 bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Jakub Kicinski 014acf2668 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 15:22:46 -07:00
Azeem Shaikh c9732f1461 perf: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703165817.2840457-1-azeemshaikh38@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-27 08:51:25 -07:00
Rae Moar a547c4ce10 kunit: time: Mark test as slow using test attributes
Mark the time KUnit test, time64_to_tm_test_date_range, as slow using test
attributes.

This test ran relatively much slower than most other KUnit tests.

By marking this test as slow, the test can now be filtered using the KUnit
test attribute filtering feature. Example: --filter "speed>slow". This will
run only the tests that have speeds faster than slow. The slow attribute
will also be outputted in KTAP.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Rae Moar <rmoar@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-07-26 13:29:35 -06:00
Linus Torvalds 5f0bc0b042 mm: suppress mm fault logging if fatal signal already pending
Commit eda0047296 ("mm: make the page fault mmap locking killable")
intentionally made it much easier to trigger the "page fault fails
because a fatal signal is pending" situation, by having the mmap locking
fail early in that case.

We have long aborted page faults in other fatal cases when the actual IO
for a page is interrupted by SIGKILL - which is particularly useful for
the traditional case of NFS hanging due to network issues, but local
filesystems could cause it too if you happened to get the SIGKILL while
waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()).

So aborting the page fault wasn't a new condition - but it now triggers
earlier, before we even get to 'handle_mm_fault()'.  And as a result the
error doesn't go through our 'fault_signal_pending()' logic, and doesn't
get filtered away there.

Normally you'd never even notice, because if a fatal signal is pending,
the new SIGSEGV we send ends up being ignored anyway.

But it turns out that there is one very noticeable exception: if you
enable 'show_unhandled_signals', the aborted page fault will be logged
in the kernel messages, and you'll get a scary line looking something
like this in your logs:

  pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)

which is rather misleading.  It's not really a segfault at all, it's
just "the thread was killed before the page fault completed, so we
aborted the page fault".

Fix this by just making it clear that a pending fatal signal means that
any new signal coming in after that is implicitly handled.  This will
avoid the misleading logging, since now the signal isn't 'unhandled' any
more.

Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: eda0047296 ("mm: make the page fault mmap locking killable")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-26 10:51:59 -07:00
Chen Yu 4efcc8bc7e sched/topology: Align group flags when removing degenerate domain
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.

The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6b ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]

Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
2023-07-26 12:28:51 +02:00
Vincent Guittot c2e164ac33 sched/fair: remove util_est boosting
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.

cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.

In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util

Fixes: 7d0583cf9e ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
2023-07-26 12:28:50 +02:00
Masami Hiramatsu (Google) 1f9f4f4777 tracing/probes: Fix to add NULL check for BTF APIs
Since find_btf_func_param() abd btf_type_by_id() can return NULL,
the caller must check the return value correctly.

Link: https://lore.kernel.org/all/169024903951.395371.11361556840733470934.stgit@devnote2/

Fixes: b576e09701 ("tracing/probes: Support function parameters if BTF is available")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-26 12:53:38 +09:00
Arnd Bergmann 63e2da3b7f bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we
actually pass an uninitialized variable into another function call
if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT
is disabled:

kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
  200 |         dec_active(c, flags);

Avoid this by passing the flags by reference, so they either get
initialized and dereferenced through a pointer, or the pointer never
gets accessed at all.

Fixes: 18e027b1c7 ("bpf: Factor out inc/dec of active flag into helpers.")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:14:18 -07:00
Jiri Olsa d62cc390c2 bpf: Disable preemption in bpf_event_output
We received report [1] of kernel crash, which is caused by
using nesting protection without disabled preemption.

The bpf_event_output can be called by programs executed by
bpf_prog_run_array_cg function that disabled migration but
keeps preemption enabled.

This can cause task to be preempted by another one inside the
nesting protection and lead eventually to two tasks using same
perf_sample_data buffer and cause crashes like:

  BUG: kernel NULL pointer dereference, address: 0000000000000001
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  ...
  ? perf_output_sample+0x12a/0x9a0
  ? finish_task_switch.isra.0+0x81/0x280
  ? perf_event_output+0x66/0xa0
  ? bpf_event_output+0x13a/0x190
  ? bpf_event_output_data+0x22/0x40
  ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
  ? xa_load+0x87/0xe0
  ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
  ? release_sock+0x3e/0x90
  ? sk_setsockopt+0x1a1/0x12f0
  ? udp_pre_connect+0x36/0x50
  ? inet_dgram_connect+0x93/0xa0
  ? __sys_connect+0xb4/0xe0
  ? udp_setsockopt+0x27/0x40
  ? __pfx_udp_push_pending_frames+0x10/0x10
  ? __sys_setsockopt+0xdf/0x1a0
  ? __x64_sys_connect+0xf/0x20
  ? do_syscall_64+0x3a/0x90
  ? entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fixing this by disabling preemption in bpf_event_output.

[1] https://github.com/cilium/cilium/issues/26756
Cc: stable@vger.kernel.org
Reported-by: Oleg "livelace" Popov <o.popov@livelace.ru>
Closes: https://github.com/cilium/cilium/issues/26756
Fixes: 2a916f2f54 ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:06:37 -07:00
Jiri Olsa f2c67a3e60 bpf: Disable preemption in bpf_perf_event_output
The nesting protection in bpf_perf_event_output relies on disabled
preemption, which is guaranteed for kprobes and tracepoints.

However bpf_perf_event_output can be also called from uprobes context
through bpf_prog_run_array_sleepable function which disables migration,
but keeps preemption enabled.

This can cause task to be preempted by another one inside the nesting
protection and lead eventually to two tasks using same perf_sample_data
buffer and cause crashes like:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  BUG: unable to handle page fault for address: ffffffff82be3eea
  ...
  Call Trace:
   ? __die+0x1f/0x70
   ? page_fault_oops+0x176/0x4d0
   ? exc_page_fault+0x132/0x230
   ? asm_exc_page_fault+0x22/0x30
   ? perf_output_sample+0x12b/0x910
   ? perf_event_output+0xd0/0x1d0
   ? bpf_perf_event_output+0x162/0x1d0
   ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
   ? __uprobe_perf_func+0x12b/0x540
   ? uprobe_dispatcher+0x2c4/0x430
   ? uprobe_notify_resume+0x2da/0xce0
   ? atomic_notifier_call_chain+0x7b/0x110
   ? exit_to_user_mode_prepare+0x13e/0x290
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_exc_int3+0x35/0x40

Fixing this by disabling preemption in bpf_perf_event_output.

Cc: stable@vger.kernel.org
Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230725084206.580930-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:05:53 -07:00
Tejun Heo aa6fde93f3 workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms which is long enough to do fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguable spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
apart by multiple levels of magnitude doesn't make whole lot of sense.

This patch scales up wq_cpu_intensive_thresh_us upto 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are sufficient number of
fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systemtically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2023-07-25 11:49:57 -10:00
Jiri Slaby bcb48185ed tty: sysrq: switch sysrq handlers from int to u8
The passed parameter to sysrq handlers is a key (a character). So change
the type from 'int' to 'u8'. Let it specifically be 'u8' for two
reasons:
* unsigned: unsigned values come from the upper layers (devices) and the
  tty layer assumes unsigned on most places, and
* 8-bit: as that what's supposed to be one day in all the layers built
  on the top of tty. (Currently, we use mostly 'unsigned char' and
  somewhere still only 'char'. (But that also translates to the former
  thanks to -funsigned-char.))

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de> # DRM
Acked-by: WANG Xuerui <git@xen0n.name> # loongarch
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20230712081811.29004-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-25 19:21:03 +02:00
Palmer Dabbelt ff09f6fd29 modpost, kallsyms: Treat add '$'-prefixed symbols as mapping symbols
Trying to restrict the '$'-prefix change to RISC-V caused some fallout,
so let's just treat all those symbols as special.

Fixes: c05780ef3c ("module: Ignore RISC-V mapping symbols too")
Link: https://lore.kernel.org/all/20230712015747.77263-1-wangkefeng.wang@huawei.com/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-07-24 12:09:47 -07:00
Jeff Layton 417d2b6b11 bpf: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-84-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:07 +02:00
Brian Geffon 005e8dddd4 PM: hibernate: don't store zero pages in the image file
On ChromeOS we've observed a considerable number of in-use pages filled with
zeros. Today with hibernate it's entirely possible that saveable pages are just
zero filled. Since we're already copying pages word-by-word in do_copy_page it
becomes almost free to determine if a page was completely filled with zeros.

This change introduces a new bitmap which will track these zero pages. If a page
is zero it will not be included in the saved image, instead to track these zero
pages in the image file we will introduce a new flag which we will set on the
packed PFN list. When reading back in the image file we will detect these zero
page PFNs and rebuild the zero page bitmap.

When the image is being loaded through calls to write_next_page if we encounter
a zero page we will silently memset it to 0 and then continue on to the next
page. Given the implementation in snapshot_read_next/snapshot_write_next this
change  will be transparent to non-compressed/compressed and swsusp modes of
operation.

To provide some concrete numbers from simple ad-hoc testing, on a device which
was lightly in use we saw that:

PM: hibernation: Image created (964408 pages copied, 548304 zero pages)

Of the approximately 6.2GB of saveable pages 2.2GB (36%) were just zero filled
and could be tracked entirely within the packed PFN list. The savings would
obviously be much lower for lzo compressed images, but even in the case of
compression not copying pages across to the compression threads will still
speed things up. It's also possible that we would see better overall compression
ratios as larger regions of "real data" would improve the compressibility.

Finally, such an approach could dramatically improve swsusp performance
as each one of those zero pages requires a write syscall to reload, by
handling it as part of the packed PFN list we're able to fully avoid
that.

Signed-off-by: Brian Geffon <bgeffon@google.com>
[ rjw: Whitespace adjustments, removal of redundant parentheses ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-24 09:51:59 +02:00
Linus Torvalds 3b4e48b800 Tracing fixes for 6.5-rc2:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
   can crash if the ring buffer is being resized. Disable swapping
   when this happens. The missed swap will be reported to the tracer.
 
 - Report error if the histogram fails to be created due to an error in
   adding a histogram variable, in event_hist_trigger_parse().
 
 - Remove unused declaration of tracing_map_set_field_descr().
 
 Chen Lin (1):
       ring-buffer: Do not swap cpu_buffer during resize process
 
 Mohamed Khalfella (1):
       tracing/histograms: Return an error if we fail to add histogram to hist_vars list
 
 YueHaibing (1):
       tracing: Remove unused extern declaration tracing_map_set_field_descr()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZL2IixQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsHAAQCS/VLpMOA5AS9JWvwuEnGAVymyJcGS
 jmnWkuMmf5fPpQD/di/xY1clLNhz6P7PAZvR3N6qw3AsNjPW/ZapDkrRWQA=
 =RoHL
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Swapping the ring buffer for snapshotting (for things like irqsoff)
   can crash if the ring buffer is being resized. Disable swapping when
   this happens. The missed swap will be reported to the tracer

 - Report error if the histogram fails to be created due to an error in
   adding a histogram variable, in event_hist_trigger_parse()

 - Remove unused declaration of tracing_map_set_field_descr()

* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/histograms: Return an error if we fail to add histogram to hist_vars list
  ring-buffer: Do not swap cpu_buffer during resize process
  tracing: Remove unused extern declaration tracing_map_set_field_descr()
2023-07-23 15:19:14 -07:00
Mohamed Khalfella 4b8b390516 tracing/histograms: Return an error if we fail to add histogram to hist_vars list
Commit 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.

Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com

Cc: stable@vger.kernel.org
Fixes: 6018b585e8 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:18:52 -04:00
Chen Lin 8a96c0288d ring-buffer: Do not swap cpu_buffer during resize process
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.

This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
         echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
         echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
        echo irqsoff > /sys/kernel/debug/tracing/current_tracer
        sleep 1
        echo nop > /sys/kernel/debug/tracing/current_tracer
        sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &

A typical oops log is as follows, sometimes with other different oops logs.

[  231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[  231.713375] Modules linked in:
[  231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  231.716750] Hardware name: linux,dummy-virt (DT)
[  231.718152] Workqueue: events update_pages_handler
[  231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  231.721171] pc : rb_update_pages+0x378/0x3f8
[  231.722212] lr : rb_update_pages+0x25c/0x3f8
[  231.723248] sp : ffff800082b9bd50
[  231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[  231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[  231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[  231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[  231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[  231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[  231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[  231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[  231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[  231.744196] Call trace:
[  231.744892]  rb_update_pages+0x378/0x3f8
[  231.745893]  update_pages_handler+0x1c/0x38
[  231.746893]  process_one_work+0x1f0/0x468
[  231.747852]  worker_thread+0x54/0x410
[  231.748737]  kthread+0x124/0x138
[  231.749549]  ret_from_fork+0x10/0x20
[  231.750434] ---[ end trace 0000000000000000 ]---
[  233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  233.721696] Mem abort info:
[  233.721935]   ESR = 0x0000000096000004
[  233.722283]   EC = 0x25: DABT (current EL), IL = 32 bits
[  233.722596]   SET = 0, FnV = 0
[  233.722805]   EA = 0, S1PTW = 0
[  233.723026]   FSC = 0x04: level 0 translation fault
[  233.723458] Data abort info:
[  233.723734]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  233.724176]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  233.724589]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[  233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  233.726720] Modules linked in:
[  233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  233.727777] Hardware name: linux,dummy-virt (DT)
[  233.728225] Workqueue: events update_pages_handler
[  233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  233.729054] pc : rb_update_pages+0x1a8/0x3f8
[  233.729334] lr : rb_update_pages+0x154/0x3f8
[  233.729592] sp : ffff800082b9bd50
[  233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[  233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[  233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[  233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[  233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[  233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[  233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[  233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[  233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[  233.734418] Call trace:
[  233.734593]  rb_update_pages+0x1a8/0x3f8
[  233.734853]  update_pages_handler+0x1c/0x38
[  233.735148]  process_one_work+0x1f0/0x468
[  233.735525]  worker_thread+0x54/0x410
[  233.735852]  kthread+0x124/0x138
[  233.736064]  ret_from_fork+0x10/0x20
[  233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[  233.736959] ---[ end trace 0000000000000000 ]---

After analysis, the seq of the error is as follows [1-5]:

int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
			int cpu_id)
{
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//1. get cpu_buffer, aka cpu_buffer(A)
		...
		...
		schedule_work_on(cpu,
		 &cpu_buffer->update_pages_work);
		//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
		// update_pages_handler, do the update process, set 'update_done' in
		// complete(&cpu_buffer->update_done) and to wakeup resize process.
	//---->
		//3. Just at this moment, ring_buffer_swap_cpu is triggered,
		//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
		//ring_buffer_swap_cpu is called as the 'Call trace' below.

		Call trace:
		 dump_backtrace+0x0/0x2f8
		 show_stack+0x18/0x28
		 dump_stack+0x12c/0x188
		 ring_buffer_swap_cpu+0x2f8/0x328
		 update_max_tr_single+0x180/0x210
		 check_critical_timing+0x2b4/0x2c8
		 tracer_hardirqs_on+0x1c0/0x200
		 trace_hardirqs_on+0xec/0x378
		 el0_svc_common+0x64/0x260
		 do_el0_svc+0x90/0xf8
		 el0_svc+0x20/0x30
		 el0_sync_handler+0xb0/0xb8
		 el0_sync+0x180/0x1c0
	//<----

	/* wait for all the updates to complete */
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
		//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
		//for example, cpu_buffer(A)->update_done will leave be set 1, and will
		//not 'wait_for_completion' at the next resize round.
		  if (!cpu_buffer->nr_pages_to_update)
			continue;

		if (cpu_online(cpu))
			wait_for_completion(&cpu_buffer->update_done);
		cpu_buffer->nr_pages_to_update = 0;
	}
	...
}
	//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
	//Continuing to run in the wrong state, then oops occurs.

Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn

Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:09:25 -04:00
YueHaibing 1faf7e4a0b tracing: Remove unused extern declaration tracing_map_set_field_descr()
Since commit 08d43a5fa0 ("tracing: Add lock-free tracing_map"),
this is never used, so can be removed.

Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:08:14 -04:00
Miaohe Lin a3fdeeb3f1 cgroup: fix obsolete comment above cgroup_create()
Since commit 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID"),
cgrp is associated with its kernfs_node. Update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21 09:54:48 -10:00
Haitao Huang 714e08cc3e cgroup/misc: Store atomic64_t reads to u64
Change 'new_usage' type to u64 so it can be compared with unsigned 'max'
and 'capacity' properly even if the value crosses the signed boundary.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21 08:10:06 -10:00
Xiu Jianfeng bf98354280 audit: correct audit_filter_inodes() definition
After changes in commit 0590b9335a ("fixing audit rule ordering mess,
part 1"), audit_filter_inodes() returns void, so if CONFIG_AUDITSYSCALL
not defined, it should be do {} while(0).

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-07-21 12:17:25 -04:00
Jakub Kicinski 59be3baa8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20 15:52:55 -07:00
Linus Torvalds 57f1f9dd3a Including fixes from BPF, netfilter, bluetooth and CAN.
Current release - regressions:
 
  - eth: r8169: multiple fixes for PCIe ASPM-related problems
 
  - vrf: fix RCU lockdep splat in output path
 
 Previous releases - regressions:
 
  - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
 
  - dsa: mv88e6xxx: do a final check before timing out when polling
 
  - nf_tables: fix sleep in atomic in nft_chain_validate
 
 Previous releases - always broken:
 
  - sched: fix undoing tcf_bind_filter() in multiple classifiers
 
  - bpf, arm64: fix BTI type used for freplace attached functions
 
  - can: gs_usb: fix time stamp counter initialization
 
  - nft_set_pipapo: fix improper element removal (leading to UAF)
 
 Misc:
 
  - net: support STP on bridge in non-root netns, STP prevents
    packet loops so not supporting it results in freezing systems
    of unsuspecting users, and in turn very upset noises being made
 
  - fix kdoc warnings
 
  - annotate various bits of TCP state to prevent data races
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmS5pp0ACgkQMUZtbf5S
 IrtudA/9Ep+URprI3tpv+VHOQMWtMd7lzz+wwEUDQSo2T6xdMcYbd1E4ZWWOPw/y
 jTIIVF3qde4nuI/MZtzGhvCD8v4bzhw10uRm4f4vhC2i+CzXr/UdOQSMqeZmJZgN
 vndixvRjHJKYxogOa+DjXgOiuQTQfuSfSnaai0kvw3zZzi4tev/Bdj6KZmFW+UK+
 Q7uQZ5n8tdE4UvUdj8Jek23SZ4kL+HtQOIdAAqyduQnYnax5L5sbep0TjuCjjkpK
 26rvmwYFJmEab4mC2T3Y7VDaXYM9M2f/EuFBMBVEohE3KPTTdT12WzLfJv7TTKTl
 hymfXgfmCXiZElzoQTJ69bFGbhqFaCJwhCUHFwYqkqj0bW9cXYJD2achpi3nVgnn
 CV8vfqJtkzdgh2bV2faG+1wmAm1wzHSURmT5NlnFaX6a6BYypaN7CERn7BnIdLM/
 YA2wud39bL0EJsic5e3gtlyJdfhtx7iqCMzE7S5FiUZvgOmUhBZ4IWkMs6Aq5PpL
 FLLgBSHGEIAdLVQGvXLjfQ/LeSrW8JsiSy6deztzR+ZflvvaBIP5y8sC3+KdxAvN
 3ybMsMEE5OK3i808aV3l6/8DLeAJ+DWuMc96Ix7Yyt2LXFnnV79DX49zJAEUWrc7
 54FnNzkgAO/Q9aEFmmQoFt5qZmoFHuNwcHBOmXARAatQqNCwDqk=
 =Xifr
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF, netfilter, bluetooth and CAN.

  Current release - regressions:

   - eth: r8169: multiple fixes for PCIe ASPM-related problems

   - vrf: fix RCU lockdep splat in output path

  Previous releases - regressions:

   - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set

   - dsa: mv88e6xxx: do a final check before timing out when polling

   - nf_tables: fix sleep in atomic in nft_chain_validate

  Previous releases - always broken:

   - sched: fix undoing tcf_bind_filter() in multiple classifiers

   - bpf, arm64: fix BTI type used for freplace attached functions

   - can: gs_usb: fix time stamp counter initialization

   - nft_set_pipapo: fix improper element removal (leading to UAF)

  Misc:

   - net: support STP on bridge in non-root netns, STP prevents packet
     loops so not supporting it results in freezing systems of
     unsuspecting users, and in turn very upset noises being made

   - fix kdoc warnings

   - annotate various bits of TCP state to prevent data races"

* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: phy: prevent stale pointer dereference in phy_init()
  tcp: annotate data-races around fastopenq.max_qlen
  tcp: annotate data-races around icsk->icsk_user_timeout
  tcp: annotate data-races around tp->notsent_lowat
  tcp: annotate data-races around rskq_defer_accept
  tcp: annotate data-races around tp->linger2
  tcp: annotate data-races around icsk->icsk_syn_retries
  tcp: annotate data-races around tp->keepalive_probes
  tcp: annotate data-races around tp->keepalive_intvl
  tcp: annotate data-races around tp->keepalive_time
  tcp: annotate data-races around tp->tsoffset
  tcp: annotate data-races around tp->tcp_tx_delay
  Bluetooth: MGMT: Use correct address for memcpy()
  Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
  Bluetooth: SCO: fix sco_conn related locking and validity issues
  Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
  Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
  Bluetooth: coredump: fix building with coredump disabled
  Bluetooth: ISO: fix iso_conn related locking and validity issues
  Bluetooth: hci_event: call disconnect callback before deleting conn
  ...
2023-07-20 14:46:39 -07:00
Xiu Jianfeng be4187faa8 audit: include security.h unconditionally
The ifdef-else logic is already in the header file, so include it
unconditionally, no functional changes here.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
[PM: fixed misspelling in the subject]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-07-20 15:06:58 -04:00
John Ogness 132a90d152 printk: Rename abandon_console_lock_in_panic() to other_cpu_in_panic()
Currently abandon_console_lock_in_panic() is only used to determine if
the current CPU should immediately release the console lock because
another CPU is in panic. However, later this function will be used by
the CPU to immediately release other resources in this situation.

Rename the function to other_cpu_in_panic(), which is a better
description and does not assume it is related to the console lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-8-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 9e70a5e109 printk: Add per-console suspended state
Currently the global @console_suspended is used to determine if
consoles are in a suspended state. Its primary purpose is to allow
usage of the console_lock when suspended without causing console
printing. It is synchronized by the console_lock.

Rather than relying on the console_lock to determine suspended
state, make it an official per-console state that is set within
console->flags. This allows the state to be queried via SRCU.

Remove @console_suspended. Console printing will still be avoided
when suspended because console_is_usable() returns false when
the new suspended flag is set for that console.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-7-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 696ffaf50e printk: Consolidate console deferred printing
Printing to consoles can be deferred for several reasons:

- explicitly with printk_deferred()
- printk() in NMI context
- recursive printk() calls

The current implementation is not consistent. For printk_deferred(),
irq work is scheduled twice. For NMI und recursive, panic CPU
suppression and caller delays are not properly enforced.

Correct these inconsistencies by consolidating the deferred printing
code so that vprintk_deferred() is the top-level function for
deferred printing and vprintk_emit() will perform whichever irq_work
queueing is appropriate.

Also add kerneldoc for wake_up_klogd() and defer_console_output() to
clarify their differences and appropriate usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-6-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness eacb04ff3c printk: Do not take console lock for console_flush_on_panic()
Currently console_flush_on_panic() will attempt to acquire the
console lock when flushing the buffer on panic. If it fails to
acquire the lock, it continues anyway because this is the last
chance to get any pending records printed.

The reason why the console lock was attempted at all was to
prevent any other CPUs from acquiring the console lock for
printing while the panic CPU was printing. But as of the
previous commit, non-panic CPUs will no longer attempt to
acquire the console lock in a panic situation. Therefore it is
no longer strictly necessary for a panic CPU to acquire the
console lock.

Avoiding taking the console lock when flushing in panic has
the additional benefit of avoiding possible deadlocks due to
semaphore usage in NMI context (semaphores are not NMI-safe)
and avoiding possible deadlocks if another CPU accesses the
semaphore and is stopped while holding one of the semaphore's
internal spinlocks.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-5-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 51a1d258e5 printk: Keep non-panic-CPUs out of console lock
When in a panic situation, non-panic CPUs should avoid holding the
console lock so as not to contend with the panic CPU. This is already
implemented with abandon_console_lock_in_panic(), which is checked
after each printed line. However, non-panic CPUs should also avoid
trying to acquire the console lock during a panic.

Modify console_trylock() to fail and console_lock() to block() when
called from a non-panic CPU during a panic.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-4-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 7b23a66db5 printk: Reduce console_unblank() usage in unsafe scenarios
A semaphore is not NMI-safe, even when using down_trylock(). Both
down_trylock() and up() are using internal spinlocks and up()
might even call wake_up_process().

In the panic() code path it gets even worse because the internal
spinlocks of the semaphore may have been taken by a CPU that has
been stopped.

To reduce the risk of deadlocks caused by the console semaphore in
the panic path, make the following changes:

- First check if any consoles have implemented the unblank()
  callback. If not, then there is no reason to take the console
  semaphore anyway. (This check is also useful for the non-panic
  path since the locking/unlocking of the console lock can be
  quite expensive due to console printing.)

- If the panic path is in NMI context, bail out without attempting
  to take the console semaphore or calling any unblank() callbacks.
  Bailing out is acceptable because console_unblank() would already
  bail out if the console semaphore is contended. The alternative of
  ignoring the console semaphore and calling the unblank() callbacks
  anyway is a bad idea because these callbacks are also not NMI-safe.

If consoles with unblank() callbacks exist and console_unblank() is
called from a non-NMI panic context, it will still attempt a
down_trylock(). This could still result in a deadlock if one of the
stopped CPUs is holding the semaphore internal spinlock. But this
is a risk that the kernel has been (and continues to be) willing
to take.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-3-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
John Ogness 6d3e0d8cc6 kdb: Do not assume write() callback available
It is allowed for consoles to not provide a write() callback. For
example ttynull does this.

Check if a write() callback is available before using it.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230717194607.145135-2-john.ogness@linutronix.de
2023-07-20 13:06:22 +02:00
Paul E. McKenney c924bf5a43 rcu: Clarify rcu_is_watching() kernel-doc comment
Make it clear that this function always returns either true or false
without other planned failure modes.

Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-07-19 13:21:28 -07:00
Alexei Starovoitov 6f5a630d7c bpf, net: Introduce skb_pointer_if_linear().
Network drivers always call skb_header_pointer() with non-null buffer.
Remove !buffer check to prevent accidental misuse of skb_header_pointer().
Introduce skb_pointer_if_linear() instead.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230718234021.43640-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:27:33 -07:00
Daniel Borkmann e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Daniel Borkmann 053c8e1f23 bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.

The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.

Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:

  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
        space application does not need CAP_SYS_ADMIN to retrieve foreign fds
        via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
      own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching
  - Internal revision counter and optionally being able to pass expected_revision
  - User space application can query current state with revision, and pass it
    along for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users

The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.

The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.

The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.

An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.

Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.

  [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Anton Protopopov 72829b1c1f bpf: allow any program to use the bpf_map_sum_elem_count kfunc
Register the bpf_map_sum_elem_count func for all programs, and update the
map_ptr subtest of the test_progs test to test the new functionality.

The usage is allowed as long as the pointer to the map is trusted (when
using tracing programs) or is a const pointer to map, as in the following
example:

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            ...
    } hash SEC(".maps");

    ...

    static inline int some_bpf_prog(void)
    {
            struct bpf_map *map = (struct bpf_map *)&hash;
            __s64 count;

            count = bpf_map_sum_elem_count(map);

            ...
    }

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:53 -07:00
Anton Protopopov 9c29804961 bpf: make an argument const in the bpf_map_sum_elem_count kfunc
We use the map pointer only to read the counter values, no locking
involved, so mark the argument as const.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov 5ba190c29c bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_map
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the
values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs.
This, in turn, allows users to execute trusted kfuncs which accept `struct
bpf_map *` arguments from non-tracing programs.

While exporting the btf_bpf_map_id variable, save some bytes by defining
it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST
(which is u32[64]).

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov 831deb2976 bpf: consider types listed in reg2btf_ids as trusted
The reg2btf_ids array contains a list of types for which we can (and need)
to find a corresponding static BTF id. All the types in the list can be
considered as trusted for purposes of kfuncs.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Peter Zijlstra d07f09a1f9 sched/fair: Propagate enqueue flags into place_entity()
This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra e4ec3318a1 sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra 5e963f2bd4 sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e8f331bcc2 sched/smp: Use lag to simplify cross-runqueue placement
Using lag is both more correct and simpler when moving between
runqueues.

Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 76cae9dbe1 sched/fair: Commit to lag based placement
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.

One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 147f3efaa2 sched/fair: Implement an EEVDF-like scheduling policy
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.

Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 86bfbb7ce4 sched/fair: Add lag based placement
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e0c2ff903c sched/fair: Remove sched_feat(START_DEBIT)
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra af4cf40470 sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
2023-07-19 09:43:58 +02:00
Ingo Molnar 752182b24b Linux 6.5-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmS0at0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG6m8H/RZCd2DWeM94CgK5
 DIxNLxu90PrEcrOnqeHFJtQoSiUQTHeseh9E4BH0JdPDxtlU89VwYRAevseWiVkp
 JyyJPB40UR4i2DO0P1+oWBBsGEG+bo8lZ1M+uxU5k6lgC0fAi96/O48mwwmI0Mtm
 P6BkWd3IkSXc7l4AGIrKa5P+pYEnm0Z6YAZILfdMVBcLXZWv1OAwEEkZNdUuhE3d
 5DxlQ8MLzlQIbe+LasiwdL9r606acFaJG9Hewl9x5DBlsObZ3d2rX4vEt1NEVh89
 shV29xm2AjCpLh4kstJFdTDCkDw/4H7TcFB/NMxXyzEXp3Bx8YXq+mjVd2mpq1FI
 hHtCsOA=
 =KiaU
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc2' into sched/core, to pick up fixes

Sync with upstream fixes before applying EEVDF.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-19 09:43:25 +02:00
Dave Marchevsky c3c510ce43 bpf: Add 'owner' field to bpf_{list,rb}_node
As described by Kumar in [0], in shared ownership scenarios it is
necessary to do runtime tracking of {rb,list} node ownership - and
synchronize updates using this ownership information - in order to
prevent races. This patch adds an 'owner' field to struct bpf_list_node
and bpf_rb_node to implement such runtime tracking.

The owner field is a void * that describes the ownership state of a
node. It can have the following values:

  NULL           - the node is not owned by any data structure
  BPF_PTR_POISON - the node is in the process of being added to a data
                   structure
  ptr_to_root    - the pointee is a data structure 'root'
                   (bpf_rb_root / bpf_list_head) which owns this node

The field is initially NULL (set by bpf_obj_init_field default behavior)
and transitions states in the following sequence:

  Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root
  Removal:   ptr_to_root -> NULL

Before a node has been successfully inserted, it is not protected by any
root's lock, and therefore two programs can attempt to add the same node
to different roots simultaneously. For this reason the intermediate
BPF_PTR_POISON state is necessary. For removal, the node is protected
by some root's lock so this intermediate hop isn't necessary.

Note that bpf_list_pop_{front,back} helpers don't need to check owner
before removing as the node-to-be-removed is not passed in as input and
is instead taken directly from the list. Do the check anyways and
WARN_ON_ONCE in this unexpected scenario.

Selftest changes in this patch are entirely mechanical: some BTF
tests have hardcoded struct sizes for structs that contain
bpf_{list,rb}_node fields, those were adjusted to account for the new
sizes. Selftest additions to validate the owner field are added in a
further patch in the series.

  [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Dave Marchevsky 0a1f7bfe35 bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node
Structs bpf_rb_node and bpf_list_node are opaquely defined in
uapi/linux/bpf.h, as BPF program writers are not expected to touch their
fields - nor does the verifier allow them to do so.

Currently these structs are simple wrappers around structs rb_node and
list_head and linked_list / rbtree implementation just casts and passes
to library functions for those data structures. Later patches in this
series, though, will add an "owner" field to bpf_{rb,list}_node, such
that they're not just wrapping an underlying node type. Moreover, the
bpf linked_list and rbtree implementations will deal with these owner
pointers directly in a few different places.

To avoid having to do

  void *owner = (void*)bpf_list_node + sizeof(struct list_head)

with opaque UAPI node types, add bpf_{list,rb}_node_kern struct
definitions to internal headers and modify linked_list and rbtree to use
the internal types where appropriate.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Kumar Kartikeya Dwivedi b5e9ad522c bpf: Repeat check_max_stack_depth for async callbacks
While the check_max_stack_depth function explores call chains emanating
from the main prog, which is typically enough to cover all possible call
chains, it doesn't explore those rooted at async callbacks unless the
async callback will have been directly called, since unlike non-async
callbacks it skips their instruction exploration as they don't
contribute to stack depth.

It could be the case that the async callback leads to a callchain which
exceeds the stack depth, but this is never reachable while only
exploring the entry point from main subprog. Hence, repeat the check for
the main subprog *and* all async callbacks marked by the symbolic
execution pass of the verifier, as execution of the program may begin at
any of them.

Consider functions with following stack depths:
main: 256
async: 256
foo: 256

main:
    rX = async
    bpf_timer_set_callback(...)

async:
    foo()

Here, async is not descended as it does not contribute to stack depth of
main (since it is referenced using bpf_pseudo_func and not
bpf_pseudo_call). However, when async is invoked asynchronously, it will
end up breaching the MAX_BPF_STACK limit by calling foo.

Hence, in addition to main, we also need to explore call chains
beginning at all async callback subprogs in a program.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Kumar Kartikeya Dwivedi ba7b3e7d5f bpf: Fix subprog idx logic in check_max_stack_depth
The assignment to idx in check_max_stack_depth happens once we see a
bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of
the code performs a few checks and then pushes the frame to the frame
stack, except the case of async callbacks. If the async callback case
causes the loop iteration to be skipped, the idx assignment will be
incorrect on the next iteration of the loop. The value stored in the
frame stack (as the subprogno of the current subprog) will be incorrect.

This leads to incorrect checks and incorrect tail_call_reachable
marking. Save the target subprog in a new variable and only assign to
idx once we are done with the is_async_cb check which may skip pushing
of frame to the frame stack and subsequent stack depth checks and tail
call markings.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Haitao Huang 32bf85c60c cgroup/misc: Change counters to be explicit 64bit types
So the variables can account for resources of huge quantities even on
32-bit machines.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-18 12:10:00 -10:00
Linus Torvalds 4806364acf Seven hotfixes, six of which are cc:stable and one of which addresses a
post-6.5 issue.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZLboHQAKCRDdBJ7gKXxA
 jqwtAP4m3MQNcYzQk8qbV+EQat/csTnrefytyD0ogFRoxcMAFAD/XT784sZzn4SU
 s/mL1HLk1BsubT/yQmY3lISXHDPuPAo=
 =5W3V
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Seven hotfixes, six of which are cc:stable and one of which addresses
  a post-6.5 issue"

* tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  maple_tree: fix node allocation testing on 32 bit
  maple_tree: fix 32 bit mas_next testing
  selftests/mm: mkdirty: fix incorrect position of #endif
  maple_tree: set the node limit when creating a new root node
  mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
  prctl: move PR_GET_AUXV out of PR_MCE_KILL
  selftests/mm: give scripts execute permission
2023-07-18 14:19:42 -07:00
Andrei Vagin 48a1084a8b seccomp: add the synchronous mode for seccomp_unotify
seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.

There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Andrei Vagin 6f63904c8f sched: add a few helpers to wake up tasks on the current cpu
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.

These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.

One example of such workloads is seccomp user notifies. This mechanism allows
the  supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.

On-CPU context switches are much faster than regular ones.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Peter Oskolkov ab83f455f0 sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Andrei Vagin 4943b66df1 seccomp: don't use semaphore and wait_queue together
The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:

* if we use two different ways, we always need to call them both. This
  patch fixes seccomp_notify_recv where we forgot to call wake_up_poll
  in the error path.

* If we use one primitive, we can control how many waiters are woken up
  for each request. Our goal is to wake up just one that will handle a
  request. Right now, wake_up_poll can wake up one waiter and
  up(&match->notif->request) can wake up one more.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-2-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Miguel Ojeda 636e348353 prctl: move PR_GET_AUXV out of PR_MCE_KILL
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was
applied [1].

Thus move it out of the switch, to the place the patch added it.

In the recently released v6.4 kernel some user could, in principle, be
already using this feature by mapping the right page and passing the
PR_GET_AUXV constant as a pointer:

    prctl(PR_MCE_KILL, PR_GET_AUXV, ...)

So this does change the behavior for users.  We could keep the bug since
the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET)
do not overlap.

However, v6.4 may be recent enough (2 weeks old) that moving the lines
(rather than just adding a new case) does not break anybody?  Moreover,
the documentation in man-pages was just committed today [2].

Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org
Fixes: ddc65971bb ("prctl: add PR_GET_AUXV to copy auxv to userspace")
Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1]
Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17 12:53:21 -07:00
Kamalesh Babulal c25ff4b911 cgroup: remove cgrp->kn check in css_populate_dir()
cgroup_create() creates cgrp and assigns the kernfs_node to cgrp->kn,
then cgroup_mkdir() populates base and csses cft file by calling
css_populate_dir() and cgroup_apply_control_enable() with a valid
cgrp->kn. Check for NULL cgrp->kn, will always be false, remove it.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:44:56 -10:00
Miaohe Lin 6f71780e7f cgroup: fix obsolete function name
cgroup_taskset_migrate() has been renamed to cgroup_migrate_execute() since
commit e595cd7069 ("cgroup: track migration context in cgroup_mgctx").
Update the corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:38:58 -10:00
Miaohe Lin fcbb485d9f cgroup: use cached local variable parent in for loop
Use local variable parent to initialize iter tcgrp in for loop so the size
of cgroup.o can be reduced by 64 bytes. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17 08:34:41 -10:00
Peter Zijlstra f7853c3424 locking/rtmutex: Fix task->pi_waiters integrity
Henry reported that rt_mutex_adjust_prio_check() has an ordering
problem and puts the lie to the comment in [7]. Sharing the sort key
between lock->waiters and owner->pi_waiters *does* create problems,
since unlike what the comment claims, holding [L] is insufficient.

Notably, consider:

	A
      /   \
     M1   M2
     |     |
     B     C

That is, task A owns both M1 and M2, B and C block on them. In this
case a concurrent chain walk (B & C) will modify their resp. sort keys
in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L]
is meaningless, they're different Ls.

This then gives rise to a race condition between [7] and [11], where
the requeue of pi_waiters will observe an inconsistent tree order.

	B				C

  (holds M1->wait_lock,		(holds M2->wait_lock,
   holds B->pi_lock)		 holds A->pi_lock)

  [7]
  waiter_update_prio();
  ...
  [8]
  raw_spin_unlock(B->pi_lock);
  ...
  [10]
  raw_spin_lock(A->pi_lock);

				[11]
				rt_mutex_enqueue_pi();
				// observes inconsistent A->pi_waiters
				// tree order

Fixing this means either extending the range of the owner lock from
[10-13] to [6-13], with the immediate problem that this means [6-8]
hold both blocked and owner locks, or duplicating the sort key.

Since the locking in chain walk is horrible enough without having to
consider pi_lock nesting rules, duplicate the sort key instead.

By giving each tree their own sort key, the above race becomes
harmless, if C sees B at the old location, then B will correct things
(if they need correcting) when it walks up the chain and reaches A.

Fixes: fb00aca474 ("rtmutex: Turn the plist into an rb-tree")
Reported-by: Henry Wu <triangletrap12@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Henry Wu <triangletrap12@gmail.com>
Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
2023-07-17 13:59:10 +02:00
Linus Torvalds f61a89ca11 - Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmS0N8AACgkQEsHwGGHe
 VUq4qhAAhLlRz7V0txbvj65xrQAuF7RiqWQlIm6NYX+eWb5wpqasy8mTpVr37Lhg
 JJIdSlI0sw3vIgnjgmuLU6e6MKO2pDWCqppv7CB4YjTT6/ifyIWvFotHTtfOBDjy
 eTV/wEpfDKOKHJDdWHRdUjiDgktZs1gJVV8e65/ViuWoCEV3ARy7tS4liORiwNpr
 RTuG6C3lDAfw6RKWCvBxDV15XhMDNYYQzN1bwedTAryP8jAFFUKce2de6HK6qyUF
 pNqzX0eGkN+6TG13uKLJIsAfdydWHWvkXWqFzVZsS7Upu4lyAyjslHK5UEF/MgCy
 4c4/xaqcDTKn8dtQycQ3FMwujbamHdnct9or4aDsEa9PIIAcab/gKxojNPlBSkCt
 0FpBjJUP2X/DBGpMaNX02fwQhhXwj0dN/4OWI5kXrL2sctqzueMSfTUKpjsLZMAx
 VTngoXVcVlLxqB2LqIzjhsWG+awlw/IKEKotjhX4vyra9L50CRbMLJo7hJdHD16J
 dmGQmsJRlGyQYEOgoyI7BU/Bqs/+iE1r5fAcVgjWWUPotl7XKDgcwTpELTPj5xzO
 b7ALkUXRcHf2NdfaLDZeyJzT/Z1mq1znPTsDqi94f1ASbmAd0iXLFNHtyJxXJabZ
 /aTLUj/GmmaCw2R02S4l67uC/BHChToloNVCGsq2KfBYcUunod0=
 =bSNb
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Remove a cgroup from under a polling process properly

 - Fix the idle sibling selection

* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: use kernfs polling functions for PSI trigger polling
  sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-16 13:22:08 -07:00
Linus Torvalds 6eede0686f hardening fixes for v6.5-rc2
- Remove LTO-only suffixes from promoted global function symbols (Yonghong Song)
 
 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)
 
 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)
 
 - Claim maintainership of string routines
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSzO7IWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJim9EACS3Jor73jTPAKKzbjiPdJXCZpW
 EesWC+a4nJ/hV4+pqIMUqpF/TRaJd1PZNJPMblI/wSGO4VcI6AywVLrX67a1wFyO
 pJUp3O8lXj7g5uA3cFmCMM1KKtQqZb/qZuAaP7BfnFaATjRNjejr0CHzcqqY45Jm
 fMo/it1k0zDUAHKx112T7v+lMFxGxzbysPYbXzU62zf74gPPmilk+ZnfEvUV3Qe7
 sRjV1Zk0VXwqSx0ED4tPIpJEHCCy0uZpQdo/vhghjYxFecIkM9T8XaKyEY2rqv+a
 u4YaMc/31TXi33ajfCdi1WxbNrRB6WU2tCjPrMbzdoRvTi1zaxc8GSl1mcftmdJk
 /f5Pxq1Jcf7ZRC7Ujbag9pfncITHtMXWRb84boV0JdncxmdMUmyZGuUUhU+Ma8m4
 yqdHMstaMvdtcn9FigVep2r0Fa+Eh8H+jNZuixdaReZHEYbjmrPM1P8zLtDYOSfk
 TZnnt2KrxeudgkqpNAFibWNkv9waKfx/oUU4ljwRbjjzQdUofvqyV15s9f5mcBLU
 98EDHTZ/XbURrzbgLAui/oFHZNuuBqdjZCaHD/1SQoemPF6zmkY3Hv+8YYQitk7a
 i+VaSoukE/S1vIBCasCi+2qgOOCxH4orZ2Ewll9iw2VVO4x8o8UVHIPYI5JIT6F9
 Xx1CeYrKfWDdDUXKVA==
 =HUoI
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

 - Remove LTO-only suffixes from promoted global function symbols
   (Yonghong Song)

 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)

 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)

 - Claim maintainership of string routines

* tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  sparc: mark __arch_xchg() as __always_inline
  MAINTAINERS: Foolishly claim maintainership of string routines
  kallsyms: strip LTO-only suffixes from promoted global functions
  vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount
2023-07-16 12:18:18 -07:00
Linus Torvalds 4b4eef57e6 Probe fixes for 6.5-rc1, the 2nd set:
- fprobe: Add a comment why fprobe will be skipped if another kprobe is
    running in fprobe_kprobe_handler().
 
  - probe-events: Fix some issues related to fetch-argument
   . Fix double counting of the string length for user-string and symstr.
     This will require longer buffer in the array case.
   . Fix not to count error code (minus value) for the total used length
     in array argument. This makes the total used length shorter.
   . Fix to update dynamic used data size counter only if fetcharg uses
     the dynamic size data. This may mis-count the used dynamic data
     size and corrupt data.
   . Revert "tracing: Add "(fault)" name injection to kernel probes"
     because that did not work correctly with a bug, and we agreed the
     current '(fault)' output (instead of '"(fault)"' like a string)
     explains what happened more clearly.
   . Fix to record 0-length (means fault access) data_loc data in fetch
     function itself, instead of store_trace_args(). If we record an
     array of string, this will fix to save fault access data on each
     entry of the array correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmSxSlYACgkQ2/sHvwUr
 PxupyAgApFDi9YGsmrVbXmIN5y+yGMyio2H6xR7XkX+L02nvDY6uVqL/jgT8pHfI
 AeGZEA+EqwxIfWpYBfztsFej+Gl3Elfvu14OSxwaafUlW3mgZFQqw1ZR0HvzXoKJ
 8Iw6WOXjhLe3/QLy43UY8JQGOKI07i3gh71wa0W0huOyiwwHuuVwPSY9QJJ2ulSg
 OWFSuMFO8IxYimp0BpFu/vrfa8CdgWLc24tgJ5EpZtzu6L0A2I/FMZjnBukxnP9s
 rjAXv0uRuSFvvF7/RGCqrLza12525qyHx7d5IWUq5shd3bCnaUOnAieF//MoJaR3
 q8McDJK//EPbUvCWgESuuyPS05smyQ==
 =iumA
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe fixes from Masami Hiramatsu:

 - fprobe: Add a comment why fprobe will be skipped if another kprobe is
   running in fprobe_kprobe_handler().

 - probe-events: Fix some issues related to fetch-arguments:

    - Fix double counting of the string length for user-string and
      symstr. This will require longer buffer in the array case.

    - Fix not to count error code (minus value) for the total used
      length in array argument. This makes the total used length
      shorter.

    - Fix to update dynamic used data size counter only if fetcharg uses
      the dynamic size data. This may mis-count the used dynamic data
      size and corrupt data.

    - Revert "tracing: Add "(fault)" name injection to kernel probes"
      because that did not work correctly with a bug, and we agreed the
      current '(fault)' output (instead of '"(fault)"' like a string)
      explains what happened more clearly.

    - Fix to record 0-length (means fault access) data_loc data in fetch
      function itself, instead of store_trace_args(). If we record an
      array of string, this will fix to save fault access data on each
      entry of the array correctly.

* tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
  Revert "tracing: Add "(fault)" name injection to kernel probes"
  tracing/probes: Fix to update dynamic data counter if fetcharg uses it
  tracing/probes: Fix not to count error code to total length
  tracing/probes: Fix to avoid double count of the string length on the array
  fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
2023-07-16 12:13:51 -07:00
Paul E. McKenney e40806e9bc clocksource: Handle negative skews in "skew is too large" messages
The nanosecond-to-millisecond skew computation uses unsigned arithmetic,
which produces user-unfriendly large positive numbers for negative skews.
Therefore, use signed arithmetic for this computation in order to preserve
the negativity.

Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Fixes: dd02926994 ("clocksource: Improve "skew is too large" messages")
Reviewed-by: Feng Tang <feng.tang@intel.com>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:17:09 -07:00
Connor O'Brien e2a0b786c5 torture: Support randomized shuffling for proxy exec testing
Currently shuffling sets the same cpu affinities for all tasks,
which makes us less likely to hit paths involving migrating
blocked tasks onto a cpu where they can't run.

This patch adds an element of randomness to allow affinities of
different writer tasks to diverge.

This has helped uncover issues in testing with Proxy Execution

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:04:09 -07:00
Paul E. McKenney 9cafe974cf rcutorture: Dump grace-period state upon rtort_pipe_count incidents
The rtort_pipe_count WARN() indicates that grace periods were unable
to invoke all callbacks during a stutter_wait() interval.  But it is
sometimes helpful to have a bit more information as to why.  This commit
therefore invokes show_rcu_gp_kthreads() immediately before that WARN()
in order to dump out some relevant information.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:04:09 -07:00
Paul E. McKenney 4a71be9387 scftorture: Pause testing after memory-allocation failure
The scftorture test can quickly execute a large number of calls to no-wait
smp_call_function(), each of which holds a block of memory until the
corresponding handler is invoked.  Especially when the longwait module
parameter is specified, this can chew up an arbitrarily large amount
of memory.  This commit therefore blocks after each memory-allocation
failure, with the duration a function of longwait.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:02:57 -07:00
Paul E. McKenney 013608cd08 scftorture: Forgive memory-allocation failure if KASAN
Kernels built with CONFIG_KASAN=y quarantine newly freed memory in order
to better detect use-after-free errors.  However, this can exhaust memory
more quickly in allocator-heavy tests, which can result in spurious
scftorture failure.  This commit therefore forgives memory-allocation
failure in kernels built with CONFIG_KASAN=y, but continues counting
the errors for use in detailed test-result analyses.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:02:57 -07:00
Zqiang e60c122a16 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
The rcuscale.holdoff module parameter can be used to delay the start
of rcu_scale_writer() kthread.  However, the hung-task timeout will
trigger when the timeout specified by rcuscale.holdoff is greater than
hung_task_timeout_secs:

runqemu kvm nographic slirp qemuparams="-smp 4 -m 2048M"
bootparams="rcuscale.shutdown=0 rcuscale.holdoff=300"

[  247.071753] INFO: task rcu_scale_write:59 blocked for more than 122 seconds.
[  247.072529]       Not tainted 6.4.0-rc1-00134-gb9ed6de8d4ff #7
[  247.073400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.074331] task:rcu_scale_write state:D stack:30144 pid:59    ppid:2      flags:0x00004000
[  247.075346] Call Trace:
[  247.075660]  <TASK>
[  247.075965]  __schedule+0x635/0x1280
[  247.076448]  ? __pfx___schedule+0x10/0x10
[  247.076967]  ? schedule_timeout+0x2dc/0x4d0
[  247.077471]  ? __pfx_lock_release+0x10/0x10
[  247.078018]  ? enqueue_timer+0xe2/0x220
[  247.078522]  schedule+0x84/0x120
[  247.078957]  schedule_timeout+0x2e1/0x4d0
[  247.079447]  ? __pfx_schedule_timeout+0x10/0x10
[  247.080032]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.080591]  ? __pfx_process_timeout+0x10/0x10
[  247.081163]  ? __pfx_sched_set_fifo_low+0x10/0x10
[  247.081760]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.082287]  rcu_scale_writer+0x6b1/0x7f0
[  247.082773]  ? mark_held_locks+0x29/0xa0
[  247.083252]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.083865]  ? __pfx_rcu_scale_writer+0x10/0x10
[  247.084412]  kthread+0x179/0x1c0
[  247.084759]  ? __pfx_kthread+0x10/0x10
[  247.085098]  ret_from_fork+0x2c/0x50
[  247.085433]  </TASK>

This commit therefore replaces schedule_timeout_uninterruptible() with
schedule_timeout_idle().

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Arnd Bergmann e0a34641eb rcuscale: fix building with RCU_TINY
Both the CONFIG_TASKS_RCU and CONFIG_TASKS_RUDE_RCU options
are broken when RCU_TINY is enabled as well, as some functions
are missing a declaration.

In file included from kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1271:21: error: no previous prototype for 'get_rcu_tasks_rude_gp_kthread' [-Werror=missing-prototypes]
 1271 | struct task_struct *get_rcu_tasks_rude_gp_kthread(void)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/rcu/rcuscale.c:330:27: error: 'get_rcu_tasks_rude_gp_kthread' undeclared here (not in a function); did you mean 'get_rcu_tasks_trace_gp_kthread'?
  330 |         .rso_gp_kthread = get_rcu_tasks_rude_gp_kthread,
      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                           get_rcu_tasks_trace_gp_kthread

In file included from /home/arnd/arm-soc/kernel/rcu/update.c:649:
kernel/rcu/tasks.h:1113:21: error: no previous prototype for 'get_rcu_tasks_gp_kthread' [-Werror=missing-prototypes]
 1113 | struct task_struct *get_rcu_tasks_gp_kthread(void)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~

Also, building with CONFIG_TASKS_RUDE_RCU but not CONFIG_TASKS_RCU is
broken because of some missing stub functions:

kernel/rcu/rcuscale.c:322:27: error: 'tasks_scale_read_lock' undeclared here (not in a function); did you mean 'srcu_scale_read_lock'?
  322 |         .readlock       = tasks_scale_read_lock,
      |                           ^~~~~~~~~~~~~~~~~~~~~
      |                           srcu_scale_read_lock
kernel/rcu/rcuscale.c:323:27: error: 'tasks_scale_read_unlock' undeclared here (not in a function); did you mean 'srcu_scale_read_unlock'?
  323 |         .readunlock     = tasks_scale_read_unlock,
      |                           ^~~~~~~~~~~~~~~~~~~~~~~
      |                           srcu_scale_read_unlock

Move the declarations outside of the RCU_TINY #ifdef and duplicate the
shared stub functions to address all of the above.

Fixes: 88d7ff818f0ce ("rcuscale: Add RCU Tasks Rude testing")
Fixes: 755f1c5eb416b ("rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney a15ec57cfc rcuscale: Add RCU Tasks Rude testing
Add a "tasks-rude" option to the rcuscale.scale_type module parameter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 271a8467a5 rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time
This commit causes RCU Tasks Trace to output the CPU time consumed by
its grace-period kthread.  The CPU time is whatever is in the designated
task's current->stime field, and thus is controlled by whatever CPU-time
accounting scheme is in effect.

This output appears in microseconds as follows on the console:

rcu_scale: Grace-period kthread CPU time: 42367.037

[ paulmck: Apply Willy Tarreau feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 5f8e320269 rcuscale: Measure grace-period kthread CPU time
This commit adds the ability to output the CPU time consumed by the
grace-period kthread for the RCU variant under test.  The CPU time is
whatever is in the designated task's current->stime field, and thus is
controlled by whatever CPU-time accounting scheme is in effect.

This output appears in microseconds as follows on the console:

rcu_scale: Grace-period kthread CPU time: 42367.037

[ paulmck: Apply feedback from Stephen Rothwell and kernel test robot. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
2023-07-14 15:01:49 -07:00
Paul E. McKenney bb7bad3dae rcuscale: Print out full set of kfree_rcu parameters
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
2023-07-14 15:01:49 -07:00
Paul E. McKenney c68465dfaa rcuscale: Print out full set of module parameters
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 7221f493c5 rcuscale: Add minruntime module parameter
By default, rcuscale collects only 100 points of data per writer, but
arranging for all kthreads to be actively collecting (if not recording)
data during the time that any kthread might be recording.  This works
well, but does not allow much time to bring external performance tools
to bear.  This commit therefore adds a minruntime module parameter
that specifies a minimum data-collection interval in seconds.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney ee7516a163 rcuscale: Fix gp_async_max typo: s/reader/writer/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:49 -07:00
Paul E. McKenney 2226f3dc05 rcuscale: Permit blocking delays between writers
Some workloads do isolated RCU work, and this can affect operation
latencies.  This commit therefore adds a writer_holdoff_jiffies module
parameter that causes writers to block for the specified number of
jiffies between each pair of consecutive write-side operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:48 -07:00
Paul E. McKenney b5a2801fc0 refscale: Add a "jiffies" test
This commit adds a "jiffies" test to refscale, allowing use of jiffies
to be compared to ktime_get_real_fast_ns().  On my x86 laptop, jiffies
is more than 20x faster.  (Though for many uses, the tens-of-nanoseconds
overhead of ktime_get_real_fast_ns() will be just fine.)

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:04 -07:00
Waiman Long f5063e8948 refscale: Fix uninitalized use of wait_queue_head_t
Running the refscale test occasionally crashes the kernel with the
following error:

[ 8569.952896] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 8569.952900] #PF: supervisor read access in kernel mode
[ 8569.952902] #PF: error_code(0x0000) - not-present page
[ 8569.952904] PGD c4b048067 P4D c4b049067 PUD c4b04b067 PMD 0
[ 8569.952910] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 8569.952916] Hardware name: Dell Inc. PowerEdge R750/0WMWCR, BIOS 1.2.4 05/28/2021
[ 8569.952917] RIP: 0010:prepare_to_wait_event+0x101/0x190
  :
[ 8569.952940] Call Trace:
[ 8569.952941]  <TASK>
[ 8569.952944]  ref_scale_reader+0x380/0x4a0 [refscale]
[ 8569.952959]  kthread+0x10e/0x130
[ 8569.952966]  ret_from_fork+0x1f/0x30
[ 8569.952973]  </TASK>

The likely cause is that init_waitqueue_head() is called after the call to
the torture_create_kthread() function that creates the ref_scale_reader
kthread.  Although this init_waitqueue_head() call will very likely
complete before this kthread is created and starts running, it is
possible that the calling kthread will be delayed between the calls to
torture_create_kthread() and init_waitqueue_head().  In this case, the
new kthread will use the waitqueue head before it is properly initialized,
which is not good for the kernel's health and well-being.

The above crash happened here:

	static inline void __add_wait_queue(...)
	{
		:
		if (!(wq->flags & WQ_FLAG_PRIORITY)) <=== Crash here

The offset of flags from list_head entry in wait_queue_entry is
-0x18. If reader_tasks[i].wq.head.next is NULL as allocated reader_task
structure is zero initialized, the instruction will try to access address
0xffffffffffffffe8, which is exactly the fault address listed above.

This commit therefore invokes init_waitqueue_head() before creating
the kthread.

Fixes: 653ed64b01 ("refperf: Add a test to measure performance of read-side synchronization")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:01:04 -07:00
Paul E. McKenney db13710a03 rcu-tasks: Cancel callback laziness if too many callbacks
The various RCU Tasks flavors now do lazy grace periods when there are
only asynchronous grace period requests.  By default, the system will let
250 milliseconds elapse after the first call_rcu_tasks*() callbacki is
queued before starting a grace period.  In contrast, synchronous grace
period requests such as synchronize_rcu_tasks*() will start a grace
period immediately.

However, invoking one of the call_rcu_tasks*() functions in a too-tight
loop can result in a callback flood, which in turn can exhaust memory
if grace periods are delayed for too long.

This commit therefore sets a limit so that the grace-period kthread
will be awakened when any CPU's callback list expands to contain
rcupdate.rcu_task_lazy_lim callbacks elements (defaulting to 32, set to -1
to disable), the grace-period kthread will be awakened, thus cancelling
any ongoing laziness and getting out in front of the potential callback
flood.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney 450d461aa6 rcu-tasks: Add kernel boot parameters for callback laziness
This commit adds kernel boot parameters for callback laziness, allowing
the RCU Tasks flavors to be individually adjusted.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney 5ae769c611 rcu-tasks: Remove redundant #ifdef CONFIG_TASKS_RCU
The kernel/rcu/tasks.h file has a #endif immediately followed by an

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:12 -07:00
Paul E. McKenney d119357d07 rcu-tasks: Treat only synchronous grace periods urgently
The performance requirements on RCU Tasks, and in particular on RCU
Tasks Trace, have evolved over time as the workloads have evolved.
The current implementation is designed to provide low grace-period
latencies, and also to accommodate short-duration floods of callbacks.

However, current workloads can also provide a constant background
callback-queuing rate of a few hundred call_rcu_tasks_trace() invocations
per second.  This results in continuous back-to-back RCU Tasks Trace
grace periods, which in turn can consume the better part of 10% of a CPU.
One could take the attitude that there are several tens of other CPUs on
the systems running such workloads, but energy efficiency is a thing.
On these systems, although asynchronous grace-period requests happen
every few milliseconds, synchronous grace-period requests are quite rare.

This commit therefore arrranges for grace periods to be initiated
immediately in response to calls to synchronize_rcu_tasks*() and
also to calls to synchronize_rcu_mult() that are passed one of the
call_rcu_tasks*() functions.  These are recognized by the tell-tale
wakeme_after_rcu callback function.

In other cases, callbacks are gathered up for up to about 250 milliseconds
before a grace period is initiated.  This results in more than an order of
magnitude reduction in RCU Tasks Trace grace periods, with corresponding
reduction in consumption of CPU time.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14 15:00:11 -07:00
Randy Dunlap 67b3f564cb time: add kernel-doc in time.c
Add kernel-doc for all APIs that do not already have it.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Acked-by: John Stultz <jstultz@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230704052405.5089-3-rdunlap@infradead.org
2023-07-14 13:47:07 -06:00
Linus Torvalds bde7f15027 Power management fixes for 6.5-rc2
- Unbreak the /sys/power/resume interface after recent changes (Azat
    Khuzhin).
 
  - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
    Yang).
 
  - Remove __init from cpufreq callbacks in the sparc driver, because
    they may be called after initialization too (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmSxhIcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxfdQP/1+MxGLh2XVVKvB9QTI0/ofYPxPfoTuf
 5Sf2lOa7rNeauc2xnQFM6EMOeme4ckTUbrO7AkZVACYVqbKJ92IUBJfo3R2Ar+1Z
 9TogwG+YOX3OjR8QtiGHLwA5fvOgbt89JaDn2ZCWcS+gHARZ3VMgdbDt+C3MUldV
 UVr/5kAkWefv39PIYHCwAJU7Xhn97B5nW58KgpkHuxOcHDKe0VfdxLcKBsyoc6QE
 IGMVV2WtnoyEdM1aNfZ37+3NksiIdZMA6OvM5C/7HOs6IqJaFxVUxm4333sM5AW9
 y5LPYSBbedVxICdLkUUq8W5MDRNCTPXgC3gexEu0XtWdAV9AG+9aNeZytT7KGrLO
 xe4vbl6s1LnxC6YBw2bB+U/DbLtxQrAP1nYZj6yxhbHVsnTTZg7Qvevz7nAdPlOL
 3FsutIT+9OQprWXxYBRv3AumF+hpG1bm8Zutyaqb2vwRwMbbXWTpzRry4ydp6bTj
 VB2YWeQOxCKl+dC5jXM1wfPjbsqWQvvGZVh1VIzlcZgzqALWf+F5fTMUKuY963kd
 V6fR1YmKg+Xxb+BU9mEjaUMfH5Yr8Mv9Gpf7D87MTsNTluFjAmRWk5a0ahVAXcwe
 n1jFxBNUsuJuvz2KwWrVZExT2xqJ8kVfnOxNcevtaXK+uk8+jE94lwjJn0P9v8w8
 e+0QABkgUYY2
 =b/om
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix hibernation (after recent changes), frequency QoS and the
  sparc cpufreq driver.

  Specifics:

   - Unbreak the /sys/power/resume interface after recent changes (Azat
     Khuzhin).

   - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
     Yang).

   - Remove __init from cpufreq callbacks in the sparc driver, because
     they may be called after initialization too (Viresh Kumar)"

* tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: sparc: Don't mark cpufreq callbacks with __init
  PM: QoS: Restore support for default value on frequency QoS
  PM: hibernate: Fix writing maj:min to /sys/power/resume
2023-07-14 11:07:04 -07:00
Rafael J. Wysocki d121758da6 Merge branches 'pm-sleep' and 'pm-qos'
Merge a PM QoS fix and a hibernation fix for 6.5-rc2.

 - Unbreak the /sys/power/resume interface after recent changes (Azat
   Khuzhin).

 - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
   Yang).

* pm-sleep:
  PM: hibernate: Fix writing maj:min to /sys/power/resume

* pm-qos:
  PM: QoS: Restore support for default value on frequency QoS
2023-07-14 19:13:21 +02:00
Masami Hiramatsu (Google) 797311bce5 tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails
to get the string data.
Currently those expect that the data_loc is updated by store_trace_args() if
it returns the error code. However, that does not work correctly if the
argument is an array of strings. In that case, store_trace_args() only clears
the first entry of the array (which may have no error) and leaves other
entries. So it should be cleared by fetch_store_string*() itself.
Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated
only if it is used (ret > 0 and argument is a dynamic data.)

Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/

Fixes: 40b53b7718 ("tracing: probeevent: Add array type support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-14 17:04:58 +09:00
Jakub Kicinski d2afa89f66 for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSwqwoACgkQ6rmadz2v
 bTqOHRAAn+fzTLqUqsveFQcxOkie5MPHxKoOTjG4+yFR7rzPkU6Mn5RX3w5yFzSn
 RqutwykF9OgipAzC3QXv4pRJuq6Gia5nvwUSDP4CX273ljyeF54DK7HfopE1+YrK
 HXyBWZvVvMZP6q7qQyQ3qtbHZSjs5XP/M6YBlJ5zo/BTLFCyvbSDP14YKEqcBkWG
 ld72ElXFxlnr/zEfRjzBCfMlbmgeHLO0SiHS/9827zEmNP1AAH5/ETA7/rJ7yCJs
 QNQUIoJWob8xm5FMJ6CU/+sOqXR1CY053meGJFFBX5pvVD/CLRhrwHn0IMCyQqmh
 wKR5waeXhpl/CKNeFuxXVMNFiXbqBb/0LYJaJtrMysjMLTsQ9X7NkrDBa/+kYGyZ
 +ghGlaMQvPqUGg0rLH2nl9JNB8Ne/8prLMsAKUWnPuOo+Q03j054gnqhGeNtDd5b
 gpSk+7x93PlhGcegBV1Wk8dkiGC5V9nTVNxg40XQUCs4k9L/8Vjc35Tjqx7nBTNH
 DiFD24DDKUZacw9L6nEqvLF/N2fiRjtUZnVPC0yn/annyBcfX1s+ZH2Tu1F6Qk38
 QMfBCnt12exmsiDoxdzzGJtjHnS/k5fsaKjlR21mOyMrIH7ipltr5UHHrdr1hBP6
 24uSeTImvQQKDi+9IuXN127jZDOupKqVS6csrA0ZXrlKWh2HR+U=
 =GVUB
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-07-13

We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).

The main changes are:

1) Fix bpftool build in presence of stale vmlinux.h,
   from Alexander Lobakin.

2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
   from Alexei Starovoitov.

3) Teach verifier actual bounds of bpf_get_smp_processor_id()
   and fix perf+libbpf issue related to custom section handling,
   from Andrii Nakryiko.

4) Introduce bpf map element count, from Anton Protopopov.

5) Check skb ownership against full socket, from Kui-Feng Lee.

6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.

7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.

8) Fix BTF walking of unions, from Yafang Shao.

9) Extend link_info for kprobe_multi and perf_event links,
   from Yafang Shao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
  selftests/bpf: Add selftest for PTR_UNTRUSTED
  bpf: Fix an error in verifying a field in a union
  selftests/bpf: Add selftests for nested_trust
  bpf: Fix an error around PTR_UNTRUSTED
  selftests/bpf: add testcase for TRACING with 6+ arguments
  bpf, x86: allow function arguments up to 12 for TRACING
  bpf, x86: save/restore regs with BPF_DW size
  bpftool: Use "fallthrough;" keyword instead of comments
  bpf: Add object leak check.
  bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  selftests/bpf: Improve test coverage of bpf_mem_alloc.
  rcu: Export rcu_request_urgent_qs_task()
  bpf: Allow reuse from waiting_for_gp_ttrace list.
  bpf: Add a hint to allocated objects.
  bpf: Change bpf_mem_cache draining process.
  bpf: Further refactor alloc_bulk().
  bpf: Factor out inc/dec of active flag into helpers.
  bpf: Refactor alloc_bulk().
  bpf: Let free_all() return the number of freed elements.
  ...
====================

Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13 19:13:24 -07:00
Yafang Shao 33937607ef bpf: Fix an error in verifying a field in a union
We are utilizing BPF LSM to monitor BPF operations within our container
environment. When we add support for raw_tracepoint, it hits below
error.

; (const void *)attr->raw_tracepoint.name);
27: (79) r3 = *(u64 *)(r2 +0)
access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8

It can be reproduced with below BPF prog.

SEC("lsm/bpf")
int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
{
	switch (cmd) {
	case BPF_RAW_TRACEPOINT_OPEN:
		bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
		break;
	default:
		break;
	}
	return 0;
}

The reason is that when accessing a field in a union, such as bpf_attr,
if the field is located within a nested struct that is not the first
member of the union, it can result in incorrect field verification.

  union bpf_attr {
      struct {
          __u32 map_type; <<<< Actually it will find that field.
          __u32 key_size;
          __u32 value_size;
         ...
      };
      ...
      struct {
          __u64 name;    <<<< We want to verify this field.
          __u32 prog_fd;
      } raw_tracepoint;
  };

Considering the potential deep nesting levels, finding a perfect
solution to address this issue has proven challenging. Therefore, I
propose a solution where we simply skip the verification process if the
field in question is located within a union.

Fixes: 7e3617a72d ("bpf: Add array support to btf_struct_access")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Yafang Shao 7ce4dc3e4a bpf: Fix an error around PTR_UNTRUSTED
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been
cleared when we start to walk a new struct, because the struct in
question may be a struct nested in a union. We should also check and set
this flag before we walk its each member, in case itself is a union.
We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL.

Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Linus Torvalds b1983d427a Networking fixes for 6.5-rc2, including fixes from netfilter,
wireless and ebpf
 
 Current release - regressions:
 
   - netfilter: conntrack: gre: don't set assured flag for clash entries
 
   - wifi: iwlwifi: remove 'use_tfh' config to fix crash
 
 Previous releases - regressions:
 
   - ipv6: fix a potential refcount underflow for idev
 
   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev()
 
   - bpf: fix max stack depth check for async callbacks
 
   - eth: mlx5e:
     - check for NOT_READY flag state after locking
     - fix page_pool page fragment tracking for XDP
 
   - eth: igc:
     - fix tx hang issue when QBV gate is closed
     - fix corner cases for TSN offload
 
   - eth: octeontx2-af: Move validation of ptp pointer before its usage
 
   - eth: ena: fix shift-out-of-bounds in exponential backoff
 
 Previous releases - always broken:
 
   - core: prevent skb corruption on frag list segmentation
 
   - sched:
     - cls_fw: fix improper refcount update leads to use-after-free
     - sch_qfq: account for stab overhead in qfq_enqueue
 
   - netfilter:
     - report use refcount overflow
     - prevent OOB access in nft_byteorder_eval
 
   - wifi: mt7921e: fix init command fail with enabled device
 
   - eth: ocelot: fix oversize frame dropping for preemptible TCs
 
   - eth: fec: recycle pages for transmitted XDP frames
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmSv1YISHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkpQgP/1msj0MlIWJnMgzPiMonDSe746JGTg/j
 YengEjqcy3ozC4COBEeyBO6ilt6I+Wrb5H5jimn9h2djB+D7htWNaejQaqJrBxph
 F4lUC6OJqd2ncI3tXAG2BSX1duzDr6B7yL7d5InFIczw8vNh+chsyX0sjlzU12bt
 ppjcSb+Ffc796DB0ItJkBqluxcpjyXE15ZWTTV4GEHK6RoRdxNIGjd7NgvD8podB
 Q/464bHs1jJYkAavuobiOXV2fuxWLTs77E0Vloizoo+42UiRFMLJk+RX98PhSIMa
 eejkxfm+H6+6Qi2omYepvf7vDN3GtLjxbr5C3mTdWPuL4QbNY8agVJ7sS4XnL5/v
 B7EAjyGQK9SmD36zTu7QL/Ul6fSnRq8jz20B0mDa0imAWzi58A+jqbQAMoVOMSS+
 Uv4yKJpIUyx7mUI77+EX3U9r1wytw5eniatTDU+GAsQb2CJ43CqDmn/7RcmGacBo
 P1q+il9JW4kzUQrisUSxmQDfpBvQi5wiygiEdUNI5FEhq6/iKe/lrJnmJZpaLkd5
 P3oEKjapamAmcyrEr/7VD1Mb4jrRfpB7zVn/5OyvywbcLQxA+531iPpy4r4W6cWH
 1MRLBVVHKyb3jfm8J3T4lpDEzd03+MiPS8JiKMUYYNUYkY8tYp92muwC7z2sGI4M
 6eR2MeKD4vds
 =cELX
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter, wireless and ebpf.

  Current release - regressions:

   - netfilter: conntrack: gre: don't set assured flag for clash entries

   - wifi: iwlwifi: remove 'use_tfh' config to fix crash

  Previous releases - regressions:

   - ipv6: fix a potential refcount underflow for idev

   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in
     icmp6_dev()

   - bpf: fix max stack depth check for async callbacks

   - eth: mlx5e:
      - check for NOT_READY flag state after locking
      - fix page_pool page fragment tracking for XDP

   - eth: igc:
      - fix tx hang issue when QBV gate is closed
      - fix corner cases for TSN offload

   - eth: octeontx2-af: Move validation of ptp pointer before its usage

   - eth: ena: fix shift-out-of-bounds in exponential backoff

  Previous releases - always broken:

   - core: prevent skb corruption on frag list segmentation

   - sched:
      - cls_fw: fix improper refcount update leads to use-after-free
      - sch_qfq: account for stab overhead in qfq_enqueue

   - netfilter:
      - report use refcount overflow
      - prevent OOB access in nft_byteorder_eval

   - wifi: mt7921e: fix init command fail with enabled device

   - eth: ocelot: fix oversize frame dropping for preemptible TCs

   - eth: fec: recycle pages for transmitted XDP frames"

* tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  selftests: tc-testing: add test for qfq with stab overhead
  net/sched: sch_qfq: account for stab overhead in qfq_enqueue
  selftests: tc-testing: add tests for qfq mtu sanity check
  net/sched: sch_qfq: reintroduce lmax bound check for MTU
  wifi: cfg80211: fix receiving mesh packets without RFC1042 header
  wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()
  net: txgbe: fix eeprom calculation error
  net/sched: make psched_mtu() RTNL-less safe
  net: ena: fix shift-out-of-bounds in exponential backoff
  netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()
  net/sched: flower: Ensure both minimum and maximum ports are specified
  MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER
  docs: netdev: update the URL of the status page
  wifi: iwlwifi: remove 'use_tfh' config to fix crash
  xdp: use trusted arguments in XDP hints kfuncs
  bpf: cpumap: Fix memory leak in cpu_map_update_elem
  wifi: airo: avoid uninitialized warning in airo_get_rate()
  octeontx2-pf: Add additional check for MCAM rules
  net: dsa: Removed unneeded of_node_put in felix_parse_ports_node
  net: fec: use netdev_err_once() instead of netdev_err()
  ...
2023-07-13 14:21:22 -07:00
Linus Torvalds ebc27aacee Tracing fixes and clean ups:
- Fix some missing-prototype warnings
 
 - Fix user events struct args (did not include size of struct)
   When creating a user event, the "struct" keyword is to denote
   that the size of the field will be passed in. But the parsing
   failed to handle this case.
 
 - Add selftest to struct sizes for user events
 
 - Fix sample code for direct trampolines.
   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code
   was not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.
 
 - Remove unused ftrace_regs_caller_ret() prototype.
 
 - Quiet false positive of FORTIFY_SOURCE
   Due to backward compatibility, the structure used to save stack traces
   in the kernel had a fixed size of 8. This structure is exported to
   user space via the tracing format file. A change was made to allow
   more than 8 functions to be recorded, and user space now uses the
   size field to know how many functions are actually in the stack.
   But the structure still has size of 8 (even though it points into
   the ring buffer that has the required amount allocated to hold a
   full stack. This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it. Hide this by using a pointer
   to the stack location on the ring buffer instead of using the address
   of the entry structure caller field.
 
 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to
   read the data, but the read code uses rb_num_of_entries() that
   returned zero, and causing a infinite "retry".
 
 - Fix a warning caused by not using all pages allocated to store
   ftrace functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed
   to be off.
 
 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it.
 
 - Fix memory leak in freeing iter->temp in tracing_release_pipe()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZLBF6hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkswAP4mhdoFFfNosM7+Sh/R4t31IxKZApm9
 M2Hf9jgvJ7b65AD/VV1XfO6skw2+5Yn9S4UyNE2MQaYxPwWpONcNFUzZ3Q8=
 =Nb+7
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix some missing-prototype warnings

 - Fix user events struct args (did not include size of struct)

   When creating a user event, the "struct" keyword is to denote that
   the size of the field will be passed in. But the parsing failed to
   handle this case.

 - Add selftest to struct sizes for user events

 - Fix sample code for direct trampolines.

   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code was
   not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.

 - Remove unused ftrace_regs_caller_ret() prototype.

 - Quiet false positive of FORTIFY_SOURCE

   Due to backward compatibility, the structure used to save stack
   traces in the kernel had a fixed size of 8. This structure is
   exported to user space via the tracing format file. A change was made
   to allow more than 8 functions to be recorded, and user space now
   uses the size field to know how many functions are actually in the
   stack.

   But the structure still has size of 8 (even though it points into the
   ring buffer that has the required amount allocated to hold a full
   stack.

   This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it.

   Hide this by using a pointer to the stack location on the ring buffer
   instead of using the address of the entry structure caller field.

 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to read
   the data, but the read code uses rb_num_of_entries() that returned
   zero, and causing a infinite "retry".

 - Fix a warning caused by not using all pages allocated to store ftrace
   functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed to be
   off.

 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it

 - Fix memory leak in freeing iter->temp in tracing_release_pipe()

* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix memory leak of iter->temp when reading trace_pipe
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tracing: arm64: Avoid missing-prototype warnings
  selftests/user_events: Test struct size match cases
  tracing/user_events: Fix struct arg size match check
  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
  arm64: ftrace: Add direct call trampoline samples support
  samples: ftrace: Save required argument registers in sample trampolines
2023-07-13 13:44:28 -07:00
Masami Hiramatsu (Google) 4ed8f337de Revert "tracing: Add "(fault)" name injection to kernel probes"
This reverts commit 2e9906f84f.

It was turned out that commit 2e9906f84f ("tracing: Add "(fault)"
name injection to kernel probes") did not work correctly and probe
events still show just '(fault)' (instead of '"(fault)"'). Also,
current '(fault)' is more explicit that it faulted.

This also moves FAULT_STRING macro to trace.h so that synthetic
event can keep using it, and uses it in trace_probe.c too.

Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/
Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/

Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:37:43 +09:00
Masami Hiramatsu (Google) e38e2c6a9e tracing/probes: Fix to update dynamic data counter if fetcharg uses it
Fix to update dynamic data counter ('dyndata') and max length ('maxlen')
only if the fetcharg uses the dynamic data. Also get out arg->dynamic
from unlikely(). This makes dynamic data address wrong if
process_fetch_insn() returns error on !arg->dynamic case.

Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/
Fixes: 9178412ddf ("tracing: probeevent: Return consumed bytes of dynamic area")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:37:00 +09:00
Masami Hiramatsu (Google) b41326b5e0 tracing/probes: Fix not to count error code to total length
Fix not to count the error code (which is minus value) to the total
used length of array, because it can mess up the return code of
process_fetch_insn_bottom(). Also clear the 'ret' value because it
will be used for calculating next data_loc entry.

Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 9b960a3883 ("tracing: probeevent: Unify fetch_insn processing common part")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:36:28 +09:00
Masami Hiramatsu (Google) 66bcf65d6c tracing/probes: Fix to avoid double count of the string length on the array
If an array is specified with the ustring or symstr, the length of the
strings are accumlated on both of 'ret' and 'total', which means the
length is double counted.
Just set the length to the 'ret' value for avoiding double counting.

Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 88903c4643 ("tracing/probe: Add ustring type for user-space string")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:35:53 +09:00
Masami Hiramatsu (Google) d5f28bb1ce fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
Add a comment the reason why fprobe_kprobe_handler() exits if any other
kprobe is running.

Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14 00:24:00 +09:00
Zheng Yejian d5a8218963 tracing: Fix memory leak of iter->temp when reading trace_pipe
kmemleak reports:
  unreferenced object 0xffff88814d14e200 (size 256):
    comm "cat", pid 336, jiffies 4294871818 (age 779.490s)
    hex dump (first 32 bytes):
      04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00  ................
      0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff  .........Z......
    backtrace:
      [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140
      [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0
      [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0
      [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950
      [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0
      [<ffffffff9bf03a43>] vfs_read+0x143/0x520
      [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160
      [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90
      [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

when reading file 'trace_pipe', 'iter->temp' is allocated or relocated
in trace_find_next_entry() but not freed before 'trace_pipe' is closed.

To fix it, free 'iter->temp' in tracing_release_pipe().

Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: ff895103a8 ("tracing: Save off entry when peeking at next entry")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13 10:48:36 -04:00
Vincent Guittot 7ee7642c91 sched/fair: Stabilize asym cpu capacity system idle cpu selection
select_idle_capacity() not only looks for an idle cpu that fits for the
waking task but also for cpu with highest bandwidth when no cpu fits.
Start the loop with target cpu so it will be selected 1st when no cpu fits
but several cpus shared the same bandwidth. Starting with target cpu
prevents the task to migrate between cpus with same bandwidth at every
wakeup when no cpu fits.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230711081359.868862-1-vincent.guittot@linaro.org
2023-07-13 15:21:53 +02:00
Peter Zijlstra ed74cc4995 sched/debug: Dump domains' sched group flags
There have been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain were not set and propagated properly when a degenerate
domain is removed.

Add dump of domain sched group flags of a CPU to make debug easier
in the future.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:53 +02:00
Ricardo Neri b1bfeab9b0 sched/fair: Consider the idle state of the whole core for load balance
should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
cpus) starting from lower numbered CPUs looking for the first idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
non-SMT cores:

	[0, 1] [2, 3] [4, 5] 6 7 8 9
         b  i   b  i   b  i  b i i i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to idle
load balance as it has more CPU capacity than a CPU on an idle SMT
CPU with busy sibling.  So for the figure above, if we are running
should_we_balance() to CPU 1, we should return false to let CPU 7 on
idle core to have a chance first to idle load balance.

A partially busy (i.e., of type group_has_spare) local group with SMT 
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered, SMT cores should not
be considered when finding the first idle CPU. 

However, in should_we_balance(), when we encounter idle SMT first in partially
busy core, we prematurely break the search for the first idle CPU.

Higher-numbered, non-SMT cores is not given the chance to have
idle balance done on their behalf. Those CPUs will only be considered
for idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:52 +02:00
Tim C Chen 7ff1693236 sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the tasks to be moved is simply the difference in
number of tasks between the two groups (i.e. imbalance) divided by two.

However, we may have different number of cores between the cluster groups,
say when we take CPU offline or we have hybrid groups.  In that case,
we should balance between the two groups such that #tasks/#cores ratio
is the same between the same between both groups.  Hence the imbalance
computed will need to reflect this.

Adjust the sibling imbalance computation to take into account of the
above considerations.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:52 +02:00
Tim C Chen d24cb0d911 sched/topology: Record number of cores in sched group
When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the
number of cores in each domain. In preparation of implementing such a
policy, record the number of cores in a scheduling group.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:51 +02:00
Tim C Chen fee1759e4f sched/fair: Determine active load balance for SMT sched groups
On hybrid CPUs with scheduling cluster enabled, we will need to
consider balancing between SMT CPU cluster, and Atom core cluster.

Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
Each scheduling cluster span a L2 cache.

          --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
          [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
          Big    Big    Big    Big    Atom       Atom
          core   core   core   core   Module     Module

If the busiest group is a big core with both SMT CPUs busy, we should
active load balance if destination group has idle CPU cores.  Such
condition is considered by asym_active_balance() in load balancing but not
considered when looking for busiest group and computing load imbalance.
Add this consideration in find_busiest_group() and calculate_imbalance().

In addition, update the logic determining the busier group when one group
is SMT and the other group is non SMT but both groups are partially busy
with idle CPU. The busier group should be the group with idle cores rather
than the group with one busy SMT CPU.  We do not want to make the SMT group
the busiest one to pull the only task off SMT CPU and causing the whole core to
go empty.

Otherwise suppose in the search for the busiest group, we first encounter
an SMT group with 1 task and set it as the busiest.  The destination
group is an atom cluster with 1 task and we next encounter an atom
cluster group with 3 tasks, we will not pick this atom cluster over the
SMT group, even though we should.  As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local atom cluster
(with 1 task).  And it doesn't make sense to pick the 1 task SMT group
as the busier group as we also should not pull task off the SMT towards
the 1 task atom cluster and make the SMT core completely empty.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13 15:21:51 +02:00
Miaohe Lin 35cd21f629 sched/psi: make psi_cgroups_enabled static
The static key psi_cgroups_enabled is only used inside file psi.c.
Make it static.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20230525103428.49712-1-linmiaohe@huawei.com
2023-07-13 15:21:50 +02:00
Cruz Zhao 548796e2e7 sched/core: introduce sched_core_idle_cpu()
As core scheduling introduced, a new state of idle is defined as
force idle, running idle task but nr_running greater than zero.

If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judge the RCU boost kthread is starving.

But this will cause error in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, results
that idle_cpu() returns 0. In function tick_irq_exit(), if idle_cpu()
is 0, tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1, which became 0 in tick_nohz_irq_enter().
ts->idle_sleeptime won't update in function update_ts_time_stats(), if
ts->idle_active is 0, which should be 1. And this bug will result that
ts->idle_sleeptime is less than the actual value, and finally will
result that the idle time in /proc/stat is less than the actual value.

To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audit all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().

v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
2023-07-13 15:21:50 +02:00
Josh Don 677ea015f2 sched: add throttled time stat for throttled children
We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each children cgroup has been throttled.

This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.

Expose this in a new interface, 'cpu.stat.local', which is similar to
how non-hierarchical events are accounted in 'memory.events.local'.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
2023-07-13 15:21:49 +02:00
Josh Don 79462e8c87 sched: don't account throttle time for empty groups
It is easy for a cfs_rq to become throttled even when it has no enqueued
entities (for example, if we have just put_prev()'d the last runnable
task of the cfs_rq, and the cfs_rq is out of quota).

Avoid accounting this time towards total throttle time, since it
otherwise falsely inflates the stats.

Note that the dequeue path is special, since we normally disallow
migrations when a task is in a throttled hierarchy (see
throttled_lb_pair()).

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230620183247.737942-1-joshdon@google.com
2023-07-13 15:21:49 +02:00
Wander Lairson Costa d243b34459 kernel/fork: beware of __put_task_struct() calling context
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from an non-preemptible context.

One practical example is splat inside inactive_task_timer(), which is
called in a interrupt context:

  CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
   Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
   Call Trace:
   dump_stack_lvl+0x57/0x7d
   mark_lock_irq.cold+0x33/0xba
   mark_lock+0x1e7/0x400
   mark_usage+0x11d/0x140
   __lock_acquire+0x30d/0x930
   lock_acquire.part.0+0x9c/0x210
   rt_spin_lock+0x27/0xe0
   refill_obj_stock+0x3d/0x3a0
   kmem_cache_free+0x357/0x560
   inactive_task_timer+0x1ad/0x340
   __run_hrtimer+0x8a/0x1a0
   __hrtimer_run_queues+0x91/0x130
   hrtimer_interrupt+0x10f/0x220
   __sysvec_apic_timer_interrupt+0x7b/0xd0
   sysvec_apic_timer_interrupt+0x4f/0xd0
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   RIP: 0033:0x7fff196bf6f5

Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would use a workqueue, but since
in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
the code would become more complex because we would need to put the
work_struct instance in the task_struct and initialize it when we
allocate a new task_struct.

The issue is reproducible with stress-ng:

  while true; do
      stress-ng --sched deadline --sched-period 1000000000 \
	      --sched-runtime 800000000 --sched-deadline \
	      1000000000 --mmapfork 23 -t 20
  done

Reported-by: Hu Chunyu <chuhu@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com
2023-07-13 15:21:48 +02:00
Mohamed Khalfella 6018b585e8 tracing/histograms: Add histograms to hist_vars if they have referenced variables
Hist triggers can have referenced variables without having direct
variables fields. This can be the case if referenced variables are added
for trigger actions. In this case the newly added references will not
have field variables. Not taking such referenced variables into
consideration can result in a bug where it would be possible to remove
hist trigger with variables being refenced. This will result in a bug
that is easily reproducable like so

$ cd /sys/kernel/tracing
$ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events
$ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
$ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger
$ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger

[  100.263533] ==================================================================
[  100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180
[  100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439
[  100.266320]
[  100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4
[  100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[  100.268561] Call Trace:
[  100.268902]  <TASK>
[  100.269189]  dump_stack_lvl+0x4c/0x70
[  100.269680]  print_report+0xc5/0x600
[  100.270165]  ? resolve_var_refs+0xc7/0x180
[  100.270697]  ? kasan_complete_mode_report_info+0x80/0x1f0
[  100.271389]  ? resolve_var_refs+0xc7/0x180
[  100.271913]  kasan_report+0xbd/0x100
[  100.272380]  ? resolve_var_refs+0xc7/0x180
[  100.272920]  __asan_load8+0x71/0xa0
[  100.273377]  resolve_var_refs+0xc7/0x180
[  100.273888]  event_hist_trigger+0x749/0x860
[  100.274505]  ? kasan_save_stack+0x2a/0x50
[  100.275024]  ? kasan_set_track+0x29/0x40
[  100.275536]  ? __pfx_event_hist_trigger+0x10/0x10
[  100.276138]  ? ksys_write+0xd1/0x170
[  100.276607]  ? do_syscall_64+0x3c/0x90
[  100.277099]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  100.277771]  ? destroy_hist_data+0x446/0x470
[  100.278324]  ? event_hist_trigger_parse+0xa6c/0x3860
[  100.278962]  ? __pfx_event_hist_trigger_parse+0x10/0x10
[  100.279627]  ? __kasan_check_write+0x18/0x20
[  100.280177]  ? mutex_unlock+0x85/0xd0
[  100.280660]  ? __pfx_mutex_unlock+0x10/0x10
[  100.281200]  ? kfree+0x7b/0x120
[  100.281619]  ? ____kasan_slab_free+0x15d/0x1d0
[  100.282197]  ? event_trigger_write+0xac/0x100
[  100.282764]  ? __kasan_slab_free+0x16/0x20
[  100.283293]  ? __kmem_cache_free+0x153/0x2f0
[  100.283844]  ? sched_mm_cid_remote_clear+0xb1/0x250
[  100.284550]  ? __pfx_sched_mm_cid_remote_clear+0x10/0x10
[  100.285221]  ? event_trigger_write+0xbc/0x100
[  100.285781]  ? __kasan_check_read+0x15/0x20
[  100.286321]  ? __bitmap_weight+0x66/0xa0
[  100.286833]  ? _find_next_bit+0x46/0xe0
[  100.287334]  ? task_mm_cid_work+0x37f/0x450
[  100.287872]  event_triggers_call+0x84/0x150
[  100.288408]  trace_event_buffer_commit+0x339/0x430
[  100.289073]  ? ring_buffer_event_data+0x3f/0x60
[  100.292189]  trace_event_raw_event_sys_enter+0x8b/0xe0
[  100.295434]  syscall_trace_enter.constprop.0+0x18f/0x1b0
[  100.298653]  syscall_enter_from_user_mode+0x32/0x40
[  100.301808]  do_syscall_64+0x1a/0x90
[  100.304748]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  100.307775] RIP: 0033:0x7f686c75c1cb
[  100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48
[  100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021
[  100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb
[  100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a
[  100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a
[  100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[  100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007
[  100.338381]  </TASK>

We hit the bug because when second hist trigger has was created
has_hist_vars() returned false because hist trigger did not have
variables. As a result of that save_hist_vars() was not called to add
the trigger to trace_array->hist_vars. Later on when we attempted to
remove the first histogram find_any_var_ref() failed to detect it is
being used because it did not find the second trigger in hist_vars list.

With this change we wait until trigger actions are created so we can take
into consideration if hist trigger has variable references. Also, now we
check the return value of save_hist_vars() and fail trigger creation if
save_hist_vars() fails.

Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com

Cc: stable@vger.kernel.org
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 19:08:56 -04:00
Yonghong Song 8cc32a9bbf kallsyms: strip LTO-only suffixes from promoted global functions
Commit 6eb4bd92c1 ("kallsyms: strip LTO suffixes from static functions")
stripped all function/variable suffixes started with '.' regardless
of whether those suffixes are generated at LTO mode or not. In fact,
as far as I know, in LTO mode, when a static function/variable is
promoted to the global scope, '.llvm.<...>' suffix is added.

The existing mechanism breaks live patch for a LTO kernel even if
no <symbol>.llvm.<...> symbols are involved. For example, for the following
kernel symbols:
  $ grep bpf_verifier_vlog /proc/kallsyms
  ffffffff81549f60 t bpf_verifier_vlog
  ffffffff8268b430 d bpf_verifier_vlog._entry
  ffffffff8282a958 d bpf_verifier_vlog._entry_ptr
  ffffffff82e12a1f d bpf_verifier_vlog.__already_done
'bpf_verifier_vlog' is a static function. '_entry', '_entry_ptr' and
'__already_done' are static variables used inside 'bpf_verifier_vlog',
so llvm promotes them to file-level static with prefix 'bpf_verifier_vlog.'.
Note that the func-level to file-level static function promotion also
happens without LTO.

Given a symbol name 'bpf_verifier_vlog', with LTO kernel, current mechanism will
return 4 symbols to live patch subsystem which current live patching
subsystem cannot handle it. With non-LTO kernel, only one symbol
is returned.

In [1], we have a lengthy discussion, the suggestion is to separate two
cases:
  (1). new symbols with suffix which are generated regardless of whether
       LTO is enabled or not, and
  (2). new symbols with suffix generated only when LTO is enabled.

The cleanup_symbol_name() should only remove suffixes for case (2).
Case (1) should not be changed so it can work uniformly with or without LTO.

This patch removed LTO-only suffix '.llvm.<...>' so live patching and
tracing should work the same way for non-LTO kernel.
The cleanup_symbol_name() in scripts/kallsyms.c is also changed to have the same
filtering pattern so both kernel and kallsyms tool have the same
expectation on the order of symbols.

 [1] https://lore.kernel.org/live-patching/20230615170048.2382735-1-song@kernel.org/T/#u

Fixes: 6eb4bd92c1 ("kallsyms: strip LTO suffixes from static functions")
Reported-by: Song Liu <song@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230628181926.4102448-1-yhs@fb.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-12 15:39:34 -07:00
Steven Rostedt (Google) bec3c25c24 tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
The stack_trace event is an event created by the tracing subsystem to
store stack traces. It originally just contained a hard coded array of 8
words to hold the stack, and a "size" to know how many entries are there.
This is exported to user space as:

name: kernel_stack
ID: 4
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int size;	offset:8;	size:4;	signed:1;
	field:unsigned long caller[8];	offset:16;	size:64;	signed:0;

print fmt: "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n",i
 (void *)REC->caller[0], (void *)REC->caller[1], (void *)REC->caller[2],
 (void *)REC->caller[3], (void *)REC->caller[4], (void *)REC->caller[5],
 (void *)REC->caller[6], (void *)REC->caller[7]

Where the user space tracers could parse the stack. The library was
updated for this specific event to only look at the size, and not the
array. But some older users still look at the array (note, the older code
still checks to make sure the array fits inside the event that it read.
That is, if only 4 words were saved, the parser would not read the fifth
word because it will see that it was outside of the event size).

This event was changed a while ago to be more dynamic, and would save a
full stack even if it was greater than 8 words. It does this by simply
allocating more ring buffer to hold the extra words. Then it copies in the
stack via:

	memcpy(&entry->caller, fstack->calls, size);

As the entry is struct stack_entry, that is created by a macro to both
create the structure and export this to user space, it still had the caller
field of entry defined as: unsigned long caller[8].

When the stack is greater than 8, the FORTIFY_SOURCE code notices that the
amount being copied is greater than the source array and complains about
it. It has no idea that the source is pointing to the ring buffer with the
required allocation.

To hide this from the FORTIFY_SOURCE logic, pointer arithmetic is used:

	ptr = ring_buffer_event_data(event);
	entry = ptr;
	ptr += offsetof(typeof(*entry), caller);
	memcpy(ptr, fstack->calls, size);

Link: https://lore.kernel.org/all/20230612160748.4082850-1-svens@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 17:47:49 -04:00
Zheng Yejian 26efd79c46 ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
As comments in ftrace_process_locs(), there may be NULL pointers in
mcount_loc section:
 > Some architecture linkers will pad between
 > the different mcount_loc sections of different
 > object files to satisfy alignments.
 > Skip any NULL pointers.

After commit 20e5227e9f ("ftrace: allow NULL pointers in mcount_loc"),
NULL pointers will be accounted when allocating ftrace pages but skipped
before adding into ftrace pages, this may result in some pages not being
used. Then after commit 706c81f87f ("ftrace: Remove extra helper
functions"), warning may occur at:
  WARN_ON(pg->next);

To fix it, only warn for case that no pointers skipped but pages not used
up, then free those unused pages after releasing ftrace_lock.

Link: https://lore.kernel.org/linux-trace-kernel/20230712060452.3175675-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: 706c81f87f ("ftrace: Remove extra helper functions")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 17:47:37 -04:00
Hou Tao 4ed8b5bcfa bpf: Add object leak check.
The object leak check is cheap. Do it unconditionally to spot difficult races
in bpf_mem_alloc.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov 8e07bb9ebc bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Note that migrate_disable() in bpf_cpumask_release() is still necessary, since
bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do
migrate_disable() there in a follow up.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-14-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov 5af6807bdb bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
objects into free_by_rcu_ttrace list where they are waiting for RCU
task trace grace period to be freed into slab.

The life cycle of objects:
alloc: dequeue free_llist
free: enqeueu free_llist
free_rcu: enqueue free_by_rcu -> waiting_for_gp
free_llist above high watermark -> free_by_rcu_ttrace
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Paul E. McKenney 43a89baecf rcu: Export rcu_request_urgent_qs_task()
If a CPU is executing a long series of non-sleeping system calls,
RCU grace periods can be delayed for on the order of a couple hundred
milliseconds.  This is normally not a problem, but if each system call
does a call_rcu(), those callbacks can stack up.  RCU will eventually
notice this callback storm, but use of rcu_request_urgent_qs_task()
allows the code invoking call_rcu() to give RCU a heads up.

This function is not for general use, not yet, anyway.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-11-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov 04fabf00b4 bpf: Allow reuse from waiting_for_gp_ttrace list.
alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov 822fb26bdb bpf: Add a hint to allocated objects.
To address OOM issue when one cpu is allocating and another cpu is freeing add
a target bpf_mem_cache hint to allocated objects and when local cpu free_llist
overflows free to that bpf_mem_cache. The hint addresses the OOM while
maintaining the same performance for common case when alloc/free are done on the
same cpu.

Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case,
since do_call_rcu_ttrace() is called not only for current cpu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov d114dde245 bpf: Change bpf_mem_cache draining process.
The next patch will introduce cross-cpu llist access and existing
irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add
objects back to llist of cpu A. Add 'bool draining' flag.
The modified sequence looks like:
for_each_cpu:
  WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more
  irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
  drain_mem_cache(); // free all objects
rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov 7468048237 bpf: Further refactor alloc_bulk().
In certain scenarios alloc_bulk() might be taking free objects mainly from
free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are
redundant, but they show up in perf profile. Split the loop and only set memcg
when allocating from slab. No performance difference in this patch alone, but
it helps in combination with further patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov 18e027b1c7 bpf: Factor out inc/dec of active flag into helpers.
Factor out local_inc/dec_return(&c->active) into helpers.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov 05ae68656a bpf: Refactor alloc_bulk().
Factor out inner body of alloc_bulk into separate helper.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov 9de3e81521 bpf: Let free_all() return the number of freed elements.
Let free_all() helper return the number of freed elements.
It's not used in this patch, but helps in debug/development of bpf_mem_alloc.

For example this diff for __free_rcu():
-       free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+       printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(),
+       	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));

would show how busy RCU tasks trace is.
In artificial benchmark where one cpu is allocating and different cpu is freeing
the RCU tasks trace won't be able to keep up and the list of objects
would keep growing from thousands to millions and eventually OOMing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov a80672d7e1 bpf: Simplify code of destroy_mem_alloc() with kmemdup().
Use kmemdup() to simplify the code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov 12c8d0f4c8 bpf: Rename few bpf_mem_alloc fields.
Rename:
-       struct rcu_head rcu;
-       struct llist_head free_by_rcu;
-       struct llist_head waiting_for_gp;
-       atomic_t call_rcu_in_progress;
+       struct llist_head free_by_rcu_ttrace;
+       struct llist_head waiting_for_gp_ttrace;
+       struct rcu_head rcu_ttrace;
+       atomic_t call_rcu_ttrace_in_progress;
...
-	static void do_call_rcu(struct bpf_mem_cache *c)
+	static void do_call_rcu_ttrace(struct bpf_mem_cache *c)

to better indicate intended use.

The 'tasks trace' is shortened to 'ttrace' to reduce verbosity.
No functional changes.

Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Linus Torvalds 9a3236ce48 Probes fixes and clean ups for v6.5-rc1:
- Fix fprobe's rethook release timing issue(1). Release rethook after
   ftrace_ops is unregistered so that the rethook is not accessed after
   free.
 
 - Fix fprobe's rethook access timing issue(2). Stop rethook before
   ftrace_ops is unregistered so that the rethook is NOT keep using
   after exiting the unregister_fprobe().
 
 - Fix eprobe cleanup logic. If it attaches to multiple events and failes
   to enable one of them, rollback all enabled events correctly.
 
 - Fix fprobe to unlock ftrace recursion lock correctly when it missed
   by another running kprobe.
 
 - Cleanup kprobe to remove unnecessary NULL.
 
 - Cleanup kprobe to remove unnecessary 0 initializations.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmStawEbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMBkIAJYun4zeXsFeUUNVZMP8
 UlcyBt/uiB1Ch/t1T1wc55plWIDAvUfN+FEltwhb6MJsQgWEjKJxNcH+oquQeqSH
 OkUvV6a8BR73FWbCt1Tm2MQKEG1RHC1R4JCj5GCzP93rQSCnmvr0c1yb6+JKQRx4
 aPgWUjDm9vhYlOXS6heHo0hf0MbXDl1kqHjvMU2MXUMk7NtQ2JEo1Whikf7Drl6D
 ufNqV54GmtuJhgIWAqSk+qWresRvXy0/5i4ONK1Kmq+pdhssjZ4KFsWFQTIkQzdU
 nP8t2EfOp3aglx7ANvEJ+COfFNi0aMnHVBo7BbzsOy7Cq7IZTfD6zOTYSfylGGdA
 Uw4=
 =RBbc
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - Fix fprobe's rethook release issues:

     - Release rethook after ftrace_ops is unregistered so that the
       rethook is not accessed after free.

     - Stop rethook before ftrace_ops is unregistered so that the
       rethook is NOT used after exiting unregister_fprobe()

 - Fix eprobe cleanup logic. If it attaches to multiple events and
   failes to enable one of them, rollback all enabled events correctly.

 - Fix fprobe to unlock ftrace recursion lock correctly when it missed
   by another running kprobe.

 - Cleanup kprobe to remove unnecessary NULL.

 - Cleanup kprobe to remove unnecessary 0 initializations.

* tag 'probes-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()
  kernel: kprobes: Remove unnecessary ‘0’ values
  kprobes: Remove unnecessary ‘NULL’ values from correct_ret_addr
  fprobe: add unlock to match a succeeded ftrace_test_recursion_trylock
  kernel/trace: Fix cleanup logic of enable_trace_eprobe
  fprobe: Release rethook after the ftrace_ops is unregistered
2023-07-12 12:01:16 -07:00
Zheng Yejian 7e42907f3a ring-buffer: Fix deadloop issue on reading trace_pipe
Soft lockup occurs when reading file 'trace_pipe':

  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488]
  [...]
  RIP: 0010:ring_buffer_empty_cpu+0xed/0x170
  RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246
  RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb
  RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218
  RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f
  R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901
  R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000
  [...]
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   __find_next_entry+0x1a8/0x4b0
   ? peek_next_entry+0x250/0x250
   ? down_write+0xa5/0x120
   ? down_write_killable+0x130/0x130
   trace_find_next_entry_inc+0x3b/0x1d0
   tracing_read_pipe+0x423/0xae0
   ? tracing_splice_read_pipe+0xcb0/0xcb0
   vfs_read+0x16b/0x490
   ksys_read+0x105/0x210
   ? __ia32_sys_pwrite64+0x200/0x200
   ? switch_fpu_return+0x108/0x220
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x61/0xc6

Through the vmcore, I found it's because in tracing_read_pipe(),
ring_buffer_empty_cpu() found some buffer is not empty but then it
cannot read anything due to "rb_num_of_entries() == 0" always true,
Then it infinitely loop the procedure due to user buffer not been
filled, see following code path:

  tracing_read_pipe() {
    ... ...
    waitagain:
      tracing_wait_pipe() // 1. find non-empty buffer here
      trace_find_next_entry_inc()  // 2. loop here try to find an entry
        __find_next_entry()
          ring_buffer_empty_cpu();  // 3. find non-empty buffer
          peek_next_entry()  // 4. but peek always return NULL
            ring_buffer_peek()
              rb_buffer_peek()
                rb_get_reader_page()
                  // 5. because rb_num_of_entries() == 0 always true here
                  //    then return NULL
      // 6. user buffer not been filled so goto 'waitgain'
      //    and eventually leads to an deadloop in kernel!!!
  }

By some analyzing, I found that when resetting ringbuffer, the 'entries'
of its pages are not all cleared (see rb_reset_cpu()). Then when reducing
the ringbuffer, and if some reduced pages exist dirty 'entries' data, they
will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which
cause wrong 'overrun' count and eventually cause the deadloop issue.

To fix it, we need to clear every pages in rb_reset_cpu().

Link: https://lore.kernel.org/linux-trace-kernel/20230708225144.3785600-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: a5fb833172 ("ring-buffer: Fix uninitialized read_stamp")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 12:15:26 -04:00
Arnd Bergmann 7d8b31b73c tracing: arm64: Avoid missing-prototype warnings
These are all tracing W=1 warnings in arm64 allmodconfig about missing
prototypes:

kernel/trace/trace_kprobe_selftest.c:7:5: error: no previous prototype for 'kprobe_trace_selftest_target' [-Werror=missing-pro
totypes]
kernel/trace/ftrace.c:329:5: error: no previous prototype for '__register_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:372:5: error: no previous prototype for '__unregister_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:4130:15: error: no previous prototype for 'arch_ftrace_match_adjust' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:243:15: error: no previous prototype for 'ftrace_return_to_handler' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:358:6: error: no previous prototype for 'ftrace_graph_sleep_time_control' [-Werror=missing-prototypes]
arch/arm64/kernel/ftrace.c:460:6: error: no previous prototype for 'prepare_ftrace_return' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2172:5: error: no previous prototype for 'syscall_trace_enter' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2195:6: error: no previous prototype for 'syscall_trace_exit' [-Werror=missing-prototypes]

Move the declarations to an appropriate header where they can be seen
by the caller and callee, and make sure the headers are included where
needed.

Link: https://lore.kernel.org/linux-trace-kernel/20230517125215.930689-1-arnd@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Florent Revest <revest@chromium.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
[ Fixed ftrace_return_to_handler() to handle CONFIG_HAVE_FUNCTION_GRAPH_RETVAL case ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 12:06:04 -04:00
Andrii Nakryiko f42bcd168d bpf: teach verifier actual bounds of bpf_get_smp_processor_id() result
bpf_get_smp_processor_id() helper returns current CPU on which BPF
program runs. It can't return value that is bigger than maximum allowed
number of CPUs (minus one, due to zero indexing). Teach BPF verifier to
recognize that. This makes it possible to use bpf_get_smp_processor_id()
result to index into arrays without extra checks, as demonstrated in
subsequent selftests/bpf patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230711232400.1658562-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-12 07:57:18 -07:00
Yafang Shao 1b715e1b0e bpf: Support ->fill_link_info for perf_event
By introducing support for ->fill_link_info to the perf_event link, users
gain the ability to inspect it using `bpftool link show`. While the current
approach involves accessing this information via `bpftool perf show`,
consolidating link information for all link types in one place offers
greater convenience. Additionally, this patch extends support to the
generic perf event, which is not currently accommodated by
`bpftool perf show`. While only the perf type and config are exposed to
userspace, other attributes such as sample_period and sample_freq are
ignored. It's important to note that if kptr_restrict is not permitted, the
probed address will not be exposed, maintaining security measures.

A new enum bpf_perf_event_type is introduced to help the user understand
which struct is relevant.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao 57d4853765 bpf: Add a common helper bpf_copy_to_user()
Add a common helper bpf_copy_to_user(), which will be used at multiple
places.
No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-8-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao cd3910d005 bpf: Expose symbol's respective address
Since different symbols can share the same name, it is insufficient to only
expose the symbol name. It is essential to also expose the symbol address
so that users can accurately identify which one is being probed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-7-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao 5125e757e6 bpf: Clear the probe_addr for uprobe
To avoid returning uninitialized or random values when querying the file
descriptor (fd) and accessing probe_addr, it is necessary to clear the
variable prior to its use.

Fixes: 41bdc4b40e ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-6-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao f1a414537e bpf: Protect probed address based on kptr_restrict setting
The probed address can be accessed by userspace through querying the task
file descriptor (fd). However, it is crucial to adhere to the kptr_restrict
setting and refrain from exposing the address if it is not permitted.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao 7ac8d0d261 bpf: Support ->fill_link_info for kprobe_multi
With the addition of support for fill_link_info to the kprobe_multi link,
users will gain the ability to inspect it conveniently using the
`bpftool link show`. This enhancement provides valuable information to the
user, including the count of probed functions and their respective
addresses. It's important to note that if the kptr_restrict setting is not
permitted, the probed address will not be exposed, ensuring security.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:50 -07:00
Pu Lehui 4369016497 bpf: cpumap: Fix memory leak in cpu_map_update_elem
Syzkaller reported a memory leak as follows:

BUG: memory leak
unreferenced object 0xff110001198ef748 (size 192):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 32 bytes):
    00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00  ....J...........
    00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff  ........(.......
  backtrace:
    [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

BUG: memory leak
unreferenced object 0xff110001198ef528 (size 192):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

BUG: memory leak
unreferenced object 0xff1100010fd93d68 (size 8):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00                          ........
  backtrace:
    [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170
    [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

In the cpu_map_update_elem flow, when kthread_stop is called before
calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit
of kthread has been set by kthread_stop, the threadfn of rcpu->kthread
will never be executed, and rcpu->refcnt will never be 0, which will
lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be
released.

Calling kthread_stop before executing kthread's threadfn will return
-EINTR. We can complete the release of memory resources in this state.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230711115848.2701559-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 19:57:03 -07:00
Miaohe Lin d1d4ff5d11 cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED
Put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED to fix the warning
of 'cgroup_tryget_css' defined but not used [-Wunused-function] when
CONFIG_CGROUP_SCHED is disabled.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-11 11:46:00 -10:00
Chungkai Yang 3a8395b565 PM: QoS: Restore support for default value on frequency QoS
Commit 8d36694245 ("PM: QoS: Add check to make sure CPU freq is
non-negative") makes sure CPU freq is non-negative to avoid negative
value converting to unsigned data type. However, when the value is
PM_QOS_DEFAULT_VALUE, pm_qos_update_target specifically uses
c->default_value which is set to FREQ_QOS_MIN/MAX_DEFAULT_VALUE when
cpufreq_policy_alloc is executed, for this case handling.

Adding check for PM_QOS_DEFAULT_VALUE to let default setting work will
fix this problem.

Fixes: 8d36694245 ("PM: QoS: Add check to make sure CPU freq is non-negative")
Link: https://lore.kernel.org/lkml/20230626035144.19717-1-Chung-kai.Yang@mediatek.com/
Link: https://lore.kernel.org/lkml/20230627071727.16646-1-Chung-kai.Yang@mediatek.com/
Link: https://lore.kernel.org/lkml/CAJZ5v0gxNOWhC58PHeUhW_tgf6d1fGJVZ1x91zkDdht11yUv-A@mail.gmail.com/
Signed-off-by: Chungkai Yang <Chung-kai.Yang@mediatek.com>
Cc: 6.0+ <stable@vger.kernel.org> # 6.0+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-11 20:09:57 +02:00
Azat Khuzhin c9e4bf607d PM: hibernate: Fix writing maj:min to /sys/power/resume
resume_store() first calls lookup_bdev() and after tries to handle
maj:min, but it does not reset the error before, hence if you will write
maj:min you will get ENOENT:

    # echo 259:2 >| /sys/power/resume
    bash: echo: write error: No such file or directory

This also should fix hiberation via systemd, since it uses this way.

Fixes: 1e8c813b08 ("PM: hibernate: don't use early_lookup_bdev in resume_store")
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-11 19:58:08 +02:00
Beau Belgrave d0a3022f30 tracing/user_events: Fix struct arg size match check
When users register an event the name of the event and it's argument are
checked to ensure they match if the event already exists. Normally all
arguments are in the form of "type name", except for when the type
starts with "struct ". In those cases, the size of the struct is passed
in addition to the name, IE: "struct my_struct a 20" for an argument
that is of type "struct my_struct" with a field name of "a" and has the
size of 20 bytes.

The current code does not honor the above case properly when comparing
a match. This causes the event register to fail even when the same
string was used for events that contain a struct argument within them.
The example above "struct my_struct a 20" generates a match string of
"struct my_struct a" omitting the size field.

Add the struct size of the existing field when generating a comparison
string for a struct field to ensure proper match checking.

Link: https://lkml.kernel.org/r/20230629235049.581-2-beaub@linux.microsoft.com

Cc: stable@vger.kernel.org
Fixes: e6f89a1498 ("tracing/user_events: Ensure user provided strings are safely formatted")
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-10 21:38:13 -04:00
Masami Hiramatsu (Google) 195b9cb5b2 fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()
Ensure running fprobe_exit_handler() has finished before
calling rethook_free() in the unregister_fprobe() so that caller can free
the fprobe right after unregister_fprobe().

unregister_fprobe() ensured that all running fprobe_entry/exit_handler()
have finished by calling unregister_ftrace_function() which synchronizes
RCU. But commit 5f81018753 ("fprobe: Release rethook after the ftrace_ops
is unregistered") changed to call rethook_free() after
unregister_ftrace_function(). So call rethook_stop() to make rethook
disabled before unregister_ftrace_function() and ensure it again.

Here is the possible code flow that can call the exit handler after
unregister_fprobe().

------
 CPU1                              CPU2
 call unregister_fprobe(fp)
 ...
                                   __fprobe_handler()
                                   rethook_hook() on probed function
 unregister_ftrace_function()
                                   return from probed function
                                   rethook hooks
                                   find rh->handler == fprobe_exit_handler
                                   call fprobe_exit_handler()
 rethook_free():
   set rh->handler = NULL;
 return from unreigster_fprobe;
                                   call fp->exit_handler() <- (*)
------

(*) At this point, the exit handler is called after returning from
unregister_fprobe().

This fixes it as following;
------
 CPU1                              CPU2
 call unregister_fprobe()
 ...
 rethook_stop():
   set rh->handler = NULL;
                                   __fprobe_handler()
                                   rethook_hook() on probed function
 unregister_ftrace_function()
                                   return from probed function
                                   rethook hooks
                                   find rh->handler == NULL
                                   return from rethook
 rethook_free()
 return from unreigster_fprobe;
------

Link: https://lore.kernel.org/all/168873859949.156157.13039240432299335849.stgit@devnote2/

Fixes: 5f81018753 ("fprobe: Release rethook after the ftrace_ops is unregistered")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-11 09:17:01 +09:00
Imran Khan 0d3a00b370 smp: Reduce NMI traffic from CSD waiters to CSD destination
On systems with hundreds of CPUs, if most of the CPUs detect a CSD hang,
then all of these waiting CPUs send an NMI to the destination CPU in
order to dump its backtrace.

Given enough NMIs, the destination CPU will spent much of its time
producing backtraces, thus further delaying that CPU's response to the
original CSD IPI.  In the worst case, by the time destination CPU is
done producing all of these backtrace NMIs, the CSD wait timeout will
have elapsed so that the waiters resend their backtrace NMIs again,
further delaying forward progress.

Therefore, to avoid these delays, issue the backtrace NMI only from
the first waiter.  The destination CPU's other waiters can make use of
backtrace obtained from the first waiter's NMI.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-10 14:19:04 -07:00
Imran Khan 5bd00f6db0 smp: Reduce logging due to dump_stack of CSD waiters
If a waiter is waiting for CSD lock, its call stack will not change
between first and subsequent hang detection for the same CSD lock.
Therefore, do dump_stack only for first-time detection for a given waiter.

This avoids excessive logging on systems with hundreds of CPUs where
repetitive dump_stack from hundreds of CPUs would otherwise flood the
console.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-10 14:19:04 -07:00
Waiman Long 3ae0b77321 cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier()
A single partition setup and tear-down operation can lead to
multiple rebuild_sched_domains_locked() calls which is a waste of
effort. This can partly be mitigated by adding a flag to suppress the
rebuild_sched_domains_locked() call in update_cpumasks_hier(). Since
a Boolean flag has already been passed as the 3rd argument to
update_cpumasks_hier(), we can extend that to a full flag word.

The sched domain rebuild suppression is now enabled in
update_sibling_cpumasks() as all it callers will do the sched domain
rebuild after its return later on anyway.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 11:01:23 -10:00
Waiman Long 99fe36ba6f cgroup/cpuset: Improve temporary cpumasks handling
The limitation that update_parent_subparts_cpumask() can only use
addmask & delmask in the given tmp cpumasks is fragile and may lead to
unexpected error.

Fix this problem by allocating/freeing a struct tmpmasks in
update_cpumask() to avoid reusing the cpumasks in trial_cs.

With this change, we can move the update_tasks_cpumask() for the
parent and update_sibling_cpumasks() for the sibling to inside
update_parent_subparts_cpumask().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 11:01:03 -10:00
Waiman Long a86ce68078 cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling
Extract out the setting of CS_CPU_EXCLUSIVE and CS_SCHED_LOAD_BALANCE
flags as well as the rebuilding of scheduling domains into the new
update_partition_exclusive() and update_partition_sd_lb() helper
functions to simplify the logic. The update_partition_exclusive()
helper is called mainly at the beginning of the caller, but it may be
called at the end too. The update_partition_sd_lb() helper is called
at the end of the caller.

This patch should reduce the chance that cpuset partition will end up
in an incorrect state.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 11:00:12 -10:00
Waiman Long c8c926200c cgroup/cpuset: Inherit parent's load balance state in v2
Since commit f28e22441f ("cgroup/cpuset: Add a new isolated
cpus.partition type"), the CS_SCHED_LOAD_BALANCE bit of a v2 cpuset
can be on or off. The child cpusets of a partition root must have the
same setting as its parent or it may screw up the rebuilding of sched
domains. Fix this problem by making sure the a child v2 cpuset will
follows its parent cpuset load balance state unless the child cpuset
is a new partition root itself.

Fixes: f28e22441f ("cgroup/cpuset: Add a new isolated cpus.partition type")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:59:27 -10:00
Miaohe Lin 868f87b375 cgroup: fix obsolete comment above for_each_css()
cgroup_tree_mutex is removed since commit 8353da1f91 ("cgroup: remove
cgroup_tree_mutex"), update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:47:25 -10:00
tiozhang ace3c5499e workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time
Motivation of doing this is to better improve boot times for devices when
we want to prevent our workqueue works from running on some specific CPUs,
e,g, some CPUs are busy with interrupts.

Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:42:51 -10:00
Tetsuo Handa 20bdedafd2 workqueue: Warn attempt to flush system-wide workqueues.
Based on commit c4f135d643 ("workqueue: Wrap flush_workqueue() using
a macro"), all in-tree users stopped flushing system-wide workqueues.
Therefore, start emitting runtime message so that all out-of-tree users
will understand that they need to update their code.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:39:17 -10:00
Michal Koutný 0a67b847e1 cpuset: Allow setscheduler regardless of manipulated task
When we migrate a task between two cgroups, one of the checks is a
verification whether we can modify task's scheduler settings
(cap_task_setscheduler()).

An implicit migration occurs also when enabling a controller on the
unified hierarchy (think of parent to child migration). The
aforementioned check may be problematic if the caller of the migration
(enabling a controller) has no permissions over migrated tasks.
For instance, a user's cgroup that ends up running a process of a
different user. Although cgroup permissions are configured favorably,
the enablement fails due to the foreign process [1].

Change the behavior by relaxing the permissions check on the unified
hierarchy when no effective change would happen.
This is in accordance with unified hierarchy attachment behavior when
permissions of the source to target cgroups are decisive whereas the
migrated task is opaque (as opposed to more restrictive check in
__cgroup1_procs_write()).

Notice that foreign task's affinity may still be modified if the user
can modify destination cgroup's cpuset attributes
(update_tasks_cpumask() does no permissions check). The permissions
check could thus be skipped on v2 even when affinity changes. Stay
conservative in this patch though.

[1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:28:43 -10:00
Miaohe Lin 48f074565b cgroup/cpuset: avoid unneeded cpuset_mutex re-lock
cpuset_mutex unlock and lock pair is only needed when transferring tasks
out of empty cpuset. Avoid unneeded cpuset_mutex re-lock when !is_empty
to save cpu cycles.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:26:06 -10:00
Miaohe Lin a453be9725 cgroup/cpuset: simplify the percpu kthreads check in update_tasks_cpumask()
kthread_is_per_cpu() can be called directly without checking whether
PF_KTHREAD is set in task->flags. So remove PF_KTHREAD check to make
code more concise.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:24:56 -10:00
Miaohe Lin 1299eb2b0a cgroup: minor cleanup for cgroup_extra_stat_show()
Make it under CONFIG_CGROUP_SCHED to rid of __maybe_unused annotation.
And further fetch cgrp inside cgroup_extra_stat_show() directly to rid
of __maybe_unused annotation of cgrp. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 10:23:41 -10:00
Li zeming 9ce170cef6 kernel: params: Remove unnecessary ‘0’ values from err
err is assigned first, so it does not need to initialize the assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-07-10 12:47:01 -07:00
Palmer Dabbelt c05780ef3c module: Ignore RISC-V mapping symbols too
RISC-V has an extended form of mapping symbols that we use to encode
the ISA when it changes in the middle of an ELF.  This trips up modpost
as a build failure, I haven't yet verified it yet but I believe the
kallsyms difference should result in stacks looking sane again.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/all/9d9e2902-5489-4bf0-d9cb-556c8e5d71c2@infradead.org/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-07-10 12:45:23 -07:00
Miaohe Lin 2246ca53d7 cgroup: remove unneeded return value of cgroup_rm_cftypes_locked()
The return value of cgroup_rm_cftypes_locked() is always 0. So remove
it to simplify the code. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10 09:39:22 -10:00
Li zeming ed9492dfef kernel: kprobes: Remove unnecessary ‘0’ values
it is assigned first, so it does not need to initialize the assignment.

Link: https://lore.kernel.org/all/20230711185353.3218-1-zeming@nfschina.com/

Signed-off-by: Li zeming <zeming@nfschina.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-11 00:50:51 +09:00
Li zeming e1164787f2 kprobes: Remove unnecessary ‘NULL’ values from correct_ret_addr
The 'correct_ret_addr' pointer is always set in the later code, no need
to initialize it at definition time.

Link: https://lore.kernel.org/all/20230704194359.3124-1-zeming@nfschina.com/

Signed-off-by: Li zeming <zeming@nfschina.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-11 00:50:35 +09:00
Ze Gao 5f0c584daf fprobe: add unlock to match a succeeded ftrace_test_recursion_trylock
Unlock ftrace recursion lock when fprobe_kprobe_handler() is failed
because of some running kprobe.

Link: https://lore.kernel.org/all/20230703092336.268371-1-zegao@tencent.com/

Fixes: 3cc4e2c5fb ("fprobe: make fprobe_kprobe_handler recursion free")
Reported-by: Yafang <laoar.shao@gmail.com>
Closes: https://lore.kernel.org/linux-trace-kernel/CALOAHbC6UpfFOOibdDiC7xFc5YFUgZnk3MZ=3Ny6we=AcrNbew@mail.gmail.com/
Signed-off-by: Ze Gao <zegao@tencent.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-11 00:48:33 +09:00
Tzvetomir Stoyanov (VMware) cf0a624dc7 kernel/trace: Fix cleanup logic of enable_trace_eprobe
The enable_trace_eprobe() function enables all event probes, attached
to given trace probe. If an error occurs in enabling one of the event
probes, all others should be roll backed. There is a bug in that roll
back logic - instead of all event probes, only the failed one is
disabled.

Link: https://lore.kernel.org/all/20230703042853.1427493-1-tz.stoyanov@gmail.com/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-11 00:44:57 +09:00
Uros Bizjak 1af61adb3a perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin
Use local_try_cmpxchg instead of local_cmpxchg (*ptr, old, new) == old
in __perf_output_begin.  x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230708090048.63046-2-ubizjak@gmail.com
2023-07-10 09:52:36 +02:00
Uros Bizjak 28fd85a10a perf/core: Use local64_try_cmpxchg in perf_swevent_set_period
Use local64_try_cmpxchg instead of local64_cmpxchg (*ptr, old, new) == old
in perf_swevent_set_period.  x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230708081129.45915-1-ubizjak@gmail.com
2023-07-10 09:52:35 +02:00
Suren Baghdasaryan aff037078e sched/psi: use kernfs polling functions for PSI trigger polling
Destroying psi trigger in cgroup_file_release causes UAF issues when
a cgroup is removed from under a polling process. This is happening
because cgroup removal causes a call to cgroup_file_release while the
actual file is still alive. Destroying the trigger at this point would
also destroy its waitqueue head and if there is still a polling process
on that file accessing the waitqueue, it will step on the freed pointer:

do_select
  vfs_poll
                           do_rmdir
                             cgroup_rmdir
                               kernfs_drain_open_files
                                 cgroup_file_release
                                   cgroup_pressure_release
                                     psi_trigger_destroy
                                       wake_up_pollfree(&t->event_wait)
// vfs_poll is unblocked
                                       synchronize_rcu
                                       kfree(t)
  poll_freewait -> UAF access to the trigger's waitqueue head

Patch [1] fixed this issue for epoll() case using wake_up_pollfree(),
however the same issue exists for synchronous poll() case.
The root cause of this issue is that the lifecycles of the psi trigger's
waitqueue and of the file associated with the trigger are different. Fix
this by using kernfs_generic_poll function when polling on cgroup-specific
psi triggers. It internally uses kernfs_open_node->poll waitqueue head
with its lifecycle tied to the file's lifecycle. This also renders the
fix in [1] obsolete, so revert it.

[1] commit c2dbe32d5d ("sched/psi: Fix use-after-free in ep_remove_wait_queue()")

Fixes: 0e94682b73 ("psi: introduce psi monitor")
Closes: https://lore.kernel.org/all/20230613062306.101831-1-lujialin4@huawei.com/
Reported-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230630005612.1014540-1-surenb@google.com
2023-07-10 09:52:30 +02:00
Miaohe Lin ae2ad293d6 sched/fair: Use recent_used_cpu to test p->cpus_ptr
When checking whether a recently used CPU can be a potential idle
candidate, recent_used_cpu should be used to test p->cpus_ptr as
p->recent_used_cpu is not equal to recent_used_cpu and candidate
decision is made based on recent_used_cpu here.

Fixes: 89aafd67f2 ("sched/fair: Use prev instead of new target as recent_used_cpu")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230620080747.359122-1-linmiaohe@huawei.com
2023-07-10 09:52:30 +02:00
Linus Torvalds f71f64210d dma-mapping fixes for Linux 6.5
- swiotlb area sizing fixes (Petr Tesarik)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmSq57sLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOh0xAAkwklIxxzXvNlgjvy2hdgPWImPS8tGPDSIsqA9TDD
 WDZq89nz/ndnchdPObDvyJXmfBgqa0qCHqopBVPqMKv5a1pKZhrRXYlbajGQQwji
 MIqefTLZ/VGw7bDpEivOt+yadwQ3KxuxWWs7/JPKLLReSJ22H8P+jkrK7P7kBL5X
 YaMtZG9d86fvFHnQHKdAOlF1iCvnoZDHPcvaZbI6m5mMSZ+HIYqK5pP1MTUEAbIU
 MX4ZSI7/mL0q67+kZuM/NCSeq1pH0Cd0D2DGm+8k/y87G81GS6E5Wgg+xO7vYiXf
 5ChuwlAO9K9KhH7NIRkKhkad/Ii89ENXSyv5gmPRoKYK5FXajnHSlJTUrZojV6XC
 Pbsd9ATVzV0rY61EPyh6G1a+Ttp/pwMp+W0I2fi032GVAePQ/vhB9x9O+2O/3QiC
 v80nUSatkciZncWqkguhp3NONsRmLKep3CCQnEAA/gLs27B0avdQeslnqbOOUQKd
 Si+djIoel8ONjQ+mW8eFCsVYMH1xFSo0aGsgGe0y2cyBE3DN1AW9eRnOXWT4C1PR
 UyDlx8ACs87ojec+YRQFYk2/PbsU7CQiH1pteXvBHcbhiVUAvrtXtg6ANQ+7066P
 IIduAZmlHcHb1BhyrSQbAtRllVLIp/l9IAkCSY9SvL0tjh/B5CaRBD5m2Taow5I/
 KUI=
 =4Lfc
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - swiotlb area sizing fixes (Petr Tesarik)

* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: reduce the number of areas to match actual memory pool size
  swiotlb: always set the number of areas before allocating the pool
2023-07-09 10:24:22 -07:00
Linus Torvalds a9943ad3dd - Optimize IRQ domain's name assignment
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSqa70ACgkQEsHwGGHe
 VUrTlhAAtfvA+mCeWVwn42827YlJeMvFV6RedERSk8x5CO/IJgHx2YFB0UXlznuM
 xQSIC7pkY2He62ggb+s6m+Pb1M3bxxJDaRSRnQGHOqXuSNuzokqjwvVMaSBE9s5W
 fgvdbZ4CDEynW7awiDq1PaMcJFzbc6RMbBDspNDke2ikZgGvqqWl3Gx2DHhGyNfQ
 UFJHETejxhfheU7Y9c06P2ToTyzc0RcaPFQfFmYIse+c6iTcjnqr9NRHCIklSygN
 F3VjSEXC1MkAZIAuV9pjkl3z01xfvDe1WqxH3jID/vM97AY08DqmBQC31VZOBmmV
 shndEUpc2yi7ued0WU1XLf1sMyEHm5rgF/19gRhFR9sxu4Zm9+JxOofo0URwqtsV
 Z3d0CxpDrF0ATJs2zTstUmOG0AD6SBfC+BQZi3M6rohXgR0z04dBnyTgCnxAs60u
 d5byFza5nZ/sX7KZXgVlKfH5ej0AcZuyFM/qMKqiTqzz5C4dv0Hq7y718aaecxmz
 W9A3olDGhC48pr9250TthEMhBzRdXvKgKQBUhC2TY/0hUBHyvmHeU//ql54XYhgh
 XVx3qI7yieZC3d27hZN0WTCG6qSCJoq8LROmGDIbx16Uj/cm1AHzJTJ+5CbCOafZ
 41gPBoNCNYNXlgYNBXpVcPRtfcK98fGl3jpPOk0llaSa6IW+Lcc=
 =ATsc
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq update from Borislav Petkov:

 - Optimize IRQ domain's name assignment

* tag 'irq_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqdomain: Use return value of strreplace()
2023-07-09 10:16:04 -07:00
Suren Baghdasaryan fb49c45532 fork: lock VMAs of the parent process when forking
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().

We must not take any concurrent page faults on the source vma's as they
are being processed, as we expect both the vma and the pte's behind it
to be stable.  For example, the anon_vma_fork() expects the parents
vma->anon_vma to not change during the vma copy.

A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.

Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults.  But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.

This fix can potentially regress some fork-heavy workloads.  Kernel
build time did not show noticeable regression on a 56-core machine while
a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression.  If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
optimizations are possible if this regression proves to be problematic.

Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea0 ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-08 14:08:02 -07:00
Linus Torvalds 8066178f53 Tracing fixes for 6.5:
- Fix bad git merge of #endif in arm64 code
   A merge of the arm64 tree caused #endif to go into the wrong place
 
 - Fix crash on lseek of write access to tracefs/error_log
   Opening error_log as write only, and then doing an lseek() causes
   a kernel panic, because the lseek() handle expects a "seq_file"
   to exist (which is not done on write only opens). Use tracing_lseek()
   that tests for this instead of calling the default seq lseek handler.
 
 - Check for negative instead of -E2BIG for error on strscpy() returns
   Instead of testing for -E2BIG from strscpy(), to be more robust,
   check for less than zero, which will make sure it catches any error
   that strscpy() may someday return.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZKbQUhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpWaAP9zQ1eLQSfMt0dHH01OBSJvc2mMd4QJ
 VZtWZ+xTSvk+4gD/axDzDS7Qisfrrli+1oQSPwVik2SXiz0SPJqJ25m9zw4=
 =xMlg
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix bad git merge of #endif in arm64 code

   A merge of the arm64 tree caused #endif to go into the wrong place

 - Fix crash on lseek of write access to tracefs/error_log

   Opening error_log as write only, and then doing an lseek() causes a
   kernel panic, because the lseek() handle expects a "seq_file" to
   exist (which is not done on write only opens). Use tracing_lseek()
   that tests for this instead of calling the default seq lseek handler.

 - Check for negative instead of -E2BIG for error on strscpy() returns

   Instead of testing for -E2BIG from strscpy(), to be more robust,
   check for less than zero, which will make sure it catches any error
   that strscpy() may someday return.

* tag 'trace-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/boot: Test strscpy() against less than zero for error
  arm64: ftrace: fix build error with CONFIG_FUNCTION_GRAPH_TRACER=n
  tracing: Fix null pointer dereference in tracing_err_log_open()
2023-07-06 19:07:15 -07:00
Anton Protopopov 515ee52b22 bpf: make preloaded map iterators to display map elements count
Add another column to the /sys/fs/bpf/maps.debug iterator to display
cur_entries, the current number of entries in the map as is returned
by the bpf_map_sum_elem_count kfunc. Also fix formatting.

Example:

    # cat /sys/fs/bpf/maps.debug
      id name             max_entries  cur_entries
       2 iterator.rodata            1            0
     125 cilium_auth_map       524288          666
     126 cilium_runtime_          256            0
     127 cilium_signals            32            0
     128 cilium_node_map        16384         1344
     129 cilium_events             32            0
     ...

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Anton Protopopov 9bc421b6be bpf: populate the per-cpu insertions/deletions counters for hashmaps
Initialize and utilize the per-cpu insertions/deletions counters for hash-based
maps. Non-trivial changes only apply to the preallocated maps for which the
{inc,dec}_elem_count functions are not called, as there's no need in counting
elements to sustain proper map operations.

To increase/decrease percpu counters for preallocated maps we add raw calls to
the bpf_map_{inc,dec}_elem_count functions so that the impact is minimal. For
dynamically allocated maps we add corresponding calls to the existing
{inc,dec}_elem_count functions.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Anton Protopopov 803370d3d3 bpf: add a new kfunc to return current bpf_map elements count
A bpf_map_sum_elem_count kfunc was added to simplify getting the sum of the map
per-cpu element counters. If a map doesn't implement the counter, then the
function will always return 0.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Linus Torvalds 7b82e90411 asm-generic updates for 6.5
These are cleanups for architecture specific header files:
 
  - the comments in include/linux/syscalls.h have gone out of sync
    and are really pointless, so these get removed
 
  - The asm/bitsperlong.h header no longer needs to be architecture
    specific on modern compilers, so use a generic version for newer
    architectures that use new enough userspace compilers
 
  - A cleanup for virt_to_pfn/virt_to_bus to have proper type
    checking, forcing the use of pointers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmSl138ACgkQYKtH/8kJ
 UieqWxAA2WjNVfyuieYckglOVE0PZPs2fzCwyzTY5iUTH3gE5cBFWJDWcg2EnouG
 v3X3htEQcowYWaCF9+rypQXaGiSx4WXi2Bjxnz3D/BcreqWPI4eSQ0fpGG5SURTY
 2zYF72GTt4JGR++l+7/R9MZwPbwYDT9BsD5tkel8PxnyVLM6/c5xFvbjzRSKFE8x
 SMN1jGZ62ITLNf/8coAOEPNxBYtDT6yQyu7P2sx5cd65LAQq9yLKjFklnBBovgWT
 OoCIZAdGkhcNwOh1LjyHcdNdpfNJGceKyqKPqty07IhCQuF2jxiyFYFzuBbeyQfE
 S0itN8o/MIfUmxaQl3e8dPAVb1RlNVr1zfQ6y4tUtWNdkNL2WwSnSQSRHrBfHxCQ
 QCF++PMeFcLhGwMYtqdNJ7XGLQ0PsjD74pRf0vo+vjmqDk2BJsJBP57VU+8MJn5r
 SoxqnJ0WxLvm1TfrNKusV7zMNWquc2duJDW40zsOssP4itjYELSI6qa56qmzlqmX
 zKmRx6mxAlx9RRK8FHXFYHbz3p93vv8z9vTOZV3AjIjjED960CLknUAwCC8FoJyz
 9b5wyMXsLQHQjGt8luAvPc6OiU0EiU9a4SPK+feWcv27serFvnjJlRTS/yG2Z3zd
 BYsUgsXHypsdoud+aE7MeCy7fE8n3mhoyMQQRBkOMFJ7RsG6wAE=
 =S/he
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are cleanups for architecture specific header files:

   - the comments in include/linux/syscalls.h have gone out of sync and
     are really pointless, so these get removed

   - The asm/bitsperlong.h header no longer needs to be architecture
     specific on modern compilers, so use a generic version for newer
     architectures that use new enough userspace compilers

   - A cleanup for virt_to_pfn/virt_to_bus to have proper type checking,
     forcing the use of pointers"

* tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  syscalls: Remove file path comments from headers
  tools arch: Remove uapi bitsperlong.h of hexagon and microblaze
  asm-generic: Unify uapi bitsperlong.h for arm64, riscv and loongarch
  m68k/mm: Make pfn accessors static inlines
  arm64: memory: Make virt_to_pfn() a static inline
  ARM: mm: Make virt_to_pfn() a static inline
  asm-generic/page.h: Make pfn accessors static inlines
  xen/netback: Pass (void *) to virt_to_page()
  netfs: Pass a pointer to virt_to_page()
  cifs: Pass a pointer to virt_to_page() in cifsglob
  cifs: Pass a pointer to virt_to_page()
  riscv: mm: init: Pass a pointer to virt_to_page()
  ARC: init: Pass a pointer to virt_to_pfn() in init
  m68k: Pass a pointer to virt_to_pfn() virt_to_page()
  fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
2023-07-06 10:06:04 -07:00
Kumar Kartikeya Dwivedi 5415ccd50a bpf: Fix max stack depth check for async callbacks
The check_max_stack_depth pass happens after the verifier's symbolic
execution, and attempts to walk the call graph of the BPF program,
ensuring that the stack usage stays within bounds for all possible call
chains. There are two cases to consider: bpf_pseudo_func and
bpf_pseudo_call. In the former case, the callback pointer is loaded into
a register, and is assumed that it is passed to some helper later which
calls it (however there is no way to be sure), but the check remains
conservative and accounts the stack usage anyway. For this particular
case, asynchronous callbacks are skipped as they execute asynchronously
when their corresponding event fires.

The case of bpf_pseudo_call is simpler and we know that the call is
definitely made, hence the stack depth of the subprog is accounted for.

However, the current check still skips an asynchronous callback even if
a bpf_pseudo_call was made for it. This is erroneous, as it will miss
accounting for the stack usage of the asynchronous callback, which can
be used to breach the maximum stack depth limit.

Fix this by only skipping asynchronous callbacks when the instruction is
not a pseudo call to the subprog.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230705144730.235802-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-05 19:14:54 -07:00
Linus Torvalds 6843306689 Including fixes from bluetooth, bpf and wireguard.
Current release - regressions:
 
  - nvme-tcp: fix comma-related oops after sendpage changes
 
 Current release - new code bugs:
 
  - ptp: make max_phase_adjustment sysfs device attribute invisible
    when not supported
 
 Previous releases - regressions:
 
  - sctp: fix potential deadlock on &net->sctp.addr_wq_lock
 
  - mptcp:
    - ensure subflow is unhashed before cleaning the backlog
    - do not rely on implicit state check in mptcp_listen()
 
 Previous releases - always broken:
 
  - net: fix net_dev_start_xmit trace event vs skb_transport_offset()
 
  - Bluetooth:
    - fix use-bdaddr-property quirk
    - L2CAP: fix multiple UaFs
    - ISO: use hci_sync for setting CIG parameters
    - hci_event: fix Set CIG Parameters error status handling
    - hci_event: fix parsing of CIS Established Event
    - MGMT: fix marking SCAN_RSP as not connectable
 
  - wireguard: queuing: use saner cpu selection wrapping
 
  - sched: act_ipt: various bug fixes for iptables <> TC interactions
 
  - sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX
 
  - dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging
 
  - eth: sfc: fix null-deref in devlink port without MAE access
 
  - eth: ibmvnic: do not reset dql stats on NON_FATAL err
 
 Misc:
 
  - xsk: honor SO_BINDTODEVICE on bind
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSlu+MACgkQMUZtbf5S
 Irslgw//S7jf/GL8V6y8VL3te+/OPOZnLDTzFFOdy64/y97FE6XIacJUpyWRhtmz
 oSzcSNHETPW9U+xSGa2ZQlKhAXt6n9iRNvUegql+VBb13Iz+l7AdTeoxRv/YuwDo
 5lTOIB6cBw+ATd0oxS6wr8SyUlcvktUKBfTAItjbVM55aXfIUpXIa84+F7avJgIA
 XP1u/3PHhwItmwo/hXhHH0+P0QA8ix1q2SvRB7DAlQLBsTuQhaKjXWQkYYTKw/Nt
 dtvh8iQSs/YXaHMjTa5CK28HOD8+ywIizr+uJ9VaNqIzV0W5JE9IE8P4NFpBcY7t
 kGjTYODOph7dkNmZ5RLj3N+B6CyC57OXDzoo/tr8940UytCLVj9EVyduarLGLx57
 edqK9cUz5kWejyGoyZ4Pvlo/SKvCQ2HKMeiAJ0/nNpTJMFuygMoqGsaD6ttzkXMj
 fZLPjRUK3axd+15ZzhLEf8HyL5Qh+qPqqX9p7NljfMKwhxMWJ5fuICJfdGOSdMJR
 ndL+wPfRPFQwszZ4pbTY2Ivn29mo8ScBOSOEgQs2mOny+zFzTzmqNWz/jcFfQnjS
 cylxBEHrgudT2FuCImZ/v66TM5yakHXqIdpTGG+zsvJWQqjM96Z3I7WRvi0g9d75
 n84il+j34mnzl90j2xEutqUiK7BQ9ZpZBsutPVTKBIHKWWiortI=
 =9yzk
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, bpf and wireguard.

  Current release - regressions:

   - nvme-tcp: fix comma-related oops after sendpage changes

  Current release - new code bugs:

   - ptp: make max_phase_adjustment sysfs device attribute invisible
     when not supported

  Previous releases - regressions:

   - sctp: fix potential deadlock on &net->sctp.addr_wq_lock

   - mptcp:
      - ensure subflow is unhashed before cleaning the backlog
      - do not rely on implicit state check in mptcp_listen()

  Previous releases - always broken:

   - net: fix net_dev_start_xmit trace event vs skb_transport_offset()

   - Bluetooth:
      - fix use-bdaddr-property quirk
      - L2CAP: fix multiple UaFs
      - ISO: use hci_sync for setting CIG parameters
      - hci_event: fix Set CIG Parameters error status handling
      - hci_event: fix parsing of CIS Established Event
      - MGMT: fix marking SCAN_RSP as not connectable

   - wireguard: queuing: use saner cpu selection wrapping

   - sched: act_ipt: various bug fixes for iptables <> TC interactions

   - sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX

   - dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging

   - eth: sfc: fix null-deref in devlink port without MAE access

   - eth: ibmvnic: do not reset dql stats on NON_FATAL err

  Misc:

   - xsk: honor SO_BINDTODEVICE on bind"

* tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  nfp: clean mc addresses in application firmware when closing port
  selftests: mptcp: pm_nl_ctl: fix 32-bit support
  selftests: mptcp: depend on SYN_COOKIES
  selftests: mptcp: userspace_pm: report errors with 'remove' tests
  selftests: mptcp: userspace_pm: use correct server port
  selftests: mptcp: sockopt: return error if wrong mark
  selftests: mptcp: sockopt: use 'iptables-legacy' if available
  selftests: mptcp: connect: fail if nft supposed to work
  mptcp: do not rely on implicit state check in mptcp_listen()
  mptcp: ensure subflow is unhashed before cleaning the backlog
  s390/qeth: Fix vipa deletion
  octeontx-af: fix hardware timestamp configuration
  net: dsa: sja1105: always enable the send_meta options
  net: dsa: tag_sja1105: fix MAC DA patching from meta frames
  net: Replace strlcpy with strscpy
  pptp: Fix fib lookup calls.
  mlxsw: spectrum_router: Fix an IS_ERR() vs NULL check
  net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
  xsk: Honor SO_BINDTODEVICE on bind
  ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported
  ...
2023-07-05 15:44:45 -07:00
Steven Rostedt (Google) fddca7db4a tracing/boot: Test strscpy() against less than zero for error
Instead of checking for -E2BIG, it is better to just check for less than
zero of strscpy() for error. Testing for -E2BIG is not very robust, and
the calling code does not really care about the error code, just that
there was an error.

One of the updates to convert strlcpy() to strscpy() had a v2 version
that changed the test from testing against -E2BIG to less than zero, but I
took the v1 version that still tested for -E2BIG.

Link: https://lore.kernel.org/linux-trace-kernel/20230615180420.400769-1-azeemshaikh38@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230704100807.707d1605@rorschach.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Azeem Shaikh <azeemshaikh38@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-05 10:30:49 -04:00
Mateusz Stachyra 02b0095e2f tracing: Fix null pointer dereference in tracing_err_log_open()
Fix an issue in function 'tracing_err_log_open'.
The function doesn't call 'seq_open' if the file is opened only with
write permissions, which results in 'file->private_data' being left as null.
If we then use 'lseek' on that opened file, 'seq_lseek' dereferences
'file->private_data' in 'mutex_lock(&m->lock)', resulting in a kernel panic.
Writing to this node requires root privileges, therefore this bug
has very little security impact.

Tracefs node: /sys/kernel/tracing/error_log

Example Kernel panic:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
Call trace:
 mutex_lock+0x30/0x110
 seq_lseek+0x34/0xb8
 __arm64_sys_lseek+0x6c/0xb8
 invoke_syscall+0x58/0x13c
 el0_svc_common+0xc4/0x10c
 do_el0_svc+0x24/0x98
 el0_svc+0x24/0x88
 el0t_64_sync_handler+0x84/0xe4
 el0t_64_sync+0x1b4/0x1b8
Code: d503201f aa0803e0 aa1f03e1 aa0103e9 (c8e97d02)
---[ end trace 561d1b49c12cf8a5 ]---
Kernel panic - not syncing: Oops: Fatal exception

Link: https://lore.kernel.org/linux-trace-kernel/20230703155237eucms1p4dfb6a19caa14c79eb6c823d127b39024@eucms1p4
Link: https://lore.kernel.org/linux-trace-kernel/20230704102706eucms1p30d7ecdcc287f46ad67679fc8491b2e0f@eucms1p3

Cc: stable@vger.kernel.org
Fixes: 8a062902be ("tracing: Add tracing error log")
Signed-off-by: Mateusz Stachyra <m.stachyra@samsung.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-05 09:45:45 -04:00
Hou Tao cf6eeb8f9d bpf: Remove unnecessary ring buffer size check
The theoretical maximum size of ring buffer is about 64GB, but now the
size of ring buffer is specified by max_entries in bpf_attr and its
maximum value is (4GB - 1), and it won't be possible for overflow.

So just remove the unnecessary size check in ringbuf_map_alloc() but
keep the comments for possible extension in future.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/9c636a63-1f3d-442d-9223-96c2dccb9469@moroto.mountain
Link: https://lore.kernel.org/bpf/20230704074014.216616-1-houtao@huaweicloud.com
2023-07-05 14:09:45 +02:00
Linus Torvalds f196220715 module: fix init_module_from_file() error handling
Vegard Nossum pointed out two different problems with the error handling
in init_module_from_file():

 (a) the idempotent loading code didn't clean up properly in some error
     cases, leaving the on-stack 'struct idempotent' element still in
     the hash table

 (b) failure to read the module file would nonsensically update the
     'invalid_kread_bytes' stat counter with the error value

The first error is quite nasty, in that it can then cause subsequent
idempotent loads of that same file to access stale stack contents of the
previous failure.  The case may not happen in any normal situation
(explaining all the "Tested-by's on the original change), and requires
admin privileges, but syzkaller triggers random bad behavior as a
result:

    BUG: soft lockup in sys_finit_module
    BUG: unable to handle kernel paging request in init_module_from_file
    general protection fault in init_module_from_file
    INFO: task hung in init_module_from_file
    KASAN: out-of-bounds Read in init_module_from_file
    KASAN: slab-out-of-bounds Read in init_module_from_file
    ...

The second error is fairly benign and just leads to nonsensical stats
(and has been around since the debug stats were added).

Vegard also provided a patch for the idempotent loading issue, but I'd
rather re-organize the code and make it more legible using another level
of helper functions than add the usual "goto out" error handling.

Link: https://lore.kernel.org/lkml/20230704100852.23452-1-vegard.nossum@oracle.com/
Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzbot+9c2bdc9d24e4a7abe741@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-04 10:17:11 -07:00
Linus Torvalds eded37770c kgdb patches for 6.5
Fairly small changes this cycle.
 
  * An additional static inline function when kgdb is not enabled to
    reduce boilerplate in arch files.
  * kdb will now handle input with linefeeds more like carriage return.
    This will make little difference for interactive use but can make
    it script to use expect-like interaction with kdb.
  * A couple of warning fixes.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmSiiJwACgkQfOMlXTn3
 iKFAyQ//U+/08u01ikY0o8h//l7ELX4L7SUEJim1xE9Qk6DVF6R5c8+G6E0kWS5M
 ClKBf8J3YdjM2UvysNmVmh3/Sv1JP2GLT7YUoKNBYv+O8kveGy2N/vE6cYfoHK3n
 1uOXtIxQ3IafjlgNF+FTTIYm2PDqDhYZTTYIKGlEDnwL3mjEP3iV+hNP2debcLvT
 3AWeLcrvzni5KQ301kXqw4SmrgNAeLDKwKKQvDnho8EdAuMDmHTbmUt6Zgvvl6TN
 rH7nlxcUoDW+M1nMNxQ4Ko7xgEbqhpNQnkOIDv/j4j+KObQqvAVOF4dj9thBZUje
 IEPQD0O/rAeYDpnKTIx1bmaVkYIBIpiSokRKlHWpz3iHxU52M+cOXe6u2vMHj97t
 dHTc42TwcBQHSxCQ+R9i1pv6G7i2BfwUSju7opVqKISolwphp9Z4oIZnHwLtWpba
 P5Tv7Azv6nw1HoYl2tAP/2xNO2u7w1nfqUzfr+/XXFTcZGRt65084HDbRQSzjmiJ
 6F5Qw7fwOTNUHEObybjL/A4AilIRK8sBl/vU3TXJUEf3CEryH3VK4lE7qlqKn75G
 nPplXMiBePbSTFE4aIJDdQFA4l3AOffg4w2Ce8v06eytr9EaNh8vS9W8Pv9dxwAz
 ZQrItOPO4i+1Sm9n/CvsPkUDVt3wYbb/QnPdQHsbGOoDN9H3JeI=
 =wAuD
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Fairly small changes this cycle:

   - An additional static inline function when kgdb is not enabled to
     reduce boilerplate in arch files

   - kdb will now handle input with linefeeds more like carriage return.
     This will make little difference for interactive use but can make
     it script to use expect-like interaction with kdb

   - A couple of warning fixes"

* tag 'kgdb-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: move kdb_send_sig() declaration to a better header file
  kdb: Handle LF in the command parser
  kdb: include kdb_private.h for function prototypes
  kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
2023-07-03 15:19:56 -07:00
SeongJae Park 3de4d22cc9 bpf, btf: Warn but return no error for NULL btf from __register_btf_kfunc_id_set()
__register_btf_kfunc_id_set() assumes .BTF to be part of the module's .ko
file if CONFIG_DEBUG_INFO_BTF is enabled. If that's not the case, the
function prints an error message and return an error. As a result, such
modules cannot be loaded.

However, the section could be stripped out during a build process. It would
be better to let the modules loaded, because their basic functionalities
have no problem [0], though the BTF functionalities will not be supported.
Make the function to lower the level of the message from error to warn, and
return no error.

  [0] https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion

Fixes: c446fdacb1 ("bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF")
Reported-by: Alexander Egorenkov <Alexander.Egorenkov@ibm.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/87y228q66f.fsf@oc8242746057.ibm.com
Link: https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion
Link: https://lore.kernel.org/bpf/20230701171447.56464-1-sj@kernel.org
2023-07-03 18:48:09 +02:00
Daniel Thompson b6464883f4 kdb: move kdb_send_sig() declaration to a better header file
kdb_send_sig() is defined in the signal code and called from kdb,
but the declaration is part of the kdb internal code.
Move the declaration to the shared header to avoid the warning:

kernel/signal.c:4789:6: error: no previous prototype for 'kdb_send_sig' [-Werror=missing-prototypes]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/lkml/20230517125423.930967-1-arnd@kernel.org/
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20230630201206.2396930-1-daniel.thompson@linaro.org
2023-07-03 09:27:12 +01:00
Linus Torvalds ad2885979e Kbuild updates for v6.5
- Remove the deprecated rule to build *.dtbo from *.dts
 
  - Refactor section mismatch detection in modpost
 
  - Fix bogus ARM section mismatch detections
 
  - Fix error of 'make gtags' with O= option
 
  - Add Clang's target triple to KBUILD_CPPFLAGS to fix a build error with
    the latest LLVM version
 
  - Rebuild the built-in initrd when KBUILD_BUILD_TIMESTAMP is changed
 
  - Ignore more compiler-generated symbols for kallsyms
 
  - Fix 'make local*config' to handle the ${CONFIG_FOO} form in Makefiles
 
  - Enable more kernel-doc warnings with W=2
 
  - Refactor <linux/export.h> by generating KSYMTAB data by modpost
 
  - Deprecate <asm/export.h> and <asm-generic/export.h>
 
  - Remove the EXPORT_DATA_SYMBOL macro
 
  - Move the check for static EXPORT_SYMBOL back to modpost, which makes
    the build faster
 
  - Re-implement CONFIG_TRIM_UNUSED_KSYMS with one-pass algorithm
 
  - Warn missing MODULE_DESCRIPTION when building modules with W=1
 
  - Make 'make clean' robust against too long argument error
 
  - Exclude more objects from GCOV to fix CFI failures with GCOV
 
  - Allow 'make modules_install' to install modules.builtin and
    modules.builtin.modinfo even when CONFIG_MODULES is disabled
 
  - Include modules.builtin and modules.builtin.modinfo in the linux-image
    Debian package even when CONFIG_MODULES is disabled
 
  - Revive "Entering directory" logging for the latest Make version
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmSf6B0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGS2wP/1izzNJ/64XmQoyBDhZCbuOl7ODF
 n4wgVJnsJmRnD/RxXR/AZ0JZwQHhzpGISWQM61rVIf/RVFOB7Apx1HpmomKUUjrL
 Yc53wLfhTEizGgwttP6tusLM3RO6jkuMKhjC4rllc0tDLJ3zCcwAjSyiOQQ9PBcH
 txwAb8r4/TZUzDDCJ0d98WdhIsNDca/ISeRXKHMiIkfvHe+6yizDKu25Y4B6BL5g
 0VPJ9nVJZ+XVwRqdVR+UQoPYGZzZ/O2NqAtU7n4PpBKvFfLACILJW+aBDAz9SqN7
 RSxn1ahxwq0vrhlB9bSrQRj3N0g8zsi7/xShEZSnGLCbyxYilr5Gq8C59+QxOIJf
 5lGBwZlEgn5aWH+D9abwjEI/QOQbTI9kX09sVzweulGCN9iJlJqyIGsB0Ri0/S2R
 c/n7c8nLwnWnGF/+LXYvkrak8L9YRKori//YYf9zdvh4h1c2/0SS0nDoC29DhDru
 Am7YmhBAkJXXX3NUB2gLvtdp94GSumqefHeSJ5Sp9v/+f2Ft7ruY2ouJC81xDa4p
 nNpvolAq2txlZ9t5OU7x7DQiuCWYSws0W7PJ9FBhyHJchf21UHbcm97/HfDoU8rN
 ioLQGm+h+g6oZt8pArk45wccjkR3ydpEFDWenYbTEr2o3zLfeKigZps5uhCK3DW2
 gnVk50VNagkzrzvA
 =Rc1z
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Remove the deprecated rule to build *.dtbo from *.dts

 - Refactor section mismatch detection in modpost

 - Fix bogus ARM section mismatch detections

 - Fix error of 'make gtags' with O= option

 - Add Clang's target triple to KBUILD_CPPFLAGS to fix a build error
   with the latest LLVM version

 - Rebuild the built-in initrd when KBUILD_BUILD_TIMESTAMP is changed

 - Ignore more compiler-generated symbols for kallsyms

 - Fix 'make local*config' to handle the ${CONFIG_FOO} form in Makefiles

 - Enable more kernel-doc warnings with W=2

 - Refactor <linux/export.h> by generating KSYMTAB data by modpost

 - Deprecate <asm/export.h> and <asm-generic/export.h>

 - Remove the EXPORT_DATA_SYMBOL macro

 - Move the check for static EXPORT_SYMBOL back to modpost, which makes
   the build faster

 - Re-implement CONFIG_TRIM_UNUSED_KSYMS with one-pass algorithm

 - Warn missing MODULE_DESCRIPTION when building modules with W=1

 - Make 'make clean' robust against too long argument error

 - Exclude more objects from GCOV to fix CFI failures with GCOV

 - Allow 'make modules_install' to install modules.builtin and
   modules.builtin.modinfo even when CONFIG_MODULES is disabled

 - Include modules.builtin and modules.builtin.modinfo in the
   linux-image Debian package even when CONFIG_MODULES is disabled

 - Revive "Entering directory" logging for the latest Make version

* tag 'kbuild-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (72 commits)
  modpost: define more R_ARM_* for old distributions
  kbuild: revive "Entering directory" for Make >= 4.4.1
  kbuild: set correct abs_srctree and abs_objtree for package builds
  scripts/mksysmap: Ignore prefixed KCFI symbols
  kbuild: deb-pkg: remove the CONFIG_MODULES check in buildeb
  kbuild: builddeb: always make modules_install, to install modules.builtin*
  modpost: continue even with unknown relocation type
  modpost: factor out Elf_Sym pointer calculation to section_rel()
  modpost: factor out inst location calculation to section_rel()
  kbuild: Disable GCOV for *.mod.o
  kbuild: Fix CFI failures with GCOV
  kbuild: make clean rule robust against too long argument error
  script: modpost: emit a warning when the description is missing
  kbuild: make modules_install copy modules.builtin(.modinfo)
  linux/export.h: rename 'sec' argument to 'license'
  modpost: show offset from symbol for section mismatch warnings
  modpost: merge two similar section mismatch warnings
  kbuild: implement CONFIG_TRIM_UNUSED_KSYMS without recursion
  modpost: use null string instead of NULL pointer for default namespace
  modpost: squash sym_update_namespace() into sym_add_exported()
  ...
2023-07-01 09:24:31 -07:00
Linus Torvalds d25f002575 cxl for v6.5
- Add infrastructure for supporting background commands along with
   support for device sanitization and firmware update
 
 - Introduce a CXL performance monitoring unit driver based on the common
   definition in the specification.
 
 - Land some preparatory cleanup and refactoring for the anticipated
   arrival of CXL type-2 (accelerator devices) and CXL RCH (CXL-v1.1
   topology) error handling.
 
 - Rework CPU cache management with respect to region configuration
   (device hotplug or other dynamic changes to memory interleaving)
 
 - Fix region reconfiguration vs CXL decoder ordering rules.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZJ9fkQAKCRDfioYZHlFs
 ZyWcAP9THJ6ZzX1mbAfHhPz9r+oxsrE3l1jQpNjNbh7MNW29MAEA36dmTE62JaHK
 lTPDgHxqBt1vrHPktYWOM9ZPHE2tLwA=
 =3fFL
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL updates from Dan Williams:
 "The highlights in terms of new functionality are support for the
  standard CXL Performance Monitor definition that appeared in CXL 3.0,
  support for device sanitization (wiping all data from a device),
  secure-erase (re-keying encryption of user data), and support for
  firmware update. The firmware update support is notable as it reuses
  the simple sysfs_upload interface to just cat(1) a blob to a sysfs
  file and pipe that to the device.

  Additionally there are a substantial number of cleanups and
  reorganizations to get ready for RCH error handling (RCH == Restricted
  CXL Host == current shipping hardware generation / pre CXL-2.0
  topologies) and type-2 (accelerator / vendor specific) devices.

  For vendor specific devices they implement a subset of what the
  generic type-3 (generic memory expander) driver expects. As a result
  the rework decouples optional infrastructure from the core driver
  context.

  For RCH topologies, where the specification working group did not want
  to confuse pre-CXL-aware operating systems, many of the standard
  registers are hidden which makes support standard bus features like
  AER (PCIe Advanced Error Reporting) difficult. The rework arranges for
  the driver to help the PCI-AER core. Bjorn is on board with this
  direction but a late regression disocvery means the completion of this
  functionality needs to cook a bit longer, so it is code
  reorganizations only for now.

  Summary:

   - Add infrastructure for supporting background commands along with
     support for device sanitization and firmware update

   - Introduce a CXL performance monitoring unit driver based on the
     common definition in the specification.

   - Land some preparatory cleanup and refactoring for the anticipated
     arrival of CXL type-2 (accelerator devices) and CXL RCH (CXL-v1.1
     topology) error handling.

   - Rework CPU cache management with respect to region configuration
     (device hotplug or other dynamic changes to memory interleaving)

   - Fix region reconfiguration vs CXL decoder ordering rules"

* tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (51 commits)
  cxl: Fix one kernel-doc comment
  cxl/pci: Use correct flag for sanitize polling
  docs: perf: Minimal introduction the the CXL PMU device and driver
  perf: CXL Performance Monitoring Unit driver
  tools/testing/cxl: add firmware update emulation to CXL memdevs
  tools/testing/cxl: Use named effects for the Command Effect Log
  tools/testing/cxl: Fix command effects for inject/clear poison
  cxl: add a firmware update mechanism using the sysfs firmware loader
  cxl/test: Add Secure Erase opcode support
  cxl/mem: Support Secure Erase
  cxl/test: Add Sanitize opcode support
  cxl/mem: Wire up Sanitization support
  cxl/mbox: Add sanitization handling machinery
  cxl/mem: Introduce security state sysfs file
  cxl/mbox: Allow for IRQ_NONE case in the isr
  Revert "cxl/port: Enable the HDM decoder capability for switch ports"
  cxl/memdev: Formalize endpoint port linkage
  cxl/pci: Unconditionally unmask 256B Flit errors
  cxl/region: Manage decoder target_type at decoder-attach time
  cxl/hdm: Default CXL_DEVTYPE_DEVMEM decoders to CXL_DECODER_DEVMEM
  ...
2023-07-01 08:58:41 -07:00
Christian Brauner dd546618ba pid: use struct_size_t() helper
Before commit d67790ddf0 ("overflow: Add struct_size_t() helper") only
struct_size() existed, which expects a valid pointer instance containing
the flexible array.

However, when we determine the default struct pid allocation size for
the associated kmem cache of a pid namespace we need to take the nesting
depth of the pid namespace into account without an variable instance
necessarily being available.

In commit b69f0aeb06 ("pid: Replace struct pid 1-element array with
flex-array") we used to handle this the old fashioned way and cast NULL
to a struct pid pointer type. However, we do apparently have a dedicated
struct_size_t() helper for exactly this case. So switch to that.

Suggested-by: Kees Cook <keescook@chromium.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-01 08:26:23 -07:00
Linus Torvalds f4ce392b03 Livepatching changes for 6.5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmSeigUACgkQUqAMR0iA
 lPIMbxAAqnbFKTDZ0alQibPlPulEvDsdVnh0FYLpWRW6sZDG96Lr/s3i1ghWxIUI
 cJrePc7i8FJBz6cDMvT6e9TOqoP3iYU9AuQdWr1aoN37O3gJO/3AvHN89CFy6tE3
 scV/Uo2sM5RCQWptutpd2nu1pFLKYWAKegwX0g0+KkyDINDzDA54iSd6HAwfDUeu
 DEO2cp+kUgBKDOcGWEHQHQX5DUs1P7lJDBt5aTBq2KzAYSaUls/HBt1L2T9JYV/D
 yHE3vjmvniug3sYDzXd5aXA+yvIjIsdjqqxMpN7uz8e/q+Mx3045WM1aPCevPq9x
 V7jkc61CQcDURjUdHaZw8jMCX7kDjNkTzkt0OvATqqoL4wqfcohVExI5KUtf6mOq
 NAhsaPL4E+pGOa71uXenhbLClYjfpLfpHRxNHT8kh7O+DTMprt7srp0174vnrZTQ
 Gsfh2BfCoojZxt1SPJWDl1rBaAtZ7EZTmt0oV1opbTgZS1drDKrlbmHQs44oBDeg
 Uqy0uAB6kI92BzPbS66ZwHup3Tg2uz98YwCTMQ1dTuLSFzQNOju6h38BKRzMKkkz
 nWF/AQA89TKOgaOZG3HGVadru7zk1P5EvnOD2+iMI3Arw9ClHMu9wAMoNxNcuYTO
 i+h/hrtenmeH3owA+HVsR/lMnJhnMlxRRfan0pIzUfICIiYlZaw=
 =macg
 -----END PGP SIGNATURE-----

Merge tag 'livepatching-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching update from Petr Mladek:

 - Make a variable static to fix a sparse warning

* tag 'livepatching-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Make 'klp_stack_entries' static
2023-06-30 21:40:44 -07:00
Linus Torvalds d2a6fd45c5 Probes updates for v6.5:
- fprobe: Pass return address to the fprobe entry/exit callbacks so that
   the callbacks don't need to analyze pt_regs/stack to find the function
   return address.
 
 - kprobe events: cleanup usage of TPARG_FL_FENTRY and TPARG_FL_RETURN
   flags so that those are not set at once.
 
 - fprobe events:
  . Add a new fprobe events for tracing arbitrary function entry and
    exit as a trace event.
  . Add a new tracepoint events for tracing raw tracepoint as a trace
    event. This allows user to trace non user-exposed tracepoints.
  . Move eprobe's event parser code into probe event common file.
  . Introduce BTF (BPF type format) support to kernel probe (kprobe,
    fprobe and tracepoint probe) events so that user can specify traced
    function arguments by name. This also applies the type of argument
    when fetching the argument.
  . Introduce '$arg*' wildcard support if BTF is available. This expands
    the '$arg*' meta argument to all function argument automatically.
  . Check the return value types by BTF. If the function returns 'void',
    '$retval' is rejected.
  . Add some selftest script for fprobe events, tracepoint events and
    BTF support.
  . Update documentation about the fprobe events.
  . Some fixes for above features, document and selftests.
 
 - selftests for ftrace (except for new fprobe events):
  . Add a test case for multiple consecutive probes in a function which
    checks if ftrace based kprobe, optimized kprobe and normal kprobe
    can be defined in the same target function.
  . Add a test case for optimized probe, which checks whether kprobe
    can be optimized or not.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmSa+9MACgkQ2/sHvwUr
 PxsmOAgAmUOIWtvH5py7AZpIRhCj8B18F6KnT7w2hByCsRxf7SaCqMhpBCk9VnYv
 9fJFBHpvYRJEmpHoH3o2ET5AGfKVNac9z96AGI2qJ4ECWITd6I5+WfTdZ5ueVn2d
 f6DQ10mHXDHSMFbuqfYWSHtkeivJpWpUNHhwzPb4doNOe06bZNfVuSgnksFg1at5
 kq16HbvGnhPzdO4YHmvqwjmRHr5/nCI1KDE9xIBcqNtWFbiRigC11zaZEUkLX+vT
 F63ShyfCK718AiwDfnjXpGkXAiVOZuAIR8RELaSqQ92YHCFKq5k9K4++WllPR5f9
 AxjVultFDiCd4oSPgYpQkjuZdFq9NA==
 =IhmY
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - fprobe: Pass return address to the fprobe entry/exit callbacks so
   that the callbacks don't need to analyze pt_regs/stack to find the
   function return address.

 - kprobe events: cleanup usage of TPARG_FL_FENTRY and TPARG_FL_RETURN
   flags so that those are not set at once.

 - fprobe events:
      - Add a new fprobe events for tracing arbitrary function entry and
        exit as a trace event.
      - Add a new tracepoint events for tracing raw tracepoint as a
        trace event. This allows user to trace non user-exposed
        tracepoints.
      - Move eprobe's event parser code into probe event common file.
      - Introduce BTF (BPF type format) support to kernel probe (kprobe,
        fprobe and tracepoint probe) events so that user can specify
        traced function arguments by name. This also applies the type of
        argument when fetching the argument.
      - Introduce '$arg*' wildcard support if BTF is available. This
        expands the '$arg*' meta argument to all function argument
        automatically.
      - Check the return value types by BTF. If the function returns
        'void', '$retval' is rejected.
      - Add some selftest script for fprobe events, tracepoint events
        and BTF support.
      - Update documentation about the fprobe events.
      - Some fixes for above features, document and selftests.

 - selftests for ftrace (in addition to the new fprobe events):
      - Add a test case for multiple consecutive probes in a function
        which checks if ftrace based kprobe, optimized kprobe and normal
        kprobe can be defined in the same target function.
      - Add a test case for optimized probe, which checks whether kprobe
        can be optimized or not.

* tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix tracepoint event with $arg* to fetch correct argument
  Documentation: Fix typo of reference file name
  tracing/probes: Fix to return NULL and keep using current argc
  selftests/ftrace: Add new test case which checks for optimized probes
  selftests/ftrace: Add new test case which adds multiple consecutive probes in a function
  Documentation: tracing/probes: Add fprobe event tracing document
  selftests/ftrace: Add BTF arguments test cases
  selftests/ftrace: Add tracepoint probe test case
  tracing/probes: Add BTF retval type support
  tracing/probes: Add $arg* meta argument for all function args
  tracing/probes: Support function parameters if BTF is available
  tracing/probes: Move event parameter fetching code to common parser
  tracing/probes: Add tracepoint support on fprobe_events
  selftests/ftrace: Add fprobe related testcases
  tracing/probes: Add fprobe events for tracing function entry and exit.
  tracing/probes: Avoid setting TPARG_FL_FENTRY and TPARG_FL_RETURN
  fprobe: Pass return address to the handlers
2023-06-30 10:44:53 -07:00
Linus Torvalds cccf0c2ee5 Tracing updates for 6.5:
- Add new feature to have function graph tracer record the return value.
   Adds a new option: funcgraph-retval ; when set, will show the return
   value of a function in the function graph tracer.
 
 - Also add the option: funcgraph-retval-hex where if it is not set, and
   the return value is an error code, then it will return the decimal of
   the error code, otherwise it still reports the hex value.
 
 - Add the file /sys/kernel/tracing/osnoise/per_cpu/cpu<cpu>/timerlat_fd
   That when a application opens it, it becomes the task that the timer lat
   tracer traces. The application can also read this file to find out how
   it's being interrupted.
 
 - Add the file /sys/kernel/tracing/available_filter_functions_addrs
   that works just the same as available_filter_functions but also shows
   the addresses of the functions like kallsyms, except that it gives the
   address of where the fentry/mcount jump/nop is. This is used by BPF to
   make it easier to attach BPF programs to ftrace hooks.
 
 - Replace strlcpy with strscpy in the tracing boot code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZJy6ixQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnzRAPsEI2YgjaJSHnuPoGRHbrNil6pq66wY
 LYaLizGI4Jv9BwEAqdSdcYcMiWo1SFBAO8QxEDM++BX3zrRyVgW8ahaTNgs=
 =TF0C
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Add new feature to have function graph tracer record the return
   value. Adds a new option: funcgraph-retval ; when set, will show the
   return value of a function in the function graph tracer.

 - Also add the option: funcgraph-retval-hex where if it is not set, and
   the return value is an error code, then it will return the decimal of
   the error code, otherwise it still reports the hex value.

 - Add the file /sys/kernel/tracing/osnoise/per_cpu/cpu<cpu>/timerlat_fd
   That when a application opens it, it becomes the task that the timer
   lat tracer traces. The application can also read this file to find
   out how it's being interrupted.

 - Add the file /sys/kernel/tracing/available_filter_functions_addrs
   that works just the same as available_filter_functions but also shows
   the addresses of the functions like kallsyms, except that it gives
   the address of where the fentry/mcount jump/nop is. This is used by
   BPF to make it easier to attach BPF programs to ftrace hooks.

 - Replace strlcpy with strscpy in the tracing boot code.

* tag 'trace-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix warnings when building htmldocs for function graph retval
  riscv: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
  tracing/boot: Replace strlcpy with strscpy
  tracing/timerlat: Add user-space interface
  tracing/osnoise: Skip running osnoise if all instances are off
  tracing/osnoise: Switch from PF_NO_SETAFFINITY to migrate_disable
  ftrace: Show all functions with addresses in available_filter_functions_addrs
  selftests/ftrace: Add funcgraph-retval test case
  LoongArch: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
  x86/ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
  arm64: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
  tracing: Add documentation for funcgraph-retval and funcgraph-retval-hex
  function_graph: Support recording and printing the return value of function
  fgraph: Add declaration of "struct fgraph_ret_regs"
2023-06-30 10:33:17 -07:00
Linus Torvalds 533925cb76 RISC-V Patches for the 6.5 Merge Window, Part 1
* Support for ACPI.
 * Various cleanups to the ISA string parsing, including making them
   case-insensitive
 * Support for the vector extension.
 * Support for independent irq/softirq stacks.
 * Our CPU DT binding now has "unevaluatedProperties: false"
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmSe70ATHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYiWNPD/0ZfSdQ0A/gMVOzAD4zFKPEqQ6ffW2V
 Zy6Jo7UDNqKsiai7QA4XB1uyYIv/y1yUKJ0oeBVcA9Nzyq+TW9QDcApDBTabxAUI
 agY19YKw6VVZ+p7I9sMsf6EbdJdkNfSAzcQACPxb4ScEoaf9X+oAK5qgXuRuWluh
 qQuVkkJlgWc/t1cuUkrRdJmHQYvjP3zL7z4o344q2IVpXJkNNu0GeP+HbF8BYKcA
 +I/TTA5JY3kCIaxkpF2rU6pE6T5T9xrPmRYZ7bZoPUPnbL+M8As/jx3ym52Y4WGp
 kf8pgkxixOjU64kVJOH66CA8GaOiaAH/ptjQb0ZmCaGrHhr7aOT9HrkX4rU1lS8T
 stPphfM4gGPcCoPgRqSl+mEhBzjII8maOBLtbricAoQi6efRq8fzoOGaif/QpCbc
 6n0LGS4nQPGVyD3rAPfHxxfrlGJR+SsgyDvjZoDhqauFglims14GnK+eBeO8zrui
 Aj/uuAS63VIYprJWC1NOBJlU2WKZiOGhCANpZ6W6SH21PYn2WjsVILqaGh+WN8ZO
 KOHxZNaN8fQag0Yg7oNAUb7l6S0DHYtJIksFnFW2Rf2+VT58RAMYRQbpbhr7Tqr+
 jLgIR8PkFrBERHE49IqLGhAxGDnNzAUysMRw9pIk7WIre2Jt4wPqUdl+ee+5ErIX
 jiYfSFZw9q28UA==
 =Fpq8
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.5-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for ACPI

 - Various cleanups to the ISA string parsing, including making them
   case-insensitive

 - Support for the vector extension

 - Support for independent irq/softirq stacks

 - Our CPU DT binding now has "unevaluatedProperties: false"

* tag 'riscv-for-linus-6.5-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (78 commits)
  riscv: hibernate: remove WARN_ON in save_processor_state
  dt-bindings: riscv: cpus: switch to unevaluatedProperties: false
  dt-bindings: riscv: cpus: add a ref the common cpu schema
  riscv: stack: Add config of thread stack size
  riscv: stack: Support HAVE_SOFTIRQ_ON_OWN_STACK
  riscv: stack: Support HAVE_IRQ_EXIT_ON_IRQ_STACK
  RISC-V: always report presence of extensions formerly part of the base ISA
  dt-bindings: riscv: explicitly mention assumption of Zicntr & Zihpm support
  RISC-V: remove decrement/increment dance in ISA string parser
  RISC-V: rework comments in ISA string parser
  RISC-V: validate riscv,isa at boot, not during ISA string parsing
  RISC-V: split early & late of_node to hartid mapping
  RISC-V: simplify register width check in ISA string parsing
  perf: RISC-V: Limit the number of counters returned from SBI
  riscv: replace deprecated scall with ecall
  riscv: uprobes: Restore thread.bad_cause
  riscv: mm: try VMA lock-based page fault handling first
  riscv: mm: Pre-allocate PGD entries for vmalloc/modules area
  RISC-V: hwprobe: Expose Zba, Zbb, and Zbs
  RISC-V: Track ISA extensions per hart
  ...
2023-06-30 09:37:26 -07:00
Linus Torvalds d8b0bd57c2 powerpc updates for 6.5
- Extend KCSAN support to 32-bit and BookE. Add some KCSAN annotations.
 
  - Make ELFv2 ABI the default for 64-bit big-endian kernel builds, and use
    the -mprofile-kernel option (kernel specific ftrace ABI) for big endian
    ELFv2 kernels.
 
  - Add initial Dynamic Execution Control Register (DEXCR) support, and allow
    the ROP protection instructions to be used on Power 10.
 
  - Various other small features and fixes.
 
 Thanks to: Aditya Gupta, Aneesh Kumar K.V, Benjamin Gray, Brian King,
 Christophe Leroy, Colin Ian King, Dmitry Torokhov, Gaurav Batra, Jean Delvare,
 Joel Stanley, Marco Elver, Masahiro Yamada, Nageswara R Sastry, Nathan
 Chancellor, Naveen N Rao, Nayna Jain, Nicholas Piggin, Paul Gortmaker, Randy
 Dunlap, Rob Herring, Rohan McLure, Russell Currey, Sachin Sant, Timothy
 Pearson, Tom Rix, Uwe Kleine-König.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmSeqQMTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgKukD/sGUceX6gIc7UcjWhL1ZCVco0bsgLjT
 JrY1NenisGKjwwRd/o+2+h3ziJDoO5AsQfT72EiNLhaYJhnlb1d0vXzsvN0THc+2
 W5RrxAZUNhBy+c7gSSEJjy8+vBIwSQAliQLChHGOSejGCj94SxF5+zjUFvSX458I
 z0+ZQK+Fiw5NcpzEnBT0XPnLzap74a7TL0JcG1MLbj2QtHXhbfjIlkkPDX3kK0Gw
 xbelFy38X7KKbQsXXYSTCGqwRdJ3yqu21nEsjRuo2yT5H5rQbjCNggkMOL1DecDd
 ULGxit/z13Pt1Ad3oe+6FF17ggOiCG0F75DONASjFDthFYx6NQffkJS1n1VZauQj
 jU6LtWeD3HkgIYm6Udjq+LaSmkAmn5a+9tsElE/K+V1WG4rKyMeVmE3z/tCJG0l2
 yhLKyFs+glXN/LiWHyX0mrQIIVZdRK237X1qXJuIvAuB7Drm5duXFAHR8pdJD0dg
 H23OhoO2FvLxb9GvnzGxqjdazzattctz31wU/1RgnPxumYkJ9PlBcXn9h1uXa8/K
 rDZFJADsQhEfRCjmLG3GIaFWqZdc4Cn+ZUk4iHkjPDFDL05Fq7JYHIuwteiN6/wP
 NHRvtKdJisu583NI9RN9300JykrEqjSRbMOWlc3vuFwbRLGioXvWhWlIZ3/t58jG
 R8s+f0nKSPr+fg==
 =ssit
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Extend KCSAN support to 32-bit and BookE. Add some KCSAN annotations

 - Make ELFv2 ABI the default for 64-bit big-endian kernel builds, and
   use the -mprofile-kernel option (kernel specific ftrace ABI) for big
   endian ELFv2 kernels

 - Add initial Dynamic Execution Control Register (DEXCR) support, and
   allow the ROP protection instructions to be used on Power 10

 - Various other small features and fixes

Thanks to Aditya Gupta, Aneesh Kumar K.V, Benjamin Gray, Brian King,
Christophe Leroy, Colin Ian King, Dmitry Torokhov, Gaurav Batra, Jean
Delvare, Joel Stanley, Marco Elver, Masahiro Yamada, Nageswara R Sastry,
Nathan Chancellor, Naveen N Rao, Nayna Jain, Nicholas Piggin, Paul
Gortmaker, Randy Dunlap, Rob Herring, Rohan McLure, Russell Currey,
Sachin Sant, Timothy Pearson, Tom Rix, and Uwe Kleine-König.

* tag 'powerpc-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (76 commits)
  powerpc: remove checks for binutils older than 2.25
  powerpc: Fail build if using recordmcount with binutils v2.37
  powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory
  powerpc/iommu: Only build sPAPR access functions on pSeries
  powerpc: powernv: Annotate data races in opal events
  powerpc: Mark writes registering ipi to host cpu through kvm and polling
  powerpc: Annotate accesses to ipi message flags
  powerpc: powernv: Fix KCSAN datarace warnings on idle_state contention
  powerpc: Mark [h]ssr_valid accesses in check_return_regs_valid
  powerpc: qspinlock: Enforce qnode writes prior to publishing to queue
  powerpc: qspinlock: Mark accesses to qnode lock checks
  powerpc/powernv/pci: Remove last IODA1 defines
  powerpc/powernv/pci: Remove MVE code
  powerpc/powernv/pci: Remove ioda1 support
  powerpc: 52xx: Make immr_id DT match tables static
  powerpc: mpc512x: Remove open coded "ranges" parsing
  powerpc: fsl_soc: Use of_range_to_resource() for "ranges" parsing
  powerpc: fsl: Use of_property_read_reg() to parse "reg"
  powerpc: fsl_rio: Use of_range_to_resource() for "ranges" parsing
  macintosh: Use of_property_read_reg() to parse "reg"
  ...
2023-06-30 09:20:08 -07:00
Kees Cook b69f0aeb06 pid: Replace struct pid 1-element array with flex-array
For pid namespaces, struct pid uses a dynamically sized array member,
"numbers".  This was implemented using the ancient 1-element fake
flexible array, which has been deprecated for decades.

Replace it with a C99 flexible array, refactor the array size
calculations to use struct_size(), and address elements via indexes.
Note that the static initializer (which defines a single element) works
as-is, and requires no special handling.

Without this, CONFIG_UBSAN_BOUNDS (and potentially
CONFIG_FORTIFY_SOURCE) will trigger bounds checks:

  https://lore.kernel.org/lkml/20230517-bushaltestelle-super-e223978c1ba6@brauner

Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Reported-by: syzbot+ac3b41786a2d0565b6d5@syzkaller.appspotmail.com
[brauner: dropped unrelated changes and remove 0 with NULL cast]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-30 09:04:01 -07:00
Douglas Anderson 1ed0555850 kdb: Handle LF in the command parser
The main kdb command parser only handles CR (ASCII 13 AKA '\r') today,
but not LF (ASCII 10 AKA '\n'). That means that the kdb command parser
can handle terminals that send just CR or that send CR+LF but can't
handle terminals that send just LF.

The fact that kdb didn't handle LF in the command parser tripped up a
tool I tried to use with it. Specifically, I was trying to send a
command to my device to resume it from kdb using a ChromeOS tool like:
  dut-control cpu_uart_cmd:"g"
That tool only terminates lines with LF, not CR+LF.

Arguably the ChromeOS tool should be fixed. After all, officially kdb
seems to be designed such that CR+LF is the official line ending
transmitted over the wire and that internally a line ending is just
'\n' (LF). Some evidence:
* uart_poll_put_char(), which is used by kdb, notices a '\n' and
  converts it to '\r\n'.
* kdb functions specifically use '\r' to get a carriage return without
  a newline. You can see this in the pager where kdb will write a '\r'
  and then write over the pager prompt.

However, all that being said there's no real harm in accepting LF as a
command terminator in the kdb parser and doing so seems like it would
improve compatibility. After this, I'd expect that things would work
OK-ish with a remote terminal that used any of CR, CR+LF, or LF as a
line ending. Someone using CR as a line ending might get some ugliness
where kdb wasn't able to overwrite the last line, but basic commands
would work. Someone using just LF as a line ending would probably also
work OK.

A few other notes:
- It can be noted that "bash" running on an "agetty" handles LF as a
  line termination with no complaints.
- Historically, kdb's "pager" actually handled either CR or LF fine. A
  very quick inspection would make one think that kdb's pager actually
  could have paged down two lines instead of one for anyone using
  CR+LF, but this is generally avoided because of kdb_input_flush().
- Conceivably one could argue that some of this special case logic
  belongs in uart_poll_get_char() since uart_poll_put_char() handles
  the '\n' => '\r\n' conversion. I would argue that perhaps we should
  eventually do the opposite and move the '\n' => '\r\n' out of
  uart_poll_put_char(). Having that conversion at such a low level
  could interfere if we ever want to transfer binary data. In
  addition, if we truly made uart_poll_get_char() the inverse of
  uart_poll_put_char() it would convert back to '\n' and (ironically)
  kdb's parser currently only looks for '\r' to find the end of a
  command.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20230628125612.1.I5cc6c3d916195f5bcfdf5b75d823f2037707f5dc@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2023-06-30 16:44:29 +01:00
Andy Shevchenko 67a4e1a3bf irqdomain: Use return value of strreplace()
Since strreplace() returns the pointer to the string itself, use it
directly.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230628150251.17832-1-andriy.shevchenko@linux.intel.com
2023-06-30 11:13:44 +02:00
Stanislav Fomichev 819d43428a bpf: Resolve modifiers when walking structs
It is impossible to use skb_frag_t in the tracing program. Resolve typedefs
when walking structs.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230626212522.2414485-1-sdf@google.com
2023-06-30 10:35:59 +02:00
Linus Torvalds 1e6d5dea34 dma-mapping uodates for Linux 6.5
- swiotlb cleanups (Petr Tesarik)
  - use kvmalloc_array (gaoxu)
  - a small step towards removing is_swiotlb_active (Christoph Hellwig)
  - fix a Kconfig typo Sui Jingfeng)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmScHTELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOoGhAAuKdcWAumhEuKXu/vuA4/xH5Kvn2uhrAVs/RSKKh7
 Wbh2WXh+9GOQs4dwpYI5/chRmCJgyC2ics/E7dny2ovJumM/RGOR9zff/+zJk5yh
 3tpiZ4VW/2x/9qmoRpdQVTaj1uB2Ug8wpztUohyeEYPpjLX0/Vq/zTItAyvQs4kO
 vskg2u9EGwzgV4qmHWq2fjowljHS1sMZq2URKGX8JDHrwOh0+uCke626ze/WznSQ
 5quWW4mE0IYq/nldIMTvJxUx6273zBMm+yoZSfFY6K0VrP3WDE5endP5oYR7d251
 zNRPyWqW78HQT0z/gntaPOKDB0t7YxsWVLwam98Yv8wl1wLJQYe5sRoL7qXL0yBz
 cu4uTOGU6Kvw3ZHWQOk2qkGxYAxDLnaFejz8ZDCycWQA7T2+TNdaVXNnl/vE+aRN
 25mnT0JtClfZOvHihhHKaiRXMKE0WgPy4vBZBT0SAUVuUR7JwtYDv/5giGf6Bi3x
 QwNeXAmFD75VELJoqX1KL/UiJh8FzyWZMI7l7XmDVy9lJsA1P3PO+npI8SOmA+e8
 q4GSD36BxDoDE06Hg/BQcRqPohDCEsrRgfvOaa7xld9SCwe1lGaN3kDsyv6HoBzX
 NRS3aishdqHf21dNctjX9f+II+30NFkPbAa64B8L4tP5aQLy3bgjNLweLP9fhKgl
 KQs=
 =yYCS
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.5-2023-06-28' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - swiotlb cleanups (Petr Tesarik)

 - use kvmalloc_array (gaoxu)

 - a small step towards removing is_swiotlb_active (Christoph Hellwig)

 - fix a Kconfig typo Sui Jingfeng)

* tag 'dma-mapping-6.5-2023-06-28' of git://git.infradead.org/users/hch/dma-mapping:
  drm/nouveau: stop using is_swiotlb_active
  swiotlb: use the atomic counter of total used slabs if available
  swiotlb: remove unused field "used" from struct io_tlb_mem
  dma-remap: use kvmalloc_array/kvfree for larger dma memory remap
  dma-mapping: fix a Kconfig typo
2023-06-29 21:12:20 -07:00
Linus Torvalds 82a2a51055 sysctl-6.5-rc1-fixes
Linus, included in this pull request is just a minor fix which
 Matthieu Baerts noted I had not picked up. I adjusted the Fixes
 commit ID to reflect the latest tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmSeA/ISHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinUq4QAIGJi5oRY+tAqzJ1VigkUgwheDeND3KL
 ixwn/9RoZ6Sk9hpXbFVdS4pUGueYmyCnbeKF48aTYKMSRV8Vm3E61RevYnpjSXBA
 D+zhJ80NS95MacrHM0cK77UD/A8T4Y57e/rUJfL/3RqWMPpJbPciqO/Y+9MZJRFT
 RCC8jbPUjdmvbccEMQhALSZb2nOUtHouUWgG4SgyiELjY5djNojINEOQ1xIh9W1G
 LWWCM/99NeDZfQtF+36k6F1uc568xrSunELXBKcQBk6PlETt4dtQta/PqEAQ+tHT
 qZDnwepw9RCNPxuGv0oialYe3weFB4zC2kr3ZYePs0qLfM4GGqXjDy7bIT5qa59V
 9AsAhSEuLu/ZAnYgQNMcJZss/PGrxSDJbKOUEW3pJ00nlj9zwWXpGX4jJsBNKdoe
 smalelXUOR3Kr5BZm7cm0125kiQrvnMx8L/5waG81sCKsW+LcHQgTJu8RVD49f5H
 k/LCUlxk3jyXO3xlyYvQKzpA70P4vromIhWjBxcK2cAbPJL5N6ux4WX7L+m/niID
 cD0/74XJ+es1qAUHkl1on5s7U1DOryY4xJA+liQK/pKFIMyuKaUYlLmgD8lworI4
 jqJ4BgtSASc/W81TX7x32CAbKWwjxKDQK3mLBUdVAJPtcr5vFOnmHh2vgZ0/A/yI
 eMgoAPB5E+UD
 =Tyyn
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl fix from Luis Chamberlain:
 "A missed minor fix which Matthieu Baerts noted I had not picked up"

* tag 'sysctl-6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: fix unused proc_cap_handler() function warning
2023-06-29 17:44:56 -07:00
Linus Torvalds 3ad7b12c72 tracing: Fix user event write on buffer disabled
The user events write currently returns the size of what was suppose to be
 written when tracing is disabled and nothing was written. Instead, behave like
 trace_marker and return -EBADF, as that is what is returned if a file is opened
 for read only, and a write is performed on it. Writing to the buffer
 that is disabled is like trying to write to a file opened for read
 only, as the buffer still can be read, but just not written to.
 
 This also includes test cases for this use case
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZJxLzRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9eAP0ZRDempFNyMhi+pfENtwv65CHxRX/3
 3s1Lmt04oqwXfgD/dCfojrVAd++kpq3p9cGxJYWuNiM4s47mlD8VLgQ7AAY=
 =aZ8/
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.4-rc7-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix user event write on buffer disabled.

  The user events write currently returns the size of what was supposed
  to be written when tracing is disabled and nothing was written.

  Instead, behave like trace_marker and return -EBADF, as that is what
  is returned if a file is opened for read only, and a write is
  performed on it. Writing to the buffer that is disabled is like trying
  to write to a file opened for read only, as the buffer still can be
  read, but just not written to.

  This also includes test cases for this use case"

* tag 'trace-v6.4-rc7-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  selftests/user_events: Add test cases when event is disabled
  selftests/user_events: Enable the event before write_fault test in ftrace self-test
  tracing/user_events: Fix incorrect return value for writing operation when events are disabled
2023-06-29 17:36:00 -07:00
Linus Torvalds 632f54b4d6 slab updates for 6.5
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmSZtjsACgkQu+CwddJF
 iJqCTwf/XVhmAD7zMOj6g1aak5oHNZDRG5jufM5UNXmiWjCWT3w4DpltrJkz0PPm
 mg3Ac5fjNUqesZ1SGtUbvoc363smroBrRudGEFrsUhqBcpR+S4fSneoDk+xqMypf
 VLXP/8kJlFEBGMiR7ouAWnR4+u6JgY4E8E8JIPNzao5KE/L1lD83nY+Usjc/01ek
 oqMyYVFRfncsGjGJXc5fOOTTCj768mRroF0sLmEegIonnwQkSHE7HWJ/nyaVraDV
 bomnTIgMdVIDqharin08ZPIM7qBIWM09Uifaf0lIs6fIA94pQP+5Ko3mum2P/S+U
 ON/qviSrlNgRXoHPJ3hvPHdfEU9cSg==
 =1d0v
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - SLAB deprecation:

   Following the discussion at LSF/MM 2023 [1] and no objections, the
   SLAB allocator is deprecated by renaming the config option (to make
   its users notice) to CONFIG_SLAB_DEPRECATED with updated help text.
   SLUB should be used instead. Existing defconfigs with CONFIG_SLAB are
   also updated.

 - SLAB_NO_MERGE kmem_cache flag (Jesper Dangaard Brouer):

   There are (very limited) cases where kmem_cache merging is
   undesirable, and existing ways to prevent it are hacky. Introduce a
   new flag to do that cleanly and convert the existing hacky users.
   Btrfs plans to use this for debug kernel builds (that use case is
   always fine), networking for performance reasons (that should be very
   rare).

 - Replace the usage of weak PRNGs (David Keisar Schmidt):

   In addition to using stronger RNGs for the security related features,
   the code is a bit cleaner.

 - Misc code cleanups (SeongJae Parki, Xiongwei Song, Zhen Lei, and
   zhaoxinchao)

Link: https://lwn.net/Articles/932201/ [1]

* tag 'slab-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab_common: use SLAB_NO_MERGE instead of negative refcount
  mm/slab: break up RCU readers on SLAB_TYPESAFE_BY_RCU example code
  mm/slab: add a missing semicolon on SLAB_TYPESAFE_BY_RCU example code
  mm/slab_common: reduce an if statement in create_cache()
  mm/slab: introduce kmem_cache flag SLAB_NO_MERGE
  mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED
  mm/slab: remove HAVE_HARDENED_USERCOPY_ALLOCATOR
  mm/slab_common: Replace invocation of weak PRNG
  mm/slab: Replace invocation of weak PRNG
  slub: Don't read nr_slabs and total_objects directly
  slub: Remove slabs_node() function
  slub: Remove CONFIG_SMP defined check
  slub: Put objects_show() into CONFIG_SLUB_DEBUG enabled block
  slub: Correct the error code when slab_kset is NULL
  mm/slab: correct return values in comment for _kmem_cache_create()
2023-06-29 16:34:12 -07:00
Arnd Bergmann 554588e8e9 sysctl: fix unused proc_cap_handler() function warning
Since usermodehelper_table() is marked static now, we get a
warning about it being unused when SYSCTL is disabled:

kernel/umh.c:497:12: error: 'proc_cap_handler' defined but not used [-Werror=unused-function]

Just move it inside of the same #ifdef.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Fixes: 861dc0b464 ("sysctl: move umh sysctl registration to its own file")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
[mcgrof: adjust new commit ID for Fixes tag]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-29 15:19:43 -07:00
Arnd Bergmann 0914e4d3cd kdb: include kdb_private.h for function prototypes
The kdb_kbd_cleanup_state() is called from another file through
the kdb_private.h file, but that is not included before the
definition, causing a W=1 warning:

kernel/debug/kdb/kdb_keyboard.c:198:6: error: no previous prototype for 'kdb_kbd_cleanup_state' [-Werror=missing-prototypes]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230517124802.929751-1-arnd@kernel.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2023-06-29 15:10:17 +01:00
Fangrui Song bbaf1ff06a bpf: Replace deprecated -target with --target= for Clang
The -target option has been deprecated since clang 3.4 in 2013. Therefore, use
the preferred --target=bpf form instead. This also matches how we use --target=
in scripts/Makefile.clang.

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: 274b6f0c87
Link: https://lore.kernel.org/bpf/20230624001856.1903733-1-maskray@google.com
2023-06-29 15:46:17 +02:00
Petr Tesarik 8ac0406335 swiotlb: reduce the number of areas to match actual memory pool size
Although the desired size of the SWIOTLB memory pool is increased in
swiotlb_adjust_nareas() to match the number of areas, the actual allocation
may be smaller, which may require reducing the number of areas.

For example, Xen uses swiotlb_init_late(), which in turn uses the page
allocator. On x86, page size is 4 KiB and MAX_ORDER is 10 (1024 pages),
resulting in a maximum memory pool size of 4 MiB. This corresponds to 2048
slots of 2 KiB each. The minimum area size is 128 (IO_TLB_SEGSIZE),
allowing at most 2048 / 128 = 16 areas.

If num_possible_cpus() is greater than the maximum number of areas, areas
are smaller than IO_TLB_SEGSIZE and contiguous groups of free slots will
span multiple areas. When allocating and freeing slots, only one area will
be properly locked, causing race conditions on the unlocked slots and
ultimately data corruption, kernel hangs and crashes.

Fixes: 20347fca71 ("swiotlb: split up the global swiotlb lock")
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-29 07:10:28 +02:00
Petr Tesarik aabd12609f swiotlb: always set the number of areas before allocating the pool
The number of areas defaults to the number of possible CPUs. However, the
total number of slots may have to be increased after adjusting the number
of areas. Consequently, the number of areas must be determined before
allocating the memory pool. This is even explained with a comment in
swiotlb_init_remap(), but swiotlb_init_late() adjusts the number of areas
after slots are already allocated. The areas may end up being smaller than
IO_TLB_SEGSIZE, which breaks per-area locking.

While fixing swiotlb_init_late(), move all relevant comments before the
definition of swiotlb_adjust_nareas() and convert them to kernel-doc.

Fixes: 20347fca71 ("swiotlb: split up the global swiotlb lock")
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-29 07:10:00 +02:00
Linus Torvalds 3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds 6a8cbd9253 v6.5-rc1-sysctl-next
The changes queued up for v6.5-rc1 for sysctl are in line with
 prior efforts to stop usage of deprecated routines which incur
 recursion and also make it hard to remove the empty array element
 in each sysctl array declaration. The most difficult user to modify
 was parport which required a bit of re-thinking of how to declare shared
 sysctls there, Joel Granados has stepped up to the plate to do most of
 this work and eventual removal of register_sysctl_table(). That work
 ended up saving us about 1465 bytes according to bloat-o-meter. Since
 we gained a few bloat-o-meter karma points I moved two rather small
 sysctl arrays from kernel/sysctl.c leaving us only two more sysctl
 arrays to move left.
 
 Most changes have been tested on linux-next for about a month. The last
 straggler patches are a minor parport fix, changes to the sysctl
 kernel selftest so to verify correctness and prevent regressions for
 the future change he made to provide an alternative solution for the
 special sysctl mount point target which was using the now deprecated
 sysctl child element.
 
 This is all prep work to now finally be able to remove the empty
 array element in all sysctl declarations / registrations which is
 expected to save us a bit of bytes all over the kernel. That work
 will be tested early after v6.5-rc1 is out.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmSceh0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinBFQQAK9WdpcU8ODoDzoSls4jsCQpZUCfZ+ED
 pbCgQqUqu9VPs6bnJ+aXVa6Fh3uCr6+TIfNFM55qI/Sbo2issZ7bm0nvKmGgc6/m
 giqDP7btvHqiAsEootci8DVdbBXKkdH4dx3pSwleyN8pdinewH0hrKImaPpahyo6
 1mB1du0iI89yjsZmheHVVSyfXXYAnP0PqRVy5Y+qxY7yYlIegQ5uAZmwRE62lfTf
 TuiV7OFuDZ2DBYOmqIhfGKGRnfOL5ZVF3iHCrfUpX3p+fEFzDmwvm3vr73PTSrFw
 /aRRLa/hOWr5ilw1bvnMcazgQzFEOlQb3DMhBKH7gLl3XHVrM+TaaqYHjUia1+6Y
 e2axz/duA2q9uLMW81daRApvHMCgy0exkpC7prfOxF5bgTe4TjA7ZWvGpqG1kPKT
 PPSxw80XvG5hLZm4tB0ZWJ5rOfFpiUGGneSeRQwyuClBt73SIO+F03jyGpt83slU
 jFE50ac14Zwh1oxpCQtYoR1+bXWdq1QwM5vQBNEuaoTSnJfVjrXqBz/BnqJChtjr
 m1vA27+4/dfki2P3gVWF1lGx43ir3uJvqk+BjWXm2CDDJqpRi3N0qcUwZwLuqAAz
 /LEgFqK61bpHi/C8c2NWAxIoeWRU4NUOaoiKmZwyt0sKAWU1Yzg70xssYeg7VYqZ
 3pvFNVBqkV+F
 =sXUU
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "The changes for sysctl are in line with prior efforts to stop usage of
  deprecated routines which incur recursion and also make it hard to
  remove the empty array element in each sysctl array declaration.

  The most difficult user to modify was parport which required a bit of
  re-thinking of how to declare shared sysctls there, Joel Granados has
  stepped up to the plate to do most of this work and eventual removal
  of register_sysctl_table(). That work ended up saving us about 1465
  bytes according to bloat-o-meter. Since we gained a few bloat-o-meter
  karma points I moved two rather small sysctl arrays from
  kernel/sysctl.c leaving us only two more sysctl arrays to move left.

  Most changes have been tested on linux-next for about a month. The
  last straggler patches are a minor parport fix, changes to the sysctl
  kernel selftest so to verify correctness and prevent regressions for
  the future change he made to provide an alternative solution for the
  special sysctl mount point target which was using the now deprecated
  sysctl child element.

  This is all prep work to now finally be able to remove the empty array
  element in all sysctl declarations / registrations which is expected
  to save us a bit of bytes all over the kernel. That work will be
  tested early after v6.5-rc1 is out"

* tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: replace child with an enumeration
  sysctl: Remove debugging dump_stack
  test_sysclt: Test for registering a mount point
  test_sysctl: Add an option to prevent test skip
  test_sysctl: Add an unregister sysctl test
  test_sysctl: Group node sysctl test under one func
  test_sysctl: Fix test metadata getters
  parport: plug a sysctl register leak
  sysctl: move security keys sysctl registration to its own file
  sysctl: move umh sysctl registration to its own file
  signal: move show_unhandled_signals sysctl to its own file
  sysctl: remove empty dev table
  sysctl: Remove register_sysctl_table
  sysctl: Refactor base paths registrations
  sysctl: stop exporting register_sysctl_table
  parport: Removed sysctl related defines
  parport: Remove register_sysctl_table from parport_default_proc_register
  parport: Remove register_sysctl_table from parport_device_proc_register
  parport: Remove register_sysctl_table from parport_proc_register
  parport: Move magic number "15" to a define
2023-06-28 16:05:21 -07:00
Linus Torvalds 4e3c09e954 v6.5-rc1-modules-next
The changes queued up for v6.5-rc1 for modules are pretty tame, mostly
 code removal of moving of code. Only two minor functional changes are
 made, the only one which stands out is Sebastian Andrzej Siewior's
 simplification of module reference counting by removing preempt_disable()
 and that has been tested on linux-next for well over a month without
 no regressions. I'm now, I guess, also a kitchen sink for some kallsyms
 changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmScfl0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoin8oMP/0DQK+r3BZimknzz6rF0EBStNZ/dIK2W
 1Q/r/ER4VKQKYxklc1M74K+7IX8ZDCYxqlaDS9lvAkRDWNC+t69aNZEib2odJleC
 p6WB30P0JIwfZZC0DS/ct3vrWZTyUhw7aOtvABRmjBfiJ3lFlU092Glvk1w1aFbD
 UrNRomPu4CujzfmnGj3VGc+HVSOEK0F1/GLm9ClrsR8SzKEpQmH4ALI/ON69B0ea
 PmL+d1Wyt6WEoH0hlV1TOXNdHUb3ZO1riSSfDYQ7TiG2AM5w1t4n26YRusc16hYU
 6Bx4OGt52ZJYR3btsRQlcylF4R5DUo+boDkM0NqEDU/3ciGMg6DgKdHnYCBN1w+X
 ZO8aXK1MIgF7W6CqSz+8HCsu5CuCos55FgM22dPbpZr3OEFCWemqnV+cYCu1DA+M
 Gbnn883ZLtt+R+qikD3135s+LxYIvxSuQrj+B3ZoQeIKEtAlyxuhrUJbU0tOns0j
 05PrkI8J1FtIysdlNZeIFg752IPtjp/0QNB4R46m40mT16L0TSjEP7c+zcPDryMb
 84SdLqh1gis0QZRkoH6JbMBDeT2dtuxqtQ5dTPka4s1mtg3SvRYr53sCJg+gQ8e2
 CBW6jgrIf3F4RIMMiSfXpSf4yVVxXxJAEFnGLRXhQ2HkUnk3mdGEfsZc7ucrsnlK
 f/KwaEzmLD9c
 =gjKD
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
 "The changes queued up for modules are pretty tame, mostly code removal
  of moving of code.

  Only two minor functional changes are made, the only one which stands
  out is Sebastian Andrzej Siewior's simplification of module reference
  counting by removing preempt_disable() and that has been tested on
  linux-next for well over a month without no regressions.

  I'm now, I guess, also a kitchen sink for some kallsyms changes"

[ There was a mis-communication about the concurrent module load changes
  that I had expected to come through Luis despite me authoring the
  patch. So some of the module updates were left hanging in the email
  ether, and I just committed them separately.

  It's my bad - I should have made it more clear that I expected my
  own patches to come through the module tree too. Now they missed
  linux-next, but hopefully that won't cause any issues    - Linus ]

* tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kallsyms: make kallsyms_show_value() as generic function
  kallsyms: move kallsyms_show_value() out of kallsyms.c
  kallsyms: remove unsed API lookup_symbol_attrs
  kallsyms: remove unused arch_get_kallsym() helper
  module: Remove preempt_disable() from module reference counting.
2023-06-28 15:51:08 -07:00
Linus Torvalds 9b9879fc03 modules: catch concurrent module loads, treat them as idempotent
This is the new-and-improved attempt at avoiding huge memory load spikes
when the user space boot sequence tries to load hundreds (or even
thousands) of redundant duplicate modules in parallel.

See commit 9828ed3f69 ("module: error out early on concurrent load of
the same module file") for background and an earlier failed attempt that
was reverted.

That earlier attempt just said "concurrently loading the same module is
silly, just open the module file exclusively and return -ETXTBSY if
somebody else is already loading it".

While it is true that concurrent module loads of the same module is
silly, the reason that earlier attempt then failed was that the
concurrently loaded module would often be a prerequisite for another
module.

Thus failing to load the prerequisite would then cause cascading
failures of the other modules, rather than just short-circuiting that
one unnecessary module load.

At the same time, we still really don't want to load the contents of the
same module file hundreds of times, only to then wait for an eventually
successful load, and have everybody else return -EEXIST.

As a result, this takes another approach, and treats concurrent module
loads from the same file as "idempotent" in the inode.  So if one module
load is ongoing, we don't start a new one, but instead just wait for the
first one to complete and return the same return value as it did.

So unlike the first attempt, this does not return early: the intent is
not to speed up the boot, but to avoid a thundering herd problem in
allocating memory (both physical and virtual) for a module more than
once.

Also note that this does change behavior: it used to be that when you
had concurrent loads, you'd have one "winner" that would return success,
and everybody else would return -EEXIST.

In contrast, this idempotent logic goes all Oprah on the problem, and
says "You are a winner! And you are a winner! We are ALL winners".  But
since there's no possible actual real semantic difference between "you
loaded the module" and "somebody else already loaded the module", this
is more of a feel-good change than an actual honest-to-goodness semantic
change.

Of course, any true Johnny-come-latelies that don't get caught in the
concurrency filter will still return -EEXIST.  It's no different from
not even getting a seat at an Oprah taping.  That's life.

See the long thread on the kernel mailing list about this all, which
includes some numbers for memory use before and after the patch.

Link: https://lore.kernel.org/lkml/20230524213620.3509138-1-mcgrof@kernel.org/
Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Rudi Heitbaum <rudi@heitbaum..com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28 15:46:44 -07:00
Linus Torvalds 054a73009c module: split up 'finit_module()' into init_module_from_file() helper
This will simplify the next step, where we can then key off the inode to
do one idempotent module load.

Let's do the obvious re-organization in one step, and then the new code
in another.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28 15:46:43 -07:00
Linus Torvalds 77b1a7f7a0 - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in
top-level directories.
 
 - Douglas Anderson has added a new "buddy" mode to the hardlockup
   detector.  It permits the detector to work on architectures which
   cannot provide the required interrupts, by having CPUs periodically
   perform checks on other CPUs.
 
 - Zhen Lei has enhanced kexec's ability to support two crash regions.
 
 - Petr Mladek has done a lot of cleanup on the hard lockup detector's
   Kconfig entries.
 
 - And the usual bunch of singleton patches in various places.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJelTAAKCRDdBJ7gKXxA
 juDkAP0VXWynzkXoojdS/8e/hhi+htedmQ3v2dLZD+vBrctLhAEA7rcH58zAVoWa
 2ejqO6wDrRGUC7JQcO9VEjT0nv73UwU=
 =F293
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-mm updates from Andrew Morton:

 - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
   directories

 - Douglas Anderson has added a new "buddy" mode to the hardlockup
   detector. It permits the detector to work on architectures which
   cannot provide the required interrupts, by having CPUs periodically
   perform checks on other CPUs

 - Zhen Lei has enhanced kexec's ability to support two crash regions

 - Petr Mladek has done a lot of cleanup on the hard lockup detector's
   Kconfig entries

 - And the usual bunch of singleton patches in various places

* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
  kernel/time/posix-stubs.c: remove duplicated include
  ocfs2: remove redundant assignment to variable bit_off
  watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
  powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
  devres: show which resource was invalid in __devm_ioremap_resource()
  watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
  watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
  watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
  watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
  watchdog/hardlockup: make the config checks more straightforward
  watchdog/hardlockup: sort hardlockup detector related config values a logical way
  watchdog/hardlockup: move SMP barriers from common code to buddy code
  watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
  watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
  watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
  watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
  watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
  watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
  watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
  watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
  ...
2023-06-28 10:59:38 -07:00
Linus Torvalds 6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
sunliming f6d026eea3 tracing/user_events: Fix incorrect return value for writing operation when events are disabled
The writing operation return the count of writes regardless of whether events
are enabled or disabled. Switch it to return -EBADF to indicates that the event
is disabled.

Link: https://lkml.kernel.org/r/20230626111344.19136-2-sunliming@kylinos.cn

Cc: stable@vger.kernel.org
7f5a08c79d ("user_events: Add minimal support for trace_event into ftrace")
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-28 11:00:50 -04:00
Linus Torvalds 582c161cf3 hardening updates for v6.5-rc1
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
 
 - Convert strreplace() to return string start (Andy Shevchenko)
 
 - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
 
 - Add missing function prototypes seen with W=1 (Arnd Bergmann)
 
 - Fix strscpy() kerndoc typo (Arne Welzel)
 
 - Replace strlcpy() with strscpy() across many subsystems which were
   either Acked by respective maintainers or were trivial changes that
   went ignored for multiple weeks (Azeem Shaikh)
 
 - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
 
 - Add KUnit tests for strcat()-family
 
 - Enable KUnit tests of FORTIFY wrappers under UML
 
 - Add more complete FORTIFY protections for strlcat()
 
 - Add missed disabling of FORTIFY for all arch purgatories.
 
 - Enable -fstrict-flex-arrays=3 globally
 
 - Tightening UBSAN_BOUNDS when using GCC
 
 - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
 
 - Improve use of const variables in FORTIFY
 
 - Add requested struct_size_t() helper for types not pointers
 
 - Add __counted_by macro for annotating flexible array size members
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
 GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
 9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
 2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
 iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
 9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
 sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
 5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
 n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
 uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
 OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
 cwANDmkL6QQK7yfeeg==
 =s0j1
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "There are three areas of note:

  A bunch of strlcpy()->strscpy() conversions ended up living in my tree
  since they were either Acked by maintainers for me to carry, or got
  ignored for multiple weeks (and were trivial changes).

  The compiler option '-fstrict-flex-arrays=3' has been enabled
  globally, and has been in -next for the entire devel cycle. This
  changes compiler diagnostics (though mainly just -Warray-bounds which
  is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
  coverage. In other words, there are no new restrictions, just
  potentially new warnings. Any new FORTIFY warnings we've seen have
  been fixed (usually in their respective subsystem trees). For more
  details, see commit df8fc4e934.

  The under-development compiler attribute __counted_by has been added
  so that we can start annotating flexible array members with their
  associated structure member that tracks the count of flexible array
  elements at run-time. It is possible (likely?) that the exact syntax
  of the attribute will change before it is finalized, but GCC and Clang
  are working together to sort it out. Any changes can be made to the
  macro while we continue to add annotations.

  As an example of that last case, I have a treewide commit waiting with
  such annotations found via Coccinelle:

    https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b

  Also see commit dd06e72e68 for more details.

  Summary:

   - Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)

   - Convert strreplace() to return string start (Andy Shevchenko)

   - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)

   - Add missing function prototypes seen with W=1 (Arnd Bergmann)

   - Fix strscpy() kerndoc typo (Arne Welzel)

   - Replace strlcpy() with strscpy() across many subsystems which were
     either Acked by respective maintainers or were trivial changes that
     went ignored for multiple weeks (Azeem Shaikh)

   - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)

   - Add KUnit tests for strcat()-family

   - Enable KUnit tests of FORTIFY wrappers under UML

   - Add more complete FORTIFY protections for strlcat()

   - Add missed disabling of FORTIFY for all arch purgatories.

   - Enable -fstrict-flex-arrays=3 globally

   - Tightening UBSAN_BOUNDS when using GCC

   - Improve checkpatch to check for strcpy, strncpy, and fake flex
     arrays

   - Improve use of const variables in FORTIFY

   - Add requested struct_size_t() helper for types not pointers

   - Add __counted_by macro for annotating flexible array size members"

* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
  netfilter: ipset: Replace strlcpy with strscpy
  uml: Replace strlcpy with strscpy
  um: Use HOST_DIR for mrproper
  kallsyms: Replace all non-returning strlcpy with strscpy
  sh: Replace all non-returning strlcpy with strscpy
  of/flattree: Replace all non-returning strlcpy with strscpy
  sparc64: Replace all non-returning strlcpy with strscpy
  Hexagon: Replace all non-returning strlcpy with strscpy
  kobject: Use return value of strreplace()
  lib/string_helpers: Change returned value of the strreplace()
  jbd2: Avoid printing outside the boundary of the buffer
  checkpatch: Check for 0-length and 1-element arrays
  riscv/purgatory: Do not use fortified string functions
  s390/purgatory: Do not use fortified string functions
  x86/purgatory: Do not use fortified string functions
  acpi: Replace struct acpi_table_slit 1-element array with flex-array
  clocksource: Replace all non-returning strlcpy with strscpy
  string: use __builtin_memcpy() in strlcpy/strlcat
  staging: most: Replace all non-returning strlcpy with strscpy
  drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
  ...
2023-06-27 21:24:18 -07:00
Linus Torvalds 21953eb16c lsm/stable-6.5 PR 20230626
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmSZuh0UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNpuxAAxChGqME9nE7iITx1TaFRrbK49mDF
 1RZh/5cwzde72lLLFkTFKB6ErMSQkrrtA+jFH7vKsrOslBel1+yO80vkXmhYCeZU
 P3m0FeREUpuU4QV0tbQamPeR+SWohmKi2dYWd+VdpLA+1aTK3KNYsi2NFkDIreap
 BqeRq4S0Rqc4u3/5juk6JCGFhTRWaH16YJQrzIKHF/K3DK+gMhAY5sjuAWzFc6ma
 /5bbD55kdVVDfnsxNSe+lzJ7zEf7TYedLG6BN+R9cVrU+El12a38M29kASaAof5w
 vpb92a27hA9Q5EyQ2O9QXnr2L5CShT4bvAZCGkK4cmZerGNTdM0iojhYj1s7FAV/
 USkWgkDmEuSatp0+DdXlfQyUmZZWlw1W0oiEfZwR8w7TY7q9CU7aD8K7+GDSIazB
 g89nYznVjlaC/oA4/owMraoWP3eiDiAcsQdO052Vv63TVyJtTiRiKyBq5EFLrX8L
 iaUCa4cBaYFc94kN1PZeNXZKwqRc2F6oAFT1YuXnFWBGmixN0kUL023C0xjl/J7P
 02jYYSVzLm22aU39GU0DSnaLfAwl3muazOB3XuyGOhUWHFYzjkc9UhmGp0W50DkK
 qigW3ONA8s8CKUS/q7QSGq+Vf+CVZA5f+daDDPGYstPfCTk61eu0wjwfwek3W0o+
 xKzBr2Od3vTOzAs=
 =3nWy
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - A SafeSetID patch to correct what appears to be a cut-n-paste typo in
   the code causing a UID to be printed where a GID was desired.

   This is coming via the LSM tree because we haven't been able to get a
   response from the SafeSetID maintainer (Micah Morton) in several
   months. Hopefully we are able to get in touch with Micah, but until
   we do I'm going to pick them up in the LSM tree.

 - A small fix to the reiserfs LSM xattr code.

   We're continuing to work through some issues with the reiserfs code
   as we try to fixup the LSM xattr handling, but in the process we're
   uncovering some ugly problems in reiserfs and we may just end up
   removing the LSM xattr support in reiserfs prior to reiserfs'
   removal.

   For better or worse, this shouldn't impact any of the reiserfs users,
   as we discovered that LSM xattrs on reiserfs were completely broken,
   meaning no one is currently using the combo of reiserfs and a file
   labeling LSM.

 - A tweak to how the cap_user_data_t struct/typedef is declared in the
   header file to appease the Sparse gods.

 - In the process of trying to sort out the SafeSetID lost-maintainer
   problem I realized that I needed to update the labeled networking
   entry to "Supported".

 - Minor comment/documentation and spelling fixes.

* tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  device_cgroup: Fix kernel-doc warnings in device_cgroup
  SafeSetID: fix UID printed instead of GID
  MAINTAINERS: move labeled networking to "supported"
  capability: erase checker warnings about struct __user_cap_data_struct
  lsm: fix a number of misspellings
  reiserfs: Initialize sec->length in reiserfs_security_init().
  capability: fix kernel-doc warnings in capability.c
2023-06-27 17:24:26 -07:00
Linus Torvalds cae72026b5 audit/stable-6.5 PR 20230626
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmSZuaAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMd1Q//euuQA+nZbQN5II/Ht7bJPNtjxjKj
 OqZer5dXuwzfjpo2YXxmUrm7NPlyfa4hsyv0q70fV2bUbTy3j+ukAnxDdYv8DAcH
 QICdQiRWPkmOZBRq5TsMMhf/pet7ZNuKTZ8x8rHisZ2Oj7AO9I2eiQw5ihu8xND8
 NK1WRNEpZVKOVzWzjdUHLSArlQggNKAxsYGyiyeKROWtklAK+WAJfnVXgHRgcKq8
 KRi7kdDSN5ewTZIcT38ghGx3stBu/RJTICE5HE1MkQf+ZTLsNKum+bAhrzG2ZsWP
 zzo8UWdELdixO7VHFqnN/TTqk9aItjS4AHxoR8Vkt6VXTdjijaEvFDIEeuGSxjMW
 qszP7D4kjtsosLGMDCsR35yrG0yrISU9a5xqwZMbNcOCpbMLXg/FVs2WhtJ2MxDH
 4AYpbEQdkoSsfAL2F5c9+iKJ/ZFLPjdHM0yPsKfMPWWj8xuhZIwhmPOqh9Ce1vEF
 GpS7KGA/35wNpFPDu1KyQML1VanwKKSoerd4HU2SphyBwu8MYNRnxjClNCJkPgcc
 XWAxHOIFETUxxBEtVfTsHBytft0go7LpQkAXQ/4olcZBznQoHCr6F5qPhWWsaUDz
 WxIqZaiY4WnRER4M1lTgBqdLg2sGdsbRIMzMFhKkdLVFEaPLbZ+7HP4e/SnfzvoA
 Y+qTHowFNyOIEd4=
 =+w0u
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit update from Paul Moore:
 "A single audit patch that resolves two compiler warnings regarding
  missing function prototypes"

* tag 'audit-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: avoid missing-prototype warnings
2023-06-27 17:15:35 -07:00
Linus Torvalds 6e2332e0ab cgroup: Changes for v6.5
* Whenever cpuset needs to rebuild sched_domain, it walked all tasks looking
   for DEADLINE tasks as they need to be accounted on the new domain. Walking
   all tasks can be expensive and there may not be any DEADLINE tasks at all.
   Task iteration is now omitted if there are no DEADLINE tasks.
 
 * Fixes DEADLINE bandwidth misaccounting after task migration failures.
 
 * When no controller is enabled, -Wstringop-overflow warning is triggered.
   The fix patch added an early exit which is too eager and got reverted for
   now. Will fix later.
 
 * Everything else are minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJoRHw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGZatAQCKTv8pb5HEgochph4n26laSdVZs6ce3Y+s7V1T
 rum+3QD/TyJFmCkZSMscolZGFuafpg41sjPbmc4SexeuAMYCMgY=
 =nioD
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Whenever cpuset needs to rebuild sched_domain, it walked all tasks
   looking for DEADLINE tasks as they need to be accounted on the new
   domain. Walking all tasks can be expensive and there may not be any
   DEADLINE tasks at all. Task iteration is now omitted if there are no
   DEADLINE tasks

 - Fixes DEADLINE bandwidth misaccounting after task migration failures

 - When no controller is enabled, -Wstringop-overflow warning is
   triggered. The fix patch added an early exit which is too eager and
   got reverted for now. Will fix later

 - Everything else is minor cleanups

* tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: Avoid -Wstringop-overflow warnings"
  cgroup/misc: Expose misc.current on cgroup v2 root
  cgroup: Avoid -Wstringop-overflow warnings
  cgroup: remove obsolete comment on cgroup_on_dfl()
  cgroup: remove unused task_cgroup_path()
  cgroup/cpuset: remove unneeded header files
  cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static
  rdmacg: fix kernel-doc warnings in rdmacg
  cgroup: Replace the css_set call with cgroup_get
  cgroup: remove unused macro for_each_e_css()
  cgroup: Update out-of-date comment in cgroup_migrate()
  cgroup: Replace all non-returning strlcpy with strscpy
  cgroup/cpuset: remove unneeded header files
  cgroup/cpuset: Free DL BW in case can_attach() fails
  sched/deadline: Create DL BW alloc, free & check overflow interface
  cgroup/cpuset: Iterate only if DEADLINE tasks are present
  sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
  sched/cpuset: Bring back cpuset_mutex
  cgroup/cpuset: Rename functions dealing with DEADLINE accounting
2023-06-27 16:54:21 -07:00
Linus Torvalds 7ab044a4f4 workqueue: Changes for v6.5
* Concurrency-managed per-cpu work items that hog CPUs and delay the
   execution of other work items are now automatically detected and excluded
   from concurrency management. Reporting on such work items can also be
   enabled through a config option.
 
 * Added tools/workqueue/wq_monitor.py which improves visibility into
   workqueue usages and behaviors.
 
 * Includes Arnd's minimal fix for gcc-13 enum warning on 32bit compiles.
   This conflicts with afa4bb778e ("workqueue: clean up WORK_* constant
   types, clarify masking") in master. Can be resolved by picking the master
   version.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJoGvw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGZu0AP9IGK2opAzO9i3i1/Ys81b3sHi9PwrYWH3g252T
 Oe3O6QD/Wh0wYBVl0o7IdW6BGdd5iNwIEs420G53UmmPrATqsgQ=
 =TffY
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - Concurrency-managed per-cpu work items that hog CPUs and delay the
   execution of other work items are now automatically detected and
   excluded from concurrency management. Reporting on such work items
   can also be enabled through a config option.

 - Added tools/workqueue/wq_monitor.py which improves visibility into
   workqueue usages and behaviors.

 - Arnd's minimal fix for gcc-13 enum warning on 32bit compiles,
   superseded by commit afa4bb778e in mainline.

* tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
  workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
  workqueue: fix enum type for gcc-13
  workqueue: Track and monitor per-workqueue CPU time usage
  workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
  workqueue: Improve locking rule description for worker fields
  workqueue: Move worker_set/clr_flags() upwards
  workqueue: Re-order struct worker fields
  workqueue: Add pwq->stats[] and a monitoring script
  Further upgrade queue_work_on() comment
2023-06-27 16:32:52 -07:00
Linus Torvalds a193cc7506 Perf events changes for v6.5:
- Rework & fix the event forwarding logic by extending the
   core interface. This fixes AMD PMU events that have to
   be forwarded from the core PMU to the IBS PMU.
 
 - Add self-tests to test AMD IBS invocation via core PMU events
 
 - Clean up Intel FixCntrCtl MSR encoding & handling
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSayC0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jlWxAAqUPtfst1p6H5sSyCBPYo5Y/Rh0SyyqJj
 w0YZ8p2nbB/+EmIN3WS1uYhx1/AalTP254q2BgVF4DlDFQA1MlJCmSNJ9HhtzOgt
 mbpNKzy50cQCR/iH+s3ldcFsLGhSG07j6w8xeb6BGiABm2JoiZeg6iVU76zRe5A1
 iPnjC7qoqjKH+sq8pu32fBClMjzf05/LGMd0MqFuYfl5950xRW61olstjo93XWgK
 O5z+5wm5H3MhJ2mzU6x+0C/xurIEQ0zRf6AqLbFp41BbJJJORgTCK746flghiqd5
 DiADc7oj9eOqL1X9jFPHgE07T/6QPrMC8BoH64pOcM3PoZ6Iq3zTkUHxAw3qK5j+
 kqduxzlVaFLFnf7R/vxUvjMg1PM+qP3pqgCrT+NFUdqsdLgSPxRzt5pAM6aAUwmU
 1lhuapESH44RUFZGWrfOwzQE5q/FDmUc2yGyGW2aYDmwkclNjVpnvHEJrQMugI3M
 M3/y9a+ErcPDUJfHcodutBDGw9l7VhsxJFMt4ydOTkNbEfZLbi2TzNapui6SKFja
 G2efrB/HhrV9nE+21Wfa3uxoKMuJ/UPiGrVr2qyGOnShQpK7sdyGDshO1s6TTPye
 OoVf9I0LhewMPap52SU/KDP7GJVPW1BhL/C7w6OSnXxlS5k4lOji7z4Dj2hqXHib
 19Jm7BhqZwE=
 =xn05
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events updates from Ingo Molnar:

 - Rework & fix the event forwarding logic by extending the core
   interface.

   This fixes AMD PMU events that have to be forwarded from the
   core PMU to the IBS PMU.

 - Add self-tests to test AMD IBS invocation via core PMU events

 - Clean up Intel FixCntrCtl MSR encoding & handling

* tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Re-instate the linear PMU search
  perf/x86/intel: Define bit macros for FixCntrCtl MSR
  perf test: Add selftest to test IBS invocation via core pmu events
  perf/core: Remove pmu linear searching code
  perf/ibs: Fix interface via core pmu events
  perf/core: Rework forwarding of {task|cpu}-clock events
2023-06-27 14:43:02 -07:00
Linus Torvalds bc6cb4d5bc Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
 
   The cmpxchg128() family of functions is basically & functionally
   the same as cmpxchg_double(), but with a saner interface: instead
   of a 6-parameter horror that forced u128 - u64/u64-halves layout
   details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128 types.
 
 - Restructure the generated atomic headers, and add
   kerneldoc comments for all of the generic atomic{,64,_long}_t
   operations. Generated definitions are much cleaner now,
   and come with documentation.
 
 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering
   when taking multiple locks of the same type. This gets rid of
   one use of lockdep_set_novalidate_class() in the bcache code.
 
 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
   variable shadowing generating garbage code on Clang on certain
   ARM builds.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
 zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
 QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
 8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
 sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
 wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
 YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
 r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
 giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
 7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
 o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
 x9Nt+Tp0Ze4=
 =DsYj
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()

   The cmpxchg128() family of functions is basically & functionally the
   same as cmpxchg_double(), but with a saner interface.

   Instead of a 6-parameter horror that forced u128 - u64/u64-halves
   layout details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128
   types.

 - Restructure the generated atomic headers, and add kerneldoc comments
   for all of the generic atomic{,64,_long}_t operations.

   The generated definitions are much cleaner now, and come with
   documentation.

 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
   taking multiple locks of the same type.

   This gets rid of one use of lockdep_set_novalidate_class() in the
   bcache code.

 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
   shadowing generating garbage code on Clang on certain ARM builds.

* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
  percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
  locking/atomic: treewide: delete arch_atomic_*() kerneldoc
  locking/atomic: docs: Add atomic operations to the driver basic API documentation
  locking/atomic: scripts: generate kerneldoc comments
  docs: scripts: kernel-doc: accept bitwise negation like ~@var
  locking/atomic: scripts: simplify raw_atomic*() definitions
  locking/atomic: scripts: simplify raw_atomic_long*() definitions
  locking/atomic: scripts: split pfx/name/sfx/order
  locking/atomic: scripts: restructure fallback ifdeffery
  locking/atomic: scripts: build raw_atomic_long*() directly
  locking/atomic: treewide: use raw_atomic*_<op>()
  locking/atomic: scripts: add trivial raw_atomic*_<op>()
  locking/atomic: scripts: factor out order template generation
  locking/atomic: scripts: remove leftover "${mult}"
  locking/atomic: scripts: remove bogus order parameter
  locking/atomic: xtensa: add preprocessor symbols
  locking/atomic: x86: add preprocessor symbols
  locking/atomic: sparc: add preprocessor symbols
  locking/atomic: sh: add preprocessor symbols
  ...
2023-06-27 14:14:30 -07:00
Linus Torvalds ed3b7923a8 Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements:
 
     - Avoid unnecessary migrations within SMT domains on hybrid systems.
 
       Problem:
 
         On hybrid CPU systems, (processors with a mixture of higher-frequency
 	SMT cores and lower-frequency non-SMT cores), under the old code
 	lower-priority CPUs pulled tasks from the higher-priority cores if
 	more than one SMT sibling was busy - resulting in many unnecessary
 	task migrations.
 
       Solution:
 
         The new code improves the load balancer to recognize SMT cores with more
         than one busy sibling and allows lower-priority CPUs to pull tasks, which
         avoids superfluous migrations and lets lower-priority cores inspect all SMT
         siblings for the busiest queue.
 
     - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU
       contention in frequency, EAS max util & load-balance busiest CPU selection.
 
       This improves CPU utilization for certain workloads, while leaves other key
       workloads unchanged.
 
 - Scheduler infrastructure improvements:
 
     - Rewrite the scheduler topology setup code by consolidating it
       into the build_sched_topology() helper function and building
       it dynamically on the fly.
 
     - Resolve the local_clock() vs. noinstr complications by rewriting
       the code: provide separate sched_clock_noinstr() and
       local_clock_noinstr() functions to be used in instrumentation code,
       and make sure it is all instrumentation-safe.
 
 - Fixes:
 
     - Fix a kthread_park() race with wait_woken()
 
     - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
        - Fix UP PREEMPT bug by unifying the SMP and UP implementations.
        - Fix task_struct::saved_state handling.
 
     - Fix various rq clock update bugs, unearthed by turning on the rq clock
       debugging code.
 
     - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by
       creating enough cgroups, by removing the warnign and restricting
       window size triggers to PSI file write-permission or CAP_SYS_RESOURCE.
 
     - Propagate SMT flags in the topology when removing degenerate domain
 
     - Fix grub_reclaim() calculation bug in the deadline scheduler code
 
     - Avoid resetting the min update period when it is unnecessary, in
       psi_trigger_destroy().
 
     - Don't balance a task to its current running CPU in load_balance(),
       which was possible on certain NUMA topologies with overlapping
       groups.
 
     - Fix the sched-debug printing of rq->nr_uninterruptible
 
 - Cleanups:
 
     - Address various -Wmissing-prototype warnings, as a preparation
       to (maybe) enable this warning in the future.
 
     - Remove unused code
 
     - Mark more functions __init
 
     - Fix shadow-variable warnings
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS
 qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w
 egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk
 o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo
 9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR
 u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf
 vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl
 AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T
 wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr
 4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF
 VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N
 PiDbtX760ic=
 =EWQA
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler SMP load-balancer improvements:

   - Avoid unnecessary migrations within SMT domains on hybrid systems.

     Problem:

        On hybrid CPU systems, (processors with a mixture of
        higher-frequency SMT cores and lower-frequency non-SMT cores),
        under the old code lower-priority CPUs pulled tasks from the
        higher-priority cores if more than one SMT sibling was busy -
        resulting in many unnecessary task migrations.

     Solution:

        The new code improves the load balancer to recognize SMT cores
        with more than one busy sibling and allows lower-priority CPUs
        to pull tasks, which avoids superfluous migrations and lets
        lower-priority cores inspect all SMT siblings for the busiest
        queue.

   - Implement the 'runnable boosting' feature in the EAS balancer:
     consider CPU contention in frequency, EAS max util & load-balance
     busiest CPU selection.

     This improves CPU utilization for certain workloads, while leaves
     other key workloads unchanged.

  Scheduler infrastructure improvements:

   - Rewrite the scheduler topology setup code by consolidating it into
     the build_sched_topology() helper function and building it
     dynamically on the fly.

   - Resolve the local_clock() vs. noinstr complications by rewriting
     the code: provide separate sched_clock_noinstr() and
     local_clock_noinstr() functions to be used in instrumentation code,
     and make sure it is all instrumentation-safe.

  Fixes:

   - Fix a kthread_park() race with wait_woken()

   - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
       - Fix UP PREEMPT bug by unifying the SMP and UP implementations
       - Fix task_struct::saved_state handling

   - Fix various rq clock update bugs, unearthed by turning on the rq
     clock debugging code.

   - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
     by creating enough cgroups, by removing the warnign and restricting
     window size triggers to PSI file write-permission or
     CAP_SYS_RESOURCE.

   - Propagate SMT flags in the topology when removing degenerate domain

   - Fix grub_reclaim() calculation bug in the deadline scheduler code

   - Avoid resetting the min update period when it is unnecessary, in
     psi_trigger_destroy().

   - Don't balance a task to its current running CPU in load_balance(),
     which was possible on certain NUMA topologies with overlapping
     groups.

   - Fix the sched-debug printing of rq->nr_uninterruptible

  Cleanups:

   - Address various -Wmissing-prototype warnings, as a preparation to
     (maybe) enable this warning in the future.

   - Remove unused code

   - Mark more functions __init

   - Fix shadow-variable warnings"

* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
  sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
  sched/core: Fixed missing rq clock update before calling set_rq_offline()
  sched/deadline: Update GRUB description in the documentation
  sched/deadline: Fix bandwidth reclaim equation in GRUB
  sched/wait: Fix a kthread_park race with wait_woken()
  sched/topology: Mark set_sched_topology() __init
  sched/fair: Rename variable cpu_util eff_util
  arm64/arch_timer: Fix MMIO byteswap
  sched/fair, cpufreq: Introduce 'runnable boosting'
  sched/fair: Refactor CPU utilization functions
  cpuidle: Use local_clock_noinstr()
  sched/clock: Provide local_clock_noinstr()
  x86/tsc: Provide sched_clock_noinstr()
  clocksource: hyper-v: Provide noinstr sched_clock()
  clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
  x86/vdso: Fix gettimeofday masking
  math64: Always inline u128 version of mul_u64_u64_shr()
  s390/time: Provide sched_clock_noinstr()
  loongarch: Provide noinstr sched_clock_read()
  ...
2023-06-27 14:03:21 -07:00
Linus Torvalds af96134dc8 RCU pull request for v6.5
This pull contains the following branches:
 
 doc.2023.05.10a: Documentation updates
 
 fixes.2023.05.11a: Miscellaneous fixes, perhaps most notably:
 
 o	Remove RCU_NONIDLE().  The new visibility of most of the idle
 	loop to RCU has obsoleted this API.
 
 o	Make the RCU_SOFTIRQ callback-invocation time limit also apply
 	to the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.
 
 o	Add a jiffies-based callback-invocation time limit to handle
 	long-running callbacks.  (The local_clock() function is only
 	invoked once per 32 callbacks due to its high overhead.)
 
 o	Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs,
 	which fixes a bug that can occur on systems with non-contiguous
 	CPU numbering.
 
 kvfree.2023.05.10a: kvfree_rcu updates
 
 o	Eliminate the single-argument variant of k[v]free_rcu() now
 	that all uses have been converted to k[v]free_rcu_mightsleep().
 
 o	Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks
 	too soon.  Yes, this is closing the barn door after the horse
 	has escaped, but Murphy says that there will be more horses.
 
 nocb.2023.05.11a: Callback-offloading updates
 
 o	Fix a number of bugs involving the shrinker and lazy callbacks.
 
 rcu-tasks.2023.05.10a: Tasks RCU updates
 
 torture.2023.05.15a: Torture-test updates
 
 rcu-urgent.2023.06.06a: Urgent SRCU fix (already pulled)
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmSUuukTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jLB5EACWArBYSbXh9kx6RP3LRkOd//fQWuqx
 z/RmHjMx3a2uIQpsbeAj+jrgHYzSOi7Afdnx2s0gUIWGjpF4d+e31eco9xTQtWIs
 A3/pXUlcTyaPXEZh5ro763UyBF/K003TAdo7EZAScTfDNp2knqGdEOyXTOXiAULX
 GH922kIqg0chbYaWocLY3g5mXeEm+kGY8GrDAB7/B3jHgoyylXzmSULDP4GQV7hw
 DkM0GOlc3TSzHonnNS6j1xboqY4HhWIDkBrD4Oh5P//ttMpb1b6gs1zEyjCQcNBe
 a6fnNF+0dUwANIZKroPn/L1uTGsEUhmLFkVK+XIuAit97yWI6t+aRH6TzHHYmkpu
 wVmLxv/FbJohP7ArWaI8l0gNl0vkli3ZgQXnRvSpCqIFR93AWVMeZsDTGOcLUdry
 AZEnuGXHnc9UB0KGOIras0o/EQezKq57JUV2bBZjl/GIDc3qiaJKnBhHysPc1iuE
 UfP052vCaoZxO3U/FrObQhjLZnstKBYHj8WolxMjIyNMlRIvDro6O1WG4+mjeLDP
 xdrjKGstsJh80CYDei+vJBXsbszhxv8yV4hCQX9JcDl3RjEqOOxgKUnAaP2mm02O
 MX33P3MZvSsHGoxkJpXDSlkQlbNqDBMIjZXbZLRF4o8fPhVmQU/4QlJN0iFOoXaQ
 1qqGrerEzfn0Jw==
 =3LCd
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:
 "Documentation updates

  Miscellaneous fixes, perhaps most notably:

   - Remove RCU_NONIDLE(). The new visibility of most of the idle loop
     to RCU has obsoleted this API.

   - Make the RCU_SOFTIRQ callback-invocation time limit also apply to
     the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.

   - Add a jiffies-based callback-invocation time limit to handle
     long-running callbacks. (The local_clock() function is only invoked
     once per 32 callbacks due to its high overhead.)

   - Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs, which
     fixes a bug that can occur on systems with non-contiguous CPU
     numbering.

  kvfree_rcu updates:

   - Eliminate the single-argument variant of k[v]free_rcu() now that
     all uses have been converted to k[v]free_rcu_mightsleep().

   - Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks too
     soon. Yes, this is closing the barn door after the horse has
     escaped, but Murphy says that there will be more horses.

  Callback-offloading updates:

   - Fix a number of bugs involving the shrinker and lazy callbacks.

  Tasks RCU updates

  Torture-test updates"

* tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (32 commits)
  torture: Remove duplicated argument -enable-kvm for ppc64
  doc/rcutorture: Add description of rcutorture.stall_cpu_block
  rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
  rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
  rcutorture: Correct name of use_softirq module parameter
  locktorture: Add long_hold to adjust lock-hold delays
  rcu/nocb: Make shrinker iterate only over NOCB CPUs
  rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
  rcu: Make rcu_cpu_starting() rely on interrupts being disabled
  rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
  rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
  rcu: Employ jiffies-based backstop to callback time limit
  rcu: Check callback-invocation time limit for rcuc kthreads
  rcu: Remove RCU_NONIDLE()
  rcu: Add more RCU files to kernel-api.rst
  rcu-tasks: Clarify the cblist_init_generic() function's pr_info() output
  rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
  rcu/nocb: Recheck lazy callbacks under the ->nocb_lock from shrinker
  rcu/nocb: Fix shrinker race against callback enqueuer
  rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
  ...
2023-06-27 10:37:01 -07:00
Jiri Olsa 5f81018753 fprobe: Release rethook after the ftrace_ops is unregistered
While running bpf selftests it's possible to get following fault:

  general protection fault, probably for non-canonical address \
  0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
  ...
  Call Trace:
   <TASK>
   fprobe_handler+0xc1/0x270
   ? __pfx_bpf_testmod_init+0x10/0x10
   ? __pfx_bpf_testmod_init+0x10/0x10
   ? bpf_fentry_test1+0x5/0x10
   ? bpf_fentry_test1+0x5/0x10
   ? bpf_testmod_init+0x22/0x80
   ? do_one_initcall+0x63/0x2e0
   ? rcu_is_watching+0xd/0x40
   ? kmalloc_trace+0xaf/0xc0
   ? do_init_module+0x60/0x250
   ? __do_sys_finit_module+0xac/0x120
   ? do_syscall_64+0x37/0x90
   ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
   </TASK>

In unregister_fprobe function we can't release fp->rethook while it's
possible there are some of its users still running on another cpu.

Moving rethook_free call after fp->ops is unregistered with
unregister_ftrace_function call.

Link: https://lore.kernel.org/all/20230615115236.3476617-1-jolsa@kernel.org/

Fixes: 5b0ab78998 ("fprobe: Add exit_handler support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-27 23:38:31 +09:00
Linus Torvalds 40e8e98f51 Power management updates for 6.5-rc1
- Introduce power capping core support for Intel TPMI (Topology Aware
    Register and PM Capsule Interface) and a TPMI interface driver for
    Intel RAPL (Zhang Rui, Dan Carpenter).
 
  - Fix CONFIG_IOSF_MBI dependency in the Intel RAPL power capping
    driver (Zhang Rui).
 
  - Fix invalid initialization for pl4_supported field in the Intel RAPL
    power capping driver (Sumeet Pawnikar).
 
  - Clean up the intel_idle driver, make it work with VM guests that
    cannot use the MWAIT instruction and address the case in which the
    host may enter a deep idle state when the guest is idle (Arjan van
    de Ven).
 
  - Prevent cpufreq drivers that provide the ->adjust_perf() callback
    without a ->fast_switch() one which is used as a fallback from the
    former in some cases (Wyes Karny).
 
  - Fix some issues related to the AMD P-state cpufreq driver (Mario
    Limonciello, Wyes Karny).
 
  - Fix the energy_performance_preference attribute handling in the
    intel_pstate driver in passive mode (Tero Kristo).
 
  - Fix the handling of pm_suspend_target_state when CONFIG_PM is unset
    (Kai-Heng Feng).
 
  - Correct spelling mistake in a comment in the hibernation code (Wang
    Honghui).
 
  - Add arch_resume_nosmt() prototype to avoid a "missing prototypes"
    build warning (Arnd Bergmann).
 
  - Restrict pm_pr_dbg() to system-wide power transitions and use it in
    a few additional places (Mario Limonciello).
 
  - Drop verification of in-params from genpd_add_device() and ensure
    that all of its callers will do it (Ulf Hansson).
 
  - Prevent possible integer overflows from occurring in
    genpd_parse_state() (Nikita Zhandarovich).
 
  - Reorder fieldls in 'struct devfreq_dev_status' to reduce its size
    somewhat (Christophe JAILLET).
 
  - Ensure that the Exynos PPMU driver is already loaded before the
    Exynos Bus driver starts probing so as to avoid a possible freeze
    loading of the kernel modules (Marek Szyprowski).
 
  - Fix variable deferencing before NULL check in the mtk-cci devfreq
    driver (Sukrut Bellary).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmSZwPASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxu44P/AouvVFDMt+eE76nPfNc10X1lswS0vrT
 X7LmylSDPyWiuz6p8MWGXB6T2nQ+3DbvSfVyBJ960ymkBnE/F9me8o3wB8eGbd6z
 ZvOD8+wVPXS4Cq8gUxy2zV1ul+o5IwwT20cYC6mWjasvByl13vTevN5d6ZQ9o6hS
 1hAQQDd6JjsdLIUyU0EbE4aD+l4h96o45IFxbV86qVH77ywa6VMNdulRKmDcONj3
 kM7jHFYL4xl0TfMjHp4IhGWXK32qGYgX1zYTOU5kSc11IExJfVzQcL2uQ9A0KSLp
 RJ0c93loUsHdMhenNkN4nSBFWBIaftKDLbS+5Ubt0DBuNN7kxWivEVts4DM/wxuB
 72PNl5h8YglcW7LHH2IXb/6HEerzbj42+6y459o+M0DcNTq18gu19OQTK5IGtRrQ
 Yf6+5BhgLR3R1REg0eaBg6njtGq0f5fmW7Iqo52eA8cXhHU0MTDJE1p6ytfN40gH
 ViA+T8HB6Mh91lWHVftbwo3wONHGcfJy+S2hGM45V5LKEGeKHILgmw0nUzO7epWE
 7VIPKGzkVd7h/Drk7X3nQR3DJFA/x5eNhjxt5LZD83cVVg34SS3ST5oH13FI9S7Q
 8zwG5KoHTDrmYug3sxQ+Q9pq8MnOl0ZbgqlVfwyiWjKYmNMbg4elsQG/2zD3t0kv
 y8zXbr7Kr4QB
 =SZV7
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add Intel TPMI (Topology Aware Register and PM Capsule
  Interface) support to the power capping subsystem, extend the
  intel_idle driver to work in VM guests where MWAIT is not available,
  extend the system-wide power management diagnostics, fix bugs and
  clean up code.

  Specifics:

   - Introduce power capping core support for Intel TPMI (Topology Aware
     Register and PM Capsule Interface) and a TPMI interface driver for
     Intel RAPL (Zhang Rui, Dan Carpenter)

   - Fix CONFIG_IOSF_MBI dependency in the Intel RAPL power capping
     driver (Zhang Rui)

   - Fix invalid initialization for pl4_supported field in the Intel
     RAPL power capping driver (Sumeet Pawnikar)

   - Clean up the intel_idle driver, make it work with VM guests that
     cannot use the MWAIT instruction and address the case in which the
     host may enter a deep idle state when the guest is idle (Arjan van
     de Ven)

   - Prevent cpufreq drivers that provide the ->adjust_perf() callback
     without a ->fast_switch() one which is used as a fallback from the
     former in some cases (Wyes Karny)

   - Fix some issues related to the AMD P-state cpufreq driver (Mario
     Limonciello, Wyes Karny)

   - Fix the energy_performance_preference attribute handling in the
     intel_pstate driver in passive mode (Tero Kristo)

   - Fix the handling of pm_suspend_target_state when CONFIG_PM is unset
     (Kai-Heng Feng)

   - Correct spelling mistake in a comment in the hibernation code (Wang
     Honghui)

   - Add arch_resume_nosmt() prototype to avoid a "missing prototypes"
     build warning (Arnd Bergmann)

   - Restrict pm_pr_dbg() to system-wide power transitions and use it in
     a few additional places (Mario Limonciello)

   - Drop verification of in-params from genpd_add_device() and ensure
     that all of its callers will do it (Ulf Hansson)

   - Prevent possible integer overflows from occurring in
     genpd_parse_state() (Nikita Zhandarovich)

   - Reorder fieldls in 'struct devfreq_dev_status' to reduce its size
     somewhat (Christophe JAILLET)

   - Ensure that the Exynos PPMU driver is already loaded before the
     Exynos Bus driver starts probing so as to avoid a possible freeze
     loading of the kernel modules (Marek Szyprowski)

   - Fix variable deferencing before NULL check in the mtk-cci devfreq
     driver (Sukrut Bellary)"

* tag 'pm-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (42 commits)
  intel_idle: Add a "Long HLT" C1 state for the VM guest mode
  cpufreq: intel_pstate: Fix energy_performance_preference for passive
  cpufreq: amd-pstate: Add a kernel config option to set default mode
  cpufreq: amd-pstate: Set a fallback policy based on preferred_profile
  ACPI: CPPC: Add definition for undefined FADT preferred PM profile value
  cpufreq: amd-pstate: Set default governor to schedutil
  PM: domains: Move the verification of in-params from genpd_add_device()
  cpufreq: amd-pstate: Make amd-pstate EPP driver name hyphenated
  cpufreq: amd-pstate: Write CPPC enable bit per-socket
  intel_idle: Add support for using intel_idle in a VM guest using just hlt
  cpufreq: Fail driver register if it has adjust_perf without fast_switch
  intel_idle: clean up the (new) state_update_enter_method function
  intel_idle: refactor state->enter manipulation into its own function
  platform/x86/amd: pmc: Use pm_pr_dbg() for suspend related messages
  pinctrl: amd: Use pm_pr_dbg to show debugging messages
  ACPI: x86: Add pm_debug_messages for LPS0 _DSM state tracking
  include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume
  powercap: RAPL: Fix a NULL vs IS_ERR() bug
  powercap: RAPL: Fix CONFIG_IOSF_MBI dependency
  powercap: RAPL: fix invalid initialization for pl4_supported field
  ...
2023-06-26 19:36:30 -07:00
Linus Torvalds 19300488c9 - Address -Wmissing-prototype warnings
- Remove repeated 'the' in comments
  - Remove unused current_untag_mask()
  - Document urgent tip branch timing
  - Clean up MSR kernel-doc notation
  - Clean up paravirt_ops doc
  - Update Srivatsa S. Bhat's maintained areas
  - Remove unused extern declaration acpi_copy_wakeup_routine()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ6xIACgkQaDWVMHDJ
 krB9aQ/+NjB4CiWLbrnOYj9QYG6p1GE7lfu2dzIDdmcNuiai8htopXys54Igy3Rq
 BbIoW4E0SGK5E2OD7nLe4fBA/LpsYZTwDhGUu3SiovxLOoC5qkF0Q+6aVypPJE5o
 q7kn0Eo9IDL1dO0EbJptFDJRjk3K5caEoyXJRelarjIfPRbDEhUFaybVRykMZN9I
 4AOxrlb9WFggT4gUE4+N0kWyEqdgI9/aguavmasaG4lBHZ5JAHNQPNIa8bkVSAPL
 wULAzsrGp96V3tVxdjDCzD9aumk4xlJq7gk+v7mfx013dg7Cjs074Xoi2Y+TmaC7
 fdIZiGPJIkNToW+nENVO7BYtACSQhXeVTGxLQO/HNTDc//ZWiIUoJT2U4qu/6e6F
 aAIGoLwv68H4BghS2qx6Gz+BTIfl35mcPUb75MQhu+D84QZoZWrdamCYhsvHeZzc
 uC3nojrb6PBOth9nJsRae+j1zpRe/DT2LvHSWPJgK6EygOAi05ZfYUll/6sb0vze
 IXkUrVV1BvDDVpY9/HnE8RpDCDolP0/ezK9zsw48arZtkc+Qmw2WlD/2D98E+pSb
 MJPelbVmpzWTaoR4jDzXJCXkWe7CQJ5uPQj5azAE9l7YvnxgCQP5xnm5sLU9eyLu
 RsOwRzss0+3z44x5rJi9nSxQJ0LHfTAzW8/ZmNSZGHzi0ClszK0=
 =N82i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Dave Hansen:
 "As usual, these are all over the map. The biggest cluster is work from
  Arnd to eliminate -Wmissing-prototype warnings:

   - Address -Wmissing-prototype warnings

   - Remove repeated 'the' in comments

   - Remove unused current_untag_mask()

   - Document urgent tip branch timing

   - Clean up MSR kernel-doc notation

   - Clean up paravirt_ops doc

   - Update Srivatsa S. Bhat's maintained areas

   - Remove unused extern declaration acpi_copy_wakeup_routine()"

* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
  Documentation: virt: Clean up paravirt_ops doc
  x86/mm: Remove unused current_untag_mask()
  x86/mm: Remove repeated word in comments
  x86/lib/msr: Clean up kernel-doc notation
  x86/platform: Avoid missing-prototype warnings for OLPC
  x86/mm: Add early_memremap_pgprot_adjust() prototype
  x86/usercopy: Include arch_wb_cache_pmem() declaration
  x86/vdso: Include vdso/processor.h
  x86/mce: Add copy_mc_fragile_handle_tail() prototype
  x86/fbdev: Include asm/fb.h as needed
  x86/hibernate: Declare global functions in suspend.h
  x86/entry: Add do_SYSENTER_32() prototype
  x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
  x86/mm: Include asm/numa.h for set_highmem_pages_init()
  x86: Avoid missing-prototype warnings for doublefault code
  x86/fpu: Include asm/fpu/regset.h
  x86: Add dummy prototype for mk_early_pgtbl_32()
  x86/pci: Mark local functions as 'static'
  x86/ftrace: Move prepare_ftrace_return prototype to header
  ...
2023-06-26 16:43:54 -07:00
Linus Torvalds cd336f6562 Time, timekeeping and related device driver updates:
- Core:
 
    - A set of fixes, cleanups and enhancements to the posix timer code:
 
      - Prevent another possible live lock scenario in the exit() path,
        which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures.
 
      - Fix a loop termination issue which was reported syzcaller/KSAN in
        the posix timer ID allocation code.
 
        That triggered a deeper look into the posix-timer code which
        unearthed more small issues.
 
      - Add missing READ/WRITE_ONCE() annotations
 
      - Fix or remove completely outdated comments
 
      - Document places which are subtle and completely undocumented.
 
    - Add missing hrtimer modes to the trace event decoder
 
    - Small cleanups and enhancements all over the place
 
  - Drivers:
 
      - Rework the Hyper-V clocksource and sched clock setup code
 
      - Remove a deprecated clocksource driver
 
      - Small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZctYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQqpEACAzSDKH7lpWFXwXMR0j6GKi5erZEYg
 I0PtvK70+zV0Fk2DOXplxDIis3qtYPSinSEK5Kzycyf+MNOuWKaB8//4PsCbD6aR
 3DWWi5xUGAOkmtFQMlmQBKahDcfFhSTN7GeYYcTd5TaQIwVPjb+Qh9XuOG5d/O0q
 66jeiYRkiOqTwOM8jZqWOWeKOt56xd9BmCvSdNbnAbZZEjUNAFT7LN6Oux2I91BU
 VUh1luoKPPKRFQN07oWaBKg/V7Iib10SCejDmAd6QKZQg1A/UulJl0WBOtRYr3RG
 81b05dG2Ulp2ygm5YuRWtkpIC6pcFKjhh6WzDio0do6aOtWHOn5oefqJqUmufM9K
 h6WRRmGecoSvon1euzciy/ArzzoI0fSHYtB2cgBaBS7ImGb+7hDk0RkNota4alLG
 gfn98Rufqx/FXHFUJeHxoZTQbW1PUoU0VIF1r/nmSwDRJsxmqPyCW+52/TOjnSo1
 cvrTflAu/JYazhggsIpOCyVlnaiXZnfGUdbvnzlhaB1vQ8M4X+aq48b1sPU9XawN
 VB9WDdh8Ba6w8ebALjM0apNaLYLq71P9dzs5dHsmjMkqx2rA+Kafc/jIu37h6ZEp
 RBFDcI/WAPnp6lS6w2v0F852xBzIJe4zbTIrUivuVxcTo5Rh8iW0AexmHFN2PN4N
 MGyyJHu8bMdIww==
 =hRV9
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Time, timekeeping and related device driver updates:

  Core:

   - A set of fixes, cleanups and enhancements to the posix timer code:

       - Prevent another possible live lock scenario in the exit() path,
         which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures.

       - Fix a loop termination issue which was reported syzcaller/KSAN
         in the posix timer ID allocation code.

         That triggered a deeper look into the posix-timer code which
         unearthed more small issues.

       - Add missing READ/WRITE_ONCE() annotations

       - Fix or remove completely outdated comments

       - Document places which are subtle and completely undocumented.

   - Add missing hrtimer modes to the trace event decoder

   - Small cleanups and enhancements all over the place

  Drivers:

   - Rework the Hyper-V clocksource and sched clock setup code

   - Remove a deprecated clocksource driver

   - Small fixes and enhancements all over the place"

* tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
  dt-bindings: timers: Add Ralink SoCs timer
  clocksource/drivers/hyper-v: Rework clocksource and sched clock setup
  dt-bindings: timer: brcm,kona-timer: convert to YAML
  clocksource/drivers/imx-gpt: Fold <soc/imx/timer.h> into its only user
  clk: imx: Drop inclusion of unused header <soc/imx/timer.h>
  hrtimer: Add missing sparse annotations to hrtimer locking
  clocksource/drivers/imx-gpt: Use only a single name for functions
  clocksource/drivers/loongson1: Move PWM timer to clocksource framework
  dt-bindings: timer: Add Loongson-1 clocksource
  MIPS: Loongson32: Remove deprecated PWM timer clocksource
  clocksource/drivers/ingenic-timer: Use pm_sleep_ptr() macro
  tracing/timer: Add missing hrtimer modes to decode_hrtimer_mode().
  posix-timers: Add sys_ni_posix_timers() prototype
  tick/rcu: Fix bogus ratelimit condition
  alarmtimer: Remove unnecessary (void *) cast
  alarmtimer: Remove unnecessary initialization of variable 'ret'
  posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS
  posix-timers: Polish coding style in a few places
  posix-timers: Remove pointless comments
  ...
2023-06-26 14:10:45 -07:00
Linus Torvalds 9244724fbf A large update for SMP management:
- Parallel CPU bringup
 
     The reason why people are interested in parallel bringup is to shorten
     the (kexec) reboot time of cloud servers to reduce the downtime of the
     VM tenants.
 
     The current fully serialized bringup does the following per AP:
 
       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state
 
     There are two significant delays:
 
       #3 The time for an AP to report alive state in start_secondary() on
          x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.
 
       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending on
          the microcode patch size to apply.
 
     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to come
     up and apply microcode. That's more than 80% of the actual onlining
     procedure.
 
     This can be reduced significantly by splitting the bringup mechanism
     into two parts:
 
       1) Run the prepare callbacks and kick the AP alive for each AP which
       	 needs to be brought up.
 
 	 The APs wake up, do their firmware initialization and run the low
       	 level kernel startup code including microcode loading in parallel
       	 up to the first synchronization point. (#1 and #2 above)
 
       2) Run the rest of the bringup code strictly serialized per CPU
       	 (#3 - #5 above) as it's done today.
 
 	 Parallelizing that stage of the CPU bringup might be possible in
 	 theory, but it's questionable whether required surgery would be
 	 justified for a pretty small gain.
 
     If the system is large enough the first AP is already waiting at the
     first synchronization point when the boot CPU finished the wake-up of
     the last AP. That reduces the AP bringup time on that SKL from ~800ms
     to ~80ms, i.e. by a factor ~10x.
 
     The actual gain varies wildly depending on the system, CPU, microcode
     patch size and other factors. There are some opportunities to reduce
     the overhead further, but that needs some deep surgery in the x86 CPU
     bringup code.
 
     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.
 
   - Enhancements for SMP function call tracing so it is possible to locate
     the scheduling and the actual execution points. That allows to measure
     IPI delivery time precisely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
 FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
 zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
 wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
 K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
 a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
 dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
 TRiSzvssbYYmaw==
 =Y8if
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:
 "A large update for SMP management:

   - Parallel CPU bringup

     The reason why people are interested in parallel bringup is to
     shorten the (kexec) reboot time of cloud servers to reduce the
     downtime of the VM tenants.

     The current fully serialized bringup does the following per AP:

       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state

     There are two significant delays:

       #3 The time for an AP to report alive state in start_secondary()
          on x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.

       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending
          on the microcode patch size to apply.

     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to
     come up and apply microcode. That's more than 80% of the actual
     onlining procedure.

     This can be reduced significantly by splitting the bringup
     mechanism into two parts:

       1) Run the prepare callbacks and kick the AP alive for each AP
          which needs to be brought up.

          The APs wake up, do their firmware initialization and run the
          low level kernel startup code including microcode loading in
          parallel up to the first synchronization point. (#1 and #2
          above)

       2) Run the rest of the bringup code strictly serialized per CPU
          (#3 - #5 above) as it's done today.

          Parallelizing that stage of the CPU bringup might be possible
          in theory, but it's questionable whether required surgery
          would be justified for a pretty small gain.

     If the system is large enough the first AP is already waiting at
     the first synchronization point when the boot CPU finished the
     wake-up of the last AP. That reduces the AP bringup time on that
     SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

     The actual gain varies wildly depending on the system, CPU,
     microcode patch size and other factors. There are some
     opportunities to reduce the overhead further, but that needs some
     deep surgery in the x86 CPU bringup code.

     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.

   - Enhancements for SMP function call tracing so it is possible to
     locate the scheduling and the actual execution points. That allows
     to measure IPI delivery time precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
2023-06-26 13:59:56 -07:00
Linus Torvalds 0017387938 Updates for the interrupt subsystem:
- Core:
 
    - Convert the interrupt descriptor storage to a maple tree to overcome
      the limitations of the radixtree + fixed size bitmap. This allows to
      handle real large servers with a huge number of guests without
      imposing a huge memory overhead on everyone.
 
    - Implement optional retriggering of interrupts which utilize the
      fasteoi handler to work around a GICv3 architecture issue.
 
  - Drivers:
 
    - A set of fixes and updates for the Loongson/Loongarch related drivers.
 
    - Workaound for an ASR8601 integration hickup which ends up with CPU
      numbering which can't be represented in the GIC implementation.
 
    - The usual set of boring fixes and updates all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZaf0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUgaD/9PwYvqeR12oJRz24gso6NNxlZ2nZMh
 KIApeIV4eoDPjM9Qdc38Tz+LbiClZuhiNRmxqzkaKmLsNObeYJhNvRg14bQA/Mfy
 t1kqO2rlNTSeRR5Y0XiQqFMIKCcpMQeKXzJ+ZQspiX08kCSl9UqBKpE5HgbTVFiB
 yTwdtagi8zrDr8KuETe+REKcwvoLippHrnz6evVMOXtN6Jdtz2maZT9dVDAvaVl7
 pXgarzMScEFTfK8Q6wjH9ayC1UXPmSIIiirWZHYvtaAXh4/IY1U1LY4KqkVPQ1MB
 7thv4CbE/Iyzw78FUMtrsMwqOV/fu71SfBh9uV6kFxoySFJ/gJ8QLOcAqkbNGyBf
 9oRWuuY0LJZl1AKtmU6jNaS17JeOpdIdB44cAXBArYMbJUStZ2Mo2EDdw+/IHNzM
 tt32+Pjtg8BVrFLcR7gQ5rzAktz6678x9Qk6ys+KUCG3tuFyKx6RiD+f0DARe1Td
 DflNoJ6WTqwoimvTokAg6QGPUyHKJLe29ciSuUjHXaHJAE9xyeGtfJQWNLwpjejD
 KYYo5mb8cJc917Yx8LUOj02jVtebQtLezDtnUyGXrIR+ze4ZUQxhgvSKRDxX7E56
 CjG3ghx6Ty1sTpjL4dHtXLJ1NgitFyjJ7VQlVqxWNQBNI+m3l2zmxj4zB9eI6v1R
 qyjKEgnFi60vSw==
 =qKCo
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt subsystem:

  Core:

   - Convert the interrupt descriptor storage to a maple tree to
     overcome the limitations of the radixtree + fixed size bitmap.

     This allows us to handle very large servers with a huge number of
     guests without imposing a huge memory overhead on everyone

   - Implement optional retriggering of interrupts which utilize the
     fasteoi handler to work around a GICv3 architecture issue

  Drivers:

   - A set of fixes and updates for the Loongson/Loongarch related
     drivers

   - Workaound for an ASR8601 integration hickup which ends up with CPU
     numbering which can't be represented in the GIC implementation

   - The usual set of boring fixes and updates all over the place"

* tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  Revert "irqchip/mxs: Include linux/irqchip/mxs.h"
  irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
  irqchip/stm32-exti: Fix warning on initialized field overwritten
  irqchip/stm32-exti: Add STM32MP15xx IWDG2 EXTI to GIC map
  irqchip/gicv3: Add a iort_pmsi_get_dev_id() prototype
  irqchip/mxs: Include linux/irqchip/mxs.h
  irqchip/clps711x: Remove unused clps711x_intc_init() function
  irqchip/mmp: Remove non-DT codepath
  irqchip/ftintc010: Mark all function static
  irqdomain: Include internals.h for function prototypes
  irqchip/loongson-eiointc: Add DT init support
  dt-bindings: interrupt-controller: Add Loongson EIOINTC
  irqchip/loongson-eiointc: Fix irq affinity setting during resume
  irqchip/loongson-liointc: Add IRQCHIP_SKIP_SET_WAKE flag
  irqchip/loongson-liointc: Fix IRQ trigger polarity
  irqchip/loongson-pch-pic: Fix potential incorrect hwirq assignment
  irqchip/loongson-pch-pic: Fix initialization of HT vector register
  irqchip/gic-v3-its: Enable RESEND_WHEN_IN_PROGRESS for LPIs
  genirq: Allow fasteoi handler to resend interrupts on concurrent handling
  genirq: Expand doc for PENDING and REPLAY flags
  ...
2023-06-26 13:34:39 -07:00
Linus Torvalds a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
Linus Torvalds 3eccc0c886 for-6.5/splice-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
 J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
 Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
 AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
 sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
 ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
 tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
 36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
 J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
 j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
 maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
 M3VxWD3GZQ==
 =KhW4
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux

Pull splice updates from Jens Axboe:
 "This kills off ITER_PIPE to avoid a race between truncate,
  iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
  with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
  memory corruption.

  Instead, we either use (a) filemap_splice_read(), which invokes the
  buffered file reading code and splices from the pagecache into the
  pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
  into it and then pushes the filled pages into the pipe; or (c) handle
  it in filesystem-specific code.

  Summary:

   - Rename direct_splice_read() to copy_splice_read()

   - Simplify the calculations for the number of pages to be reclaimed
     in copy_splice_read()

   - Turn do_splice_to() into a helper, vfs_splice_read(), so that it
     can be used by overlayfs and coda to perform the checks on the
     lower fs

   - Make vfs_splice_read() jump to copy_splice_read() to handle
     direct-I/O and DAX

   - Provide shmem with its own splice_read to handle non-existent pages
     in the pagecache. We don't want a ->read_folio() as we don't want
     to populate holes, but filemap_get_pages() requires it

   - Provide overlayfs with its own splice_read to call down to a lower
     layer as overlayfs doesn't provide ->read_folio()

   - Provide coda with its own splice_read to call down to a lower layer
     as coda doesn't provide ->read_folio()

   - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
     and random files as they just copy to the output buffer and don't
     splice pages

   - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
     ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation

   - Make cifs use filemap_splice_read()

   - Replace pointers to generic_file_splice_read() with pointers to
     filemap_splice_read() as DIO and DAX are handled in the caller;
     filesystems can still provide their own alternate ->splice_read()
     op

   - Remove generic_file_splice_read()

   - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
     was the only user"

* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
  splice: kdoc for filemap_splice_read() and copy_splice_read()
  iov_iter: Kill ITER_PIPE
  splice: Remove generic_file_splice_read()
  splice: Use filemap_splice_read() instead of generic_file_splice_read()
  cifs: Use filemap_splice_read()
  trace: Convert trace/seq to use copy_splice_read()
  zonefs: Provide a splice-read wrapper
  xfs: Provide a splice-read wrapper
  orangefs: Provide a splice-read wrapper
  ocfs2: Provide a splice-read wrapper
  ntfs3: Provide a splice-read wrapper
  nfs: Provide a splice-read wrapper
  f2fs: Provide a splice-read wrapper
  ext4: Provide a splice-read wrapper
  ecryptfs: Provide a splice-read wrapper
  ceph: Provide a splice-read wrapper
  afs: Provide a splice-read wrapper
  9p: Add splice_read wrapper
  net: Make sock_splice_read() use copy_splice_read() by default
  tty, proc, kernfs, random: Use copy_splice_read()
  ...
2023-06-26 11:52:12 -07:00
Linus Torvalds 64bf6ae93e v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
 ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
 =zdfx
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Miscellaneous features, cleanups, and fixes for vfs and individual fs

  Features:

   - Use mode 0600 for file created by cachefilesd so it can be run by
     unprivileged users. This aligns them with directories which are
     already created with mode 0700 by cachefilesd

   - Reorder a few members in struct file to prevent some false sharing
     scenarios

   - Indicate that an eventfd is used a semaphore in the eventfd's
     fdinfo procfs file

   - Add a missing uapi header for eventfd exposing relevant uapi
     defines

   - Let the VFS protect transitions of a superblock from read-only to
     read-write in addition to the protection it already provides for
     transitions from read-write to read-only. Protecting read-only to
     read-write transitions allows filesystems such as ext4 to perform
     internal writes, keeping writers away until the transition is
     completed

  Cleanups:

   - Arnd removed the architecture specific arch_report_meminfo()
     prototypes and added a generic one into procfs.h. Note, we got a
     report about a warning in amdpgpu codepaths that suggested this was
     bisectable to this change but we concluded it was a false positive

   - Remove unused parameters from split_fs_names()

   - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
     reflect the order of the cleanup operation that has to unmap before
     the actual put

   - Unexport buffer_check_dirty_writeback() as it is not used outside
     of block device aops

   - Stop allocating aio rings from highmem

   - Protecting read-{only,write} transitions in the VFS used open-coded
     barriers in various places. Replace them with proper little helpers
     and document both the helpers and all barrier interactions involved
     when transitioning between read-{only,write} states

   - Use flexible array members in old readdir codepaths

  Fixes:

   - Use the correct type __poll_t for epoll and eventfd

   - Replace all deprecated strlcpy() invocations, whose return value
     isn't checked with an equivalent strscpy() call

   - Fix some kernel-doc warnings in fs/open.c

   - Reduce the stack usage in jffs2's xattr codepaths finally getting
     rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
     bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
     royally annoying compilation warning

   - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
     fmode_t is required to avoid fmode_t to integer degradation
     warnings

   - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
     explanation in that commit how O_RDWR is actually a bug which we
     found out with the help of Linus and git archeology

   - Fix "no previous prototype" warnings in the pipe codepaths

   - Add overflow calculations for remap_verify_area() as a signed
     addition overflow could be triggered in xfstests

   - Fix a null pointer dereference in sysv

   - Use an unsigned variable for length calculations in jfs avoiding
     compilation warnings with gcc 13

   - Fix a dangling pipe pointer in the watch queue codepath

   - The legacy mount option parser provided as a fallback by the VFS
     for filesystems not yet converted to the new mount api did prefix
     the generated mount option string with a leading ',' causing issues
     for some filesystems

   - Fix a repeated word in a comment in fs.h

   - autofs: Update the ctime when mtime is updated as mandated by
     POSIX"

* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  readdir: Replace one-element arrays with flexible-array members
  fs: Provide helpers for manipulating sb->s_readonly_remount
  fs: Protect reconfiguration of sb read-write from racing writes
  eventfd: add a uapi header for eventfd userspace APIs
  autofs: set ctime as well when mtime changes on a dir
  eventfd: show the EFD_SEMAPHORE flag in fdinfo
  fs/aio: Stop allocating aio rings from HIGHMEM
  fs: Fix comment typo
  fs: unexport buffer_check_dirty_writeback
  fs: avoid empty option when generating legacy mount string
  watch_queue: prevent dangling pipe pointer
  fs.h: Optimize file struct to prevent false sharing
  highmem: Rename put_and_unmap_page() to unmap_and_put_page()
  cachefiles: Allow the cache to be non-root
  init: remove unused names parameter in split_fs_names()
  jfs: Use unsigned variable for length calculations
  fs/sysv: Null check to prevent null-ptr-deref bug
  fs: use UB-safe check for signed addition overflow in remap_verify_area
  procfs: consolidate arch_report_meminfo declaration
  fs: pipe: reveal missing function protoypes
  ...
2023-06-26 09:50:21 -07:00
Rafael J. Wysocki 9b8f36398e Merge branches 'pm-sleep' and 'pm-domains'
Merge updates related to system-wide power management and generic power
domains (genpd) updates for 6.5-rc1:

 - Fix the handling of pm_suspend_target_state when CONFIG_PM is unset
   (Kai-Heng Feng).

 - Correct spelling mistake in a comment in the hibernation code (Wang
   Honghui).

 - Add arch_resume_nosmt() prototype to avoid a "missing prototypes"
   build warning (Arnd Bergmann).

 - Restrict pm_pr_dbg() to system-wide power transitions and use it in
   a few additional places (Mario Limonciello).

 - Drop verification of in-params from genpd_add_device() and ensure
   that all of its callers will do it (Ulf Hansson).

 - Prevent possible integer overflows from occurring in
   genpd_parse_state() (Nikita Zhandarovich).

* pm-sleep:
  platform/x86/amd: pmc: Use pm_pr_dbg() for suspend related messages
  pinctrl: amd: Use pm_pr_dbg to show debugging messages
  ACPI: x86: Add pm_debug_messages for LPS0 _DSM state tracking
  include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume
  PM: suspend: add a arch_resume_nosmt() prototype
  PM: hibernate: Correct spelling mistake in a comment
  PM: suspend: Fix pm_suspend_target_state handling for !CONFIG_PM

* pm-domains:
  PM: domains: Move the verification of in-params from genpd_add_device()
  PM: domains: fix integer overflow issues in genpd_parse_state()
2023-06-26 17:44:50 +02:00
Thomas Gleixner f121ab7f4a irqchip updates for 6.5
- A number of Loogson/Loogarch fixes
 
 - Allow the core code to retrigger an interrupt that has
   fired while the same interrupt is being handled on another
   CPU, papering over a GICv3 architecture issue
 
 - Work around an integration problem on ASR8601, where the CPU
   numbering isn't representable in the GIC implementation...
 
 - Add some missing interrupt to the STM32 irqchip
 
 - A bunch of warning squashing triggered by W=1 builds
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmSWHlEPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDgDcP/AyuNjjt9ybZ8LT5hGYRBKTfO3bsssiV/ZIE
 NRbfwK6nWTurA3tN9XypYouB3HfKvTvDYHzeo9Yi2y51GI20o7noSfVNHHld5dAE
 WTBYcfc6n+1KqGNA1tPVeE403jd50TnmOmFaXhN1Vn/KQVQKs+LmjoVdS7HXKmwu
 PLk1tbFSGh5+vgLQTlhze5Wz7nU59lJ5WfCJ8negYls/5l3FseGgxEuffWKytElJ
 I6PbnRyd9y7R/KYmkobFjS98cjeNznAxJEm/2InVpcb3IwUSEyJUHgNwbs6KSeYf
 4BdVyxSAQlFPgxO17UKQxblYu/HuFY/NWghkg7dn6ZgVubj52rtrl9txUId57HGC
 pYbSVm0bMevqlgHDYHold+Q4I6DqltyHKgiA9UfU9IQj4MV/KXbDjtng/ih4ZkHe
 rw+BnckyrhZ5K5p3nDlzS/TNQT1baKqNFgx/O5HPV27NT8ARynqoFxNUzA0z37v3
 x5dj34kiIW2kAGJDOP9+9jj2uF4yX0MKdNxdMs8PM9hYAsO0AqItNsXEQoUuzJF+
 u1hhGZ1H6HYmY2WphXD57fnIR/7Hm6Iz00XC17+1jUP+wsefhAbreOyJIbbcpENF
 IIxrcIxAPl3PQPsj8n98/9fKQ6u/QfrvWUH+6aoFQH/Np7RTFWmDh8bqucjuXsGS
 8gxR3REr
 =quHC
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

  - A number of Loogson/Loogarch fixes

  - Allow the core code to retrigger an interrupt that has
    fired while the same interrupt is being handled on another
    CPU, papering over a GICv3 architecture issue

  - Work around an integration problem on ASR8601, where the CPU
    numbering isn't representable in the GIC implementation...

  - Add some missing interrupt to the STM32 irqchip

  - A bunch of warning squashing triggered by W=1 builds

Link: https://lore.kernel.org/r/20230623224345.3577134-1-maz@kernel.org
2023-06-26 11:05:49 +02:00
Dan Williams d2f9fe6953 Merge branch 'for-6.5/cxl-perf' into for-6.5/cxl
Pick up initial support for the CXL 3.0 performance monitoring
definition. Small conflicts with the firmware update work as they both
placed their init code in the same location.
2023-06-25 17:53:18 -07:00
Jakub Kicinski a685d0df75 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
 g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
 TMaNUiHABijAFtChxoSI35j3OOSRrwE=
 =GJgN
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-06-23

We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).

The main changes are:

1) Extend bpf_fib_lookup helper to allow passing the route table ID,
   from Louis DeLosSantos.

2) Fix regsafe() in verifier to call check_ids() for scalar registers,
   from Eduard Zingerman.

3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
   and a rework of bpf_cpumask_any*() kfuncs. Additionally,
   add selftests, from David Vernet.

4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
   from Gilad Sever.

5) Change bpf_link_put() to use workqueue unconditionally to fix it
   under PREEMPT_RT, from Sebastian Andrzej Siewior.

6) Follow-ups to address issues in the bpf_refcount shared ownership
   implementation, from Dave Marchevsky.

7) A few general refactorings to BPF map and program creation permissions
   checks which were part of the BPF token series, from Andrii Nakryiko.

8) Various fixes for benchmark framework and add a new benchmark
   for BPF memory allocator to BPF selftests, from Hou Tao.

9) Documentation improvements around iterators and trusted pointers,
   from Anton Protopopov.

10) Small cleanup in verifier to improve allocated object check,
    from Daniel T. Lee.

11) Improve performance of bpf_xdp_pointer() by avoiding access
    to shared_info when XDP packet does not have frags,
    from Jesper Dangaard Brouer.

12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
    from Yonghong Song.

13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
    from Jarkko Sakkinen.

14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  bpf, docs: Document existing macros instead of deprecated
  bpf, docs: BPF Iterator Document
  selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
  selftests/bpf: Add vrf_socket_lookup tests
  bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
  bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
  bpf: Factor out socket lookup functions for the TC hookpoint.
  selftests/bpf: Set the default value of consumer_cnt as 0
  selftests/bpf: Ensure that next_cpu() returns a valid CPU number
  selftests/bpf: Output the correct error code for pthread APIs
  selftests/bpf: Use producer_cnt to allocate local counter array
  xsk: Remove unused inline function xsk_buff_discard()
  bpf: Keep BPF_PROG_LOAD permission checks clear of validations
  bpf: Centralize permissions checks for all BPF map types
  bpf: Inline map creation logic in map_create() function
  bpf: Move unprivileged checks into map_create() and bpf_prog_load()
  bpf: Remove in_atomic() from bpf_link_put().
  selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
  bpf: Verify scalar ids mapping in regsafe() using check_ids()
  selftests/bpf: Check if mark_chain_precision() follows scalar ids
  ...
====================

Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 14:52:28 -07:00
Yang Li 4afc9a402a kernel/time/posix-stubs.c: remove duplicated include
./kernel/time/posix-stubs.c: linux/syscalls.h is included more than once.

Link: https://lkml.kernel.org/r/20230608082312.123939-1-yang.lee@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5463
Fixes: c1956519cd7e ("syscalls: add sys_ni_posix_timers prototype")
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 17:04:05 -07:00
Andrew Morton 63773d2b59 Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
Linus Torvalds afa4bb778e workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:

  kernel/workqueue.c: In function ‘get_work_pwq’:
  kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
    713 |                 return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
        |                        ^
  [ ... a couple of other cases ... ]

and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect it's some C23-induced enum type handlign fixup in
gcc-13 is the cause.

Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.

The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused.  The mask needs to be 'unsigned long', not some unspecified
enum type.

To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.

That's now how we roll in the kernel.

So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.

Incidentally, this magically makes clang generate better code.  That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.

Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-23 12:08:14 -07:00
Masami Hiramatsu (Google) 53431798f4 tracing/probes: Fix tracepoint event with $arg* to fetch correct argument
To hide the first dummy 'data' argument on the tracepoint probe events,
the BTF argument array was modified (skip the first argument for tracepoint),
but the '$arg*' meta argument parser missed that.

Fix to increment the argument index if it is tracepoint probe. And decrement
the index when searching the type of the argument.

Link: https://lore.kernel.org/all/168657113778.3038017.12245893750241701312.stgit@mhiramat.roam.corp.google.com/

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-23 17:35:49 +09:00
Masami Hiramatsu (Google) ed5f297802 tracing/probes: Fix to return NULL and keep using current argc
Fix to return NULL and keep using current argc when there is
$argN and the BTF is not available.

Link: https://lore.kernel.org/all/168584574094.2056209.2694238431743782342.stgit@mhiramat.roam.corp.google.com/

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306030940.Cej2JoUx-lkp@intel.com/
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-23 17:35:49 +09:00
Jakub Kicinski a7384f3918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

tools/testing/selftests/net/fcnal-test.sh
  d7a2fc1437 ("selftests: net: fcnal-test: check if FIPS mode is enabled")
  dd017c72dd ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.")
https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/
https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22 18:40:38 -07:00
Linus Torvalds 8a28a0b6f1 Networking fixes for 6.4-rc8, including fixes from ipsec, bpf,
mptcp and netfilter.
 
 Current release - regressions:
 
   - netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
 
   - eth: mlx5e:
     - fix scheduling of IPsec ASO query while in atomic
     - free IRQ rmap and notifier on kernel shutdown
 
 Current release - new code bugs:
 
   - phy: manual remove LEDs to ensure correct ordering
 
 Previous releases - regressions:
 
   - mptcp: fix possible divide by zero in recvmsg()
 
   - dsa: revert "net: phy: dp83867: perform soft reset and retain established link"
 
 Previous releases - always broken:
 
   - sched: netem: acquire qdisc lock in netem_change()
 
   - bpf:
     - fix verifier id tracking of scalars on spill
     - fix NULL dereference on exceptions
     - accept function names that contain dots
 
   - netfilter: disallow element updates of bound anonymous sets
 
   - mptcp: ensure listener is unhashed before updating the sk status
 
   - xfrm:
     - add missed call to delete offloaded policies
     - fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets
 
   - selftests: fixes for FIPS mode
 
   - dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling
 
   - eth: sfc: use budget for TX completions
 
 Misc:
 
   - wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmSUZO0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkBjAP/RfTUYdlPqz9jSvz0HmQt2Er39HyVb9I
 pzEpJSQGfO+eyIrlxmleu8cAaW5HdvyfMcBgr04uh+Jf06s+VJrD95IO9zDHHKoC
 86itYNKMS3fSt1ivzg49i5uq66MhjtAcfIOB9HMOAQ2Jd+DYlzyWOOHw28ZAxsBZ
 Q6TU97YEMuU4FdLkoKob1aVswC5cPxNx2IH9NagfbtijaYZqeN9ZX9EI5yMUyH8f
 5gboqOhXUQK0MQLM5TFySHeoayyQ+tRBz24nF0/6lWiRr+xzMTEKdkFpRza7Mxzj
 S8NxN3C+zOf96gic6kYOXmM6y0sOlbwC9JoeWTp8Tuh6DEYi6xLC2XkiYJ51idZg
 PElgRpkM1ddqvvFWFgZlNik5z0vbGnJH7pt0VuOSNntxE60cdQwvWEOr09vvPcS5
 0nMVD0uc8pds2h4hit+sdLltcVnOgoNUYr1/sI6oydofa1BrLnhFPF7z/gUs9foD
 NuCchiaBF11yBGKufcNBNEB4w35g3Kcu6TGhHb168OJi+UnSnwlI0Ccw7iO10pkv
 RjefhR60+wZC6+leo57nZeYqaLQJuALY0QYFsyeM+T0MGSYkbH24CmbNdSmO4MRr
 +VX2CwIqeIds4Hx31o0Feu+FaJqXw46/2nrSDxel/hlCJnGSMXZTw+b/4pFEHLP+
 l71ijZpJqV1S
 =GH2b
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from ipsec, bpf, mptcp and netfilter.

  Current release - regressions:

   - netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain

   - eth: mlx5e:
      - fix scheduling of IPsec ASO query while in atomic
      - free IRQ rmap and notifier on kernel shutdown

  Current release - new code bugs:

   - phy: manual remove LEDs to ensure correct ordering

  Previous releases - regressions:

   - mptcp: fix possible divide by zero in recvmsg()

   - dsa: revert "net: phy: dp83867: perform soft reset and retain
     established link"

  Previous releases - always broken:

   - sched: netem: acquire qdisc lock in netem_change()

   - bpf:
      - fix verifier id tracking of scalars on spill
      - fix NULL dereference on exceptions
      - accept function names that contain dots

   - netfilter: disallow element updates of bound anonymous sets

   - mptcp: ensure listener is unhashed before updating the sk status

   - xfrm:
      - add missed call to delete offloaded policies
      - fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets

   - selftests: fixes for FIPS mode

   - dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling

   - eth: sfc: use budget for TX completions

  Misc:

   - wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0"

* tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
  revert "net: align SO_RCVMARK required privileges with SO_MARK"
  net: wwan: iosm: Convert single instance struct member to flexible array
  sch_netem: acquire qdisc lock in netem_change()
  selftests: forwarding: Fix race condition in mirror installation
  wifi: mac80211: report all unusable beacon frames
  mptcp: ensure listener is unhashed before updating the sk status
  mptcp: drop legacy code around RX EOF
  mptcp: consolidate fallback and non fallback state machine
  mptcp: fix possible list corruption on passive MPJ
  mptcp: fix possible divide by zero in recvmsg()
  mptcp: handle correctly disconnect() failures
  bpf: Force kprobe multi expected_attach_type for kprobe_multi link
  bpf/btf: Accept function names that contain dots
  Revert "net: phy: dp83867: perform soft reset and retain established link"
  net: mdio: fix the wrong parameters
  netfilter: nf_tables: Fix for deleting base chains with payload
  netfilter: nfnetlink_osf: fix module autoload
  netfilter: nf_tables: drop module reference after updating chain
  netfilter: nf_tables: disallow timeout for anonymous sets
  netfilter: nf_tables: disallow updates of anonymous sets
  ...
2023-06-22 17:59:51 -07:00
Linus Torvalds 5950a0066f cgroup: Fixes for v6.4-rc7
It's late but here are two bug fixes. Both fix problems which can be severe
 but are very confined in scope. The risk to most use cases should be
 minimal.
 
 * Fix for an old bug which triggers if a cgroup subsystem is remounted to a
   different hierarchy while someone is reading its cgroup.procs/tasks file.
   The risk is pretty low given how seldom cgroup subsystems are moved across
   hierarchies.
 
 * We moved cpus_read_lock() outside of cgroup internal locks a while ago but
   forgot to update the legacy_freezer leading to lockdep triggers. Fixed.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJNz6g4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGS9zAP9lGszI1Zgvjz+qlU0dmE96yUEuqEg7Tfwcqxr3
 Y+hHyAEArgNGnCoPfu4NAWQDZ31AgPUdL8EFqx6pY9Vq9R0oFg0=
 =uaB+
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "It's late but here are two bug fixes. Both fix problems which can be
  severe but are very confined in scope. The risk to most use cases
  should be minimal.

   - Fix for an old bug which triggers if a cgroup subsystem is
     remounted to a different hierarchy while someone is reading its
     cgroup.procs/tasks file. The risk is pretty low given how seldom
     cgroup subsystems are moved across hierarchies.

   - We moved cpus_read_lock() outside of cgroup internal locks a while
     ago but forgot to update the legacy_freezer leading to lockdep
     triggers. Fixed"

* tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Do not corrupt task iteration when rebinding subsystem
  cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
2023-06-22 17:27:16 -07:00
Tejun Heo 81621430c8 Revert "cgroup: Avoid -Wstringop-overflow warnings"
This reverts commit 36de5f303c.

The commit caused boot failures on some configurations due to cgroup
hierarchies not being created at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-22 08:51:14 -10:00
Sohil Mehta 4dd595c34c syscalls: Remove file path comments from headers
Source file locations for syscall definitions can change over a period
of time. File paths in comments get stale and are hard to maintain long
term. Also, their usefulness is questionable since it would be easier to
locate a syscall definition using the SYSCALL_DEFINEx() macro.

Remove all source file path comments from the syscall headers. Also,
equalize the uneven line spacing (some of which is introduced due to the
deletions).

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-06-22 17:10:09 +02:00
Azeem Shaikh 38638ffa60 tracing/boot: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().

Direct replacement is safe here since return value of -E2BIG
is used to check for truncation instead of sizeof(dest).

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Link: https://lore.kernel.org/linux-trace-kernel/20230613004125.3539934-1-azeemshaikh38@gmail.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-22 10:39:56 -04:00
Daniel Bristot de Oliveira e88ed227f6 tracing/timerlat: Add user-space interface
Going a step further, we propose a way to use any user-space
workload as the task waiting for the timerlat timer. This is done
via a per-CPU file named osnoise/cpu$id/timerlat_fd file.

The tracef_fd allows a task to open at a time. When a task reads
the file, the timerlat timer is armed for future osnoise/timerlat_period_us
time. When the timer fires, it prints the IRQ latency and
wakes up the user-space thread waiting in the timerlat_fd.

The thread then starts to run, executes the timerlat measurement, prints
the thread scheduling latency and returns to user-space.

When the thread rereads the timerlat_fd, the tracer will print the
user-ret(urn) latency, which is an additional metric.

This additional metric is also traced by the tracer and can be used, for
example of measuring the context switch overhead from kernel-to-user and
user-to-kernel, or the response time for an arbitrary execution in
user-space.

The tracer supports one thread per CPU, the thread must be pinned to
the CPU, and it cannot migrate while holding the timerlat_fd. The reason
is that the tracer is per CPU (nothing prohibits the tracer from
allowing migrations in the future). The tracer monitors the migration
of the thread and disables the tracer if detected.

The timerlat_fd is only available for opening/reading when timerlat
tracer is enabled, and NO_OSNOISE_WORKLOAD is set.

The simplest way to activate this feature from user-space is:

 -------------------------------- %< -----------------------------------
 int main(void)
 {
	char buffer[1024];
	int timerlat_fd;
	int retval;
	long cpu = 0;	/* place in CPU 0 */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	if (sched_setaffinity(gettid(), sizeof(set), &set) == -1)
		return 1;

	snprintf(buffer, sizeof(buffer),
		"/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd",
		cpu);

	timerlat_fd = open(buffer, O_RDONLY);
	if (timerlat_fd < 0) {
		printf("error opening %s: %s\n", buffer, strerror(errno));
		exit(1);
	}

	for (;;) {
		retval = read(timerlat_fd, buffer, 1024);
		if (retval < 0)
			break;
	}

	close(timerlat_fd);
	exit(0);
}
 -------------------------------- >% -----------------------------------

When disabling timerlat, if there is a workload holding the timerlat_fd,
the SIGKILL will be sent to the thread.

Link: https://lkml.kernel.org/r/69fe66a863d2792ff4c3a149bf9e32e26468bb3a.1686063934.git.bristot@kernel.org

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: William White <chwhite@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-22 10:39:56 -04:00
Daniel Bristot de Oliveira cb7ca871c8 tracing/osnoise: Skip running osnoise if all instances are off
In the case of all tracing instances being off, sleep for the entire
period.

 Q: Why not kill all threads so?
 A: It is valid and useful to start the threads with tracing off.
For example, rtla disables tracing, starts the tracer, applies the
scheduling setup to the threads, e.g., sched priority and cgroup,
and then begin tracing with all set.

Skipping the period helps to speed up rtla setup and save the
trace after a stop tracing.

Link: https://lkml.kernel.org/r/aa4dd9b7e76fcb63901fe5407e15ec002b318599.1686063934.git.bristot@kernel.org

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: William White <chwhite@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-22 10:39:56 -04:00
Daniel Bristot de Oliveira 4998e7fda1 tracing/osnoise: Switch from PF_NO_SETAFFINITY to migrate_disable
Currently, osnoise/timerlat threads run with PF_NO_SETAFFINITY set.
It works well, however, cgroups do not allow PF_NO_SETAFFINITY threads
to be accepted, and this creates a limitation to osnoise/timerlat.

To avoid this limitation, disable migration of the threads as soon
as they start to run, and then clean the PF_NO_SETAFFINITY flag (still)
used during thread creation.

If for some reason a thread migration is requested, e.g., via
sched_settafinity, the tracer thread will notice and exit.

Link: https://lkml.kernel.org/r/8ba8bc9c15b3ea40cf73cf67a9bc061a264609f0.1686063934.git.bristot@kernel.org

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: William White <chwhite@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-22 10:39:56 -04:00
Jiri Olsa 83f74441bc ftrace: Show all functions with addresses in available_filter_functions_addrs
Adding new available_filter_functions_addrs file that shows all available
functions (same as available_filter_functions) together with addresses,
like:

  # cat available_filter_functions_addrs | head
  ffffffff81000770 __traceiter_initcall_level
  ffffffff810007c0 __traceiter_initcall_start
  ffffffff81000810 __traceiter_initcall_finish
  ffffffff81000860 trace_initcall_finish_cb
  ...

Note displayed address is the patch-site address and can differ from
/proc/kallsyms address.

It's useful to have address avilable for traceable symbols, so we don't
need to allways cross check kallsyms with available_filter_functions
(or the other way around) and have all the data in single file.

For backwards compatibility reasons we can't change the existing
available_filter_functions file output, but we need to add new file.

The problem is that we need to do 2 passes:

 - through available_filter_functions and find out if the function is traceable
 - through /proc/kallsyms to get the address for traceable function

Having available_filter_functions symbols together with addresses allow
us to skip the kallsyms step and we are ok with the address in
available_filter_functions_addr not being the function entry, because
kprobe_multi uses fprobe and that handles both entry and patch-site
address properly.

We have 2 interfaces how to create kprobe_multi link:

  a) passing symbols to kernel

     1) user gathers symbols and need to ensure that they are
        trace-able -> pass through available_filter_functions file

     2) kernel takes those symbols and translates them to addresses
        through kallsyms api

     3) addresses are passed to fprobe/ftrace through:

         register_fprobe_ips
         -> ftrace_set_filter_ips

  b) passing addresses to kernel

     1) user gathers symbols and needs to ensure that they are
        trace-able -> pass through available_filter_functions file

     2) user takes those symbols and translates them to addresses
       through /proc/kallsyms

     3) addresses are passed to the kernel and kernel calls:

         register_fprobe_ips
         -> ftrace_set_filter_ips

The new available_filter_functions_addrs file helps us with option b),
because we can make 'b 1' and 'b 2' in one step - while filtering traceable
functions, we get the address directly.

Link: https://lore.kernel.org/linux-trace-kernel/20230611130029.1202298-1-jolsa@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Tested-by: Jackie Liu <liuyun01@kylinos.cn> # x86
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-22 10:39:16 -04:00
Masahiro Yamada ddb5cdbafa kbuild: generate KSYMTAB entries by modpost
Commit 7b4537199a ("kbuild: link symbol CRCs at final link, removing
CONFIG_MODULE_REL_CRCS") made modpost output CRCs in the same way
whether the EXPORT_SYMBOL() is placed in *.c or *.S.

For further cleanups, this commit applies a similar approach to the
entire data structure of EXPORT_SYMBOL().

The EXPORT_SYMBOL() compilation is split into two stages.

When a source file is compiled, EXPORT_SYMBOL() will be converted into
a dummy symbol in the .export_symbol section.

For example,

    EXPORT_SYMBOL(foo);
    EXPORT_SYMBOL_NS_GPL(bar, BAR_NAMESPACE);

will be encoded into the following assembly code:

    .section ".export_symbol","a"
    __export_symbol_foo:
            .asciz ""                      /* license */
            .asciz ""                      /* name space */
            .balign 8
            .quad foo                      /* symbol reference */
    .previous

    .section ".export_symbol","a"
    __export_symbol_bar:
            .asciz "GPL"                   /* license */
            .asciz "BAR_NAMESPACE"         /* name space */
            .balign 8
            .quad bar                      /* symbol reference */
    .previous

They are mere markers to tell modpost the name, license, and namespace
of the symbols. They will be dropped from the final vmlinux and modules
because the *(.export_symbol) will go into /DISCARD/ in the linker script.

Then, modpost extracts all the information about EXPORT_SYMBOL() from the
.export_symbol section, and generates the final C code:

    KSYMTAB_FUNC(foo, "", "");
    KSYMTAB_FUNC(bar, "_gpl", "BAR_NAMESPACE");

KSYMTAB_FUNC() (or KSYMTAB_DATA() if it is data) is expanded to struct
kernel_symbol that will be linked to the vmlinux or a module.

With this change, EXPORT_SYMBOL() works in the same way for *.c and *.S
files, providing the following benefits.

[1] Deprecate EXPORT_DATA_SYMBOL()

In the old days, EXPORT_SYMBOL() was only available in C files. To export
a symbol in *.S, EXPORT_SYMBOL() was placed in a separate *.c file.
arch/arm/kernel/armksyms.c is one example written in the classic manner.

Commit 22823ab419 ("EXPORT_SYMBOL() for asm") removed this limitation.
Since then, EXPORT_SYMBOL() can be placed close to the symbol definition
in *.S files. It was a nice improvement.

However, as that commit mentioned, you need to use EXPORT_DATA_SYMBOL()
for data objects on some architectures.

In the new approach, modpost checks symbol's type (STT_FUNC or not),
and outputs KSYMTAB_FUNC() or KSYMTAB_DATA() accordingly.

There are only two users of EXPORT_DATA_SYMBOL:

  EXPORT_DATA_SYMBOL_GPL(empty_zero_page)    (arch/ia64/kernel/head.S)
  EXPORT_DATA_SYMBOL(ia64_ivt)               (arch/ia64/kernel/ivt.S)

They are transformed as follows and output into .vmlinux.export.c

  KSYMTAB_DATA(empty_zero_page, "_gpl", "");
  KSYMTAB_DATA(ia64_ivt, "", "");

The other EXPORT_SYMBOL users in ia64 assembly are output as
KSYMTAB_FUNC().

EXPORT_DATA_SYMBOL() is now deprecated.

[2] merge <linux/export.h> and <asm-generic/export.h>

There are two similar header implementations:

  include/linux/export.h        for .c files
  include/asm-generic/export.h  for .S files

Ideally, the functionality should be consistent between them, but they
tend to diverge.

Commit 8651ec01da ("module: add support for symbol namespaces.") did
not support the namespace for *.S files.

This commit shifts the essential implementation part to C, which supports
EXPORT_SYMBOL_NS() for *.S files.

<asm/export.h> and <asm-generic/export.h> will remain as a wrapper of
<linux/export.h> for a while.

They will be removed after #include <asm/export.h> directives are all
replaced with #include <linux/export.h>.

[3] Implement CONFIG_TRIM_UNUSED_KSYMS in one-pass algorithm (by a later commit)

When CONFIG_TRIM_UNUSED_KSYMS is enabled, Kbuild recursively traverses
the directory tree to determine which EXPORT_SYMBOL to trim. If an
EXPORT_SYMBOL turns out to be unused by anyone, Kbuild begins the
second traverse, where some source files are recompiled with their
EXPORT_SYMBOL() tuned into a no-op.

We can do this better now; modpost can selectively emit KSYMTAB entries
that are really used by modules.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-06-22 21:17:10 +09:00
Ben Dooks ccaa4926c2 hrtimer: Add missing sparse annotations to hrtimer locking
Sparse warns about lock imbalance vs. the hrtimer_base lock due to missing
sparse annotations:

kernel/time/hrtimer.c:175:33: warning: context imbalance in 'lock_hrtimer_base' - wrong count at exit
kernel/time/hrtimer.c:1301:28: warning: context imbalance in 'hrtimer_start_range_ns' - unexpected unlock
kernel/time/hrtimer.c:1336:28: warning: context imbalance in 'hrtimer_try_to_cancel' - unexpected unlock
kernel/time/hrtimer.c:1457:9: warning: context imbalance in '__hrtimer_get_remaining' - unexpected unlock

Add the annotations to the relevant functions.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230621075928.394481-1-ben.dooks@codethink.co.uk
2023-06-22 10:32:37 +02:00
Jakub Kicinski 59bb14bda2 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJK5DwAKCRDbK58LschI
 gyUtAQD4gT4BEVHRqvniw9yyqYo0BvElAznutDq7o9kFHFep2gEAoksEWS84OdZj
 0L5mSKjXrpHKzmY/jlMrVIcTb3VzOw0=
 =gAYE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-06-21

We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 181 insertions(+), 15 deletions(-).

The main changes are:

1) Fix a verifier id tracking issue with scalars upon spill,
   from Maxim Mikityanskiy.

2) Fix NULL dereference if an exception is generated while a BPF
   subprogram is running, from Krister Johansen.

3) Fix a BTF verification failure when compiling kernel with LLVM_IAS=0,
   from Florent Revest.

4) Fix expected_attach_type enforcement for kprobe_multi link,
   from Jiri Olsa.

5) Fix a bpf_jit_dump issue for x86_64 to pick the correct JITed image,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Force kprobe multi expected_attach_type for kprobe_multi link
  bpf/btf: Accept function names that contain dots
  selftests/bpf: add a test for subprogram extables
  bpf: ensure main program has an extable
  bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable.
  selftests/bpf: Add test cases to assert proper ID tracking on spill
  bpf: Fix verifier id tracking of scalars on spill
====================

Link: https://lore.kernel.org/r/20230621101116.16122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21 13:59:46 -07:00
LeiZhou-97 e973dfe929 cgroup/misc: Expose misc.current on cgroup v2 root
Hello,

This patch is to expose misc.current on cgroup v2 root for tracking
how much of the resource has been consumed in total on the system.

Most of the cloud infrastucture use cgroup to fetch the host
information for scheduling purpose.

Currently, the misc controller can be used by Intel TDX HKIDs and
AMD SEV ASIDs, which are both used for creating encrypted VMs.
Intel TDX and AMD SEV are mostly be used by the cloud providers
for providing confidential VMs.

In actual use of a server, these confidential VMs may be launched
in different ways. For the cloud solution, there are kubvirt and
coco (tracked by kubepods.slice); on host, they can be booted
directly through qemu by end user (tracked by user.slice), etc.

In this complex environment, when wanting to know how many resource
is used in total it has to iterate through all existing slices to
get the value of each misc.current and add them up to calculate
the total number of consumed keys.

So exposing misc.current to root cgroup tends to give much easier
when calculates how much resource has been used in total, which
helps to schedule and count resources for the cloud infrastucture.

Signed-off-by: LeiZhou-97 <lei.zhou@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21 10:43:29 -10:00
Gustavo A. R. Silva 36de5f303c cgroup: Avoid -Wstringop-overflow warnings
Address the following -Wstringop-overflow warnings seen when
built with ARM architecture and aspeed_g4_defconfig configuration
(notice that under this configuration CGROUP_SUBSYS_COUNT == 0):
kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]

These changes are based on commit d20d30ebb1 ("cgroup: Avoid compiler
warnings with no subsystems").

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21 10:41:44 -10:00
Linus Torvalds dad9774dea A single regression fix for a regression fix:
For a long time the tick was aligned to clock MONOTONIC so that the tick
   event happened at a multiple of nanoseconds per tick starting from clock
   MONOTONIC = 0.
 
   At some point this changed as the refined jiffies clocksource which is
   used during boot before the TSC or other clocksources becomes usable, was
   adjusted with a boot offset, so that time 0 is closer to the point where
   the kernel starts.
 
   This broke the assumption in the tick code that when the tick setup
   happens early on ktime_get() will return a multiple of nanoseconds per
   tick. As a consequence applications which aligned their periodic
   execution so that it does not collide with the tick were not longer
   guaranteed that the tick period starts from time 0.
 
   The fix for this regression was to realign the tick when it is initially
   set up to a multiple of tick periods. That works as long as the
   underlying tick device supports periodic mode, but breaks under certain
   conditions when the tick device supports only one shot mode.
 
   Depending on the offset, the alignment delta to clock MONOTONIC can get
   in a range where the minimal programming delta of the underlying clock
   event device is larger than the calculated delta to the next tick. This
   results in a boot hang as the tick code tries to play catch up, but as
   the tick never fires jiffies are not advanced so it keeps trying for
   ever.
 
   Solve this by moving the tick alignement into the NOHZ / HIGHRES
   enablement code because at that point it is guaranteed that the
   underlying clocksource is high resolution capable and not longer
   depending on the tick.
 
   This is far before user space starts, so at the point where applications
   try to align their timers, the old behaviour of the tick happening at a
   multiple of nanoseconds per tick starting from clock MONOTONIC = 0 is
   restored.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSTUNQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQQjD/4+zHK7IxRK+1k1lIStxksEudVeNuqK
 7RZKNTawZFSChrxGmv5uhmqYDK9/tudnZot+uyAXI81JHnnGSXbco+VrAA2DQ2v2
 w/oxhAHIAobJWDzUuCqT2YsseDVNoTGMrTTPrL4klXqhVtShVOLTI/603+macEiw
 Olqmd7XxlyhCloNqoMIh18EwSNIs78kP3YAQHUmq3NoRkJT5c3wWKvZy38WOxSJ9
 Jjpw75tePfcwWte9fGVq+oJ8hOWkPAKN5hFL8420RrEgdIkxKlhVmuyG48ieZKH1
 LE1brB5iaU8+M91PevZMRlD7oio8u/d6xX1vT3Cqz9UBe1NLj1I0ToX6COhwSNxm
 50U0HJ5wGlpodu4IyBHXTG45UAVwpOg/EgDlFigeP3Nxl6h45ugrA010r6FqE1zV
 KltjNGsk/OD5d8ywW+yManU2mySm9znqmWN2zbjRGm138Tuz7vglcBhFB6jEJFUF
 V8vYPLwI4J2gqifx8NoFBIfLuZp/+D4LOQiJ7XRc9CVm3JLK8Xi8fxnHGTBwJHIn
 45WU4hzmY7fjR/gGhAyG7lH6tlo7VTC9/lrO+YtAy2hL2sK14+m/ZH7836I5bAhw
 EgSyVlcfR6tgO55I3JgkbNIyO5TUAZPZIAY1TlyR53xXvMkb9aZTh9UMm1xMJ3uX
 +sQaLAuDFNUQSg==
 =/uvx
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single regression fix for a regression fix:

  For a long time the tick was aligned to clock MONOTONIC so that the
  tick event happened at a multiple of nanoseconds per tick starting
  from clock MONOTONIC = 0.

  At some point this changed as the refined jiffies clocksource which is
  used during boot before the TSC or other clocksources becomes usable,
  was adjusted with a boot offset, so that time 0 is closer to the point
  where the kernel starts.

  This broke the assumption in the tick code that when the tick setup
  happens early on ktime_get() will return a multiple of nanoseconds per
  tick. As a consequence applications which aligned their periodic
  execution so that it does not collide with the tick were not longer
  guaranteed that the tick period starts from time 0.

  The fix for this regression was to realign the tick when it is
  initially set up to a multiple of tick periods. That works as long as
  the underlying tick device supports periodic mode, but breaks under
  certain conditions when the tick device supports only one shot mode.

  Depending on the offset, the alignment delta to clock MONOTONIC can
  get in a range where the minimal programming delta of the underlying
  clock event device is larger than the calculated delta to the next
  tick. This results in a boot hang as the tick code tries to play catch
  up, but as the tick never fires jiffies are not advanced so it keeps
  trying for ever.

  Solve this by moving the tick alignement into the NOHZ / HIGHRES
  enablement code because at that point it is guaranteed that the
  underlying clocksource is high resolution capable and not longer
  depending on the tick.

  This is far before user space starts, so at the point where
  applications try to align their timers, the old behaviour of the tick
  happening at a multiple of nanoseconds per tick starting from clock
  MONOTONIC = 0 is restored"

* tag 'timers-urgent-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/common: Align tick period during sched_timer setup
2023-06-21 12:36:34 -07:00
Miaohe Lin a04de42460 cgroup: remove obsolete comment on cgroup_on_dfl()
The debug feature is supported since commit 8cc38fa7fa ("cgroup: make
debug an implicit controller on cgroup2"), update corresponding comment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21 09:00:03 -10:00
Marc Zyngier a82f3119d5 Merge branch irq/misc-6.5 into irq/irqchip-next
* irq/misc-6.5:
  : .
  : Misc cleanups:
  :
  : - Add a number of missing prototypes
  : - Mark global symbol as static where needed
  : - Drop some now useless non-DT code paths
  : - Add a missing interrupt mapping to the STM32 irqchip
  : - Silence another STM32 warning when building with W=1
  : - Fix the jcore-aic driver that actually never worked...
  : .
  Revert "irqchip/mxs: Include linux/irqchip/mxs.h"
  irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
  irqchip/stm32-exti: Fix warning on initialized field overwritten
  irqchip/stm32-exti: Add STM32MP15xx IWDG2 EXTI to GIC map
  irqchip/gicv3: Add a iort_pmsi_get_dev_id() prototype
  irqchip/mxs: Include linux/irqchip/mxs.h
  irqchip/clps711x: Remove unused clps711x_intc_init() function
  irqchip/mmp: Remove non-DT codepath
  irqchip/ftintc010: Mark all function static
  irqdomain: Include internals.h for function prototypes

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-06-21 13:53:41 +01:00
Jiri Olsa db8eae6bc5 bpf: Force kprobe multi expected_attach_type for kprobe_multi link
We currently allow to create perf link for program with
expected_attach_type == BPF_TRACE_KPROBE_MULTI.

This will cause crash when we call helpers like get_attach_cookie or
get_func_ip in such program, because it will call the kprobe_multi's
version (current->bpf_ctx context setup) of those helpers while it
expects perf_link's current->bpf_ctx context setup.

Making sure that we use BPF_TRACE_KPROBE_MULTI expected_attach_type
only for programs attaching through kprobe_multi link.

Fixes: ca74823c6e ("bpf: Add cookie support to programs attached with kprobe multi link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org
2023-06-21 10:40:26 +02:00
Florent Revest 9724160b39 bpf/btf: Accept function names that contain dots
When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM
leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn,
pahole creates BTF_KIND_FUNC entries for these and this makes the BTF
metadata validation fail because they contain a dot.

In a dramatic turn of event, this BTF verification failure can cause
the netfilter_bpf initialization to fail, causing netfilter_core to
free the netfilter_helper hashmap and netfilter_ftp to trigger a
use-after-free. The risk of u-a-f in netfilter will be addressed
separately but the existence of "asan.module_ctor" debug info under some
build conditions sounds like a good enough reason to accept functions
that contain dots in BTF.

Although using only LLVM=1 is the recommended way to compile clang-based
kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still
try to support that combination according to Nick. To clarify:

  - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended,
    but user can still have LLVM=1, LLVM_IAS=0 to trigger the issue

  - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in
    which case GNU as will be used

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Yonghong Song <yhs@meta.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org
2023-06-21 10:32:22 +02:00
Donglin Peng a1be9ccc57 function_graph: Support recording and printing the return value of function
Analyzing system call failures with the function_graph tracer can be a
time-consuming process, particularly when locating the kernel function
that first returns an error in the trace logs. This change aims to
simplify the process by recording the function return value to the
'retval' member of 'ftrace_graph_ret' and printing it when outputting
the trace log.

We have introduced new trace options: funcgraph-retval and
funcgraph-retval-hex. The former controls whether to display the return
value, while the latter controls the display format.

Please note that even if a function's return type is void, a return
value will still be printed. You can simply ignore it.

This patch only establishes the fundamental infrastructure. Subsequent
patches will make this feature available on some commonly used processor
architectures.

Here is an example:

I attempted to attach the demo process to a cpu cgroup, but it failed:

echo `pidof demo` > /sys/fs/cgroup/cpu/test/tasks
-bash: echo: write error: Invalid argument

The strace logs indicate that the write system call returned -EINVAL(-22):
...
write(1, "273\n", 4)                    = -1 EINVAL (Invalid argument)
...

To capture trace logs during a write system call, use the following
commands:

cd /sys/kernel/debug/tracing/
echo 0 > tracing_on
echo > trace
echo *sys_write > set_graph_function
echo *spin* > set_graph_notrace
echo *rcu* >> set_graph_notrace
echo *alloc* >> set_graph_notrace
echo preempt* >> set_graph_notrace
echo kfree* >> set_graph_notrace
echo $$ > set_ftrace_pid
echo function_graph > current_tracer
echo 1 > options/funcgraph-retval
echo 0 > options/funcgraph-retval-hex
echo 1 > tracing_on
echo `pidof demo` > /sys/fs/cgroup/cpu/test/tasks
echo 0 > tracing_on
cat trace > ~/trace.log

To locate the root cause, search for error code -22 directly in the file
trace.log and identify the first function that returned -22. Once you
have identified this function, examine its code to determine the root
cause.

For example, in the trace log below, cpu_cgroup_can_attach
returned -22 first, so we can focus our analysis on this function to
identify the root cause.

...

 1)          | cgroup_migrate() {
 1) 0.651 us |   cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
 1)          |   cgroup_migrate_execute() {
 1)          |     cpu_cgroup_can_attach() {
 1)          |       cgroup_taskset_first() {
 1) 0.732 us |         cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
 1) 1.232 us |       } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
 1) 0.380 us |       sched_rt_can_attach(); /* = 0x0 */
 1) 2.335 us |     } /* cpu_cgroup_can_attach = -22 */
 1) 4.369 us |   } /* cgroup_migrate_execute = -22 */
 1) 7.143 us | } /* cgroup_migrate = -22 */

...

Link: https://lkml.kernel.org/r/1fc502712c981e0e6742185ba242992170ac9da8.1680954589.git.pengdonglin@sangfor.com.cn

Tested-by: Florian Kauer <florian.kauer@linutronix.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Donglin Peng <pengdonglin@sangfor.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-20 18:38:37 -04:00
Steven Rostedt (Google) f3d40e6545 fgraph: Add declaration of "struct fgraph_ret_regs"
In final testing of:

  1fc502712c.1680954589.git.pengdonglin@sangfor.com.cn/
  "function_graph: Support recording and printing the return value of function"

The test failed due to a new warning found in the build:

kernel/trace/fgraph.c:243:56: warning: ‘struct fgraph_ret_regs’ declared inside parameter list will not be visible outside of this definition or declaration

Instead of asking to send another patch series, just add it and then apply
the updates.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-20 18:34:55 -04:00
Linus Torvalds 2e30b97343 Tracing fixes for 6.4:
- Fix MAINTAINERS file to point to proper mailing list for rtla and rv
     The mailing list pointed to linux-trace-devel instead of
     linux-trace-kernel. The former is for the tracing libraries
     and the latter is for anything in the Linux kernel tree.
     The wrong mailing list was used because linux-trace-kernel did not
     exist when rtla and rv were created.
 
  - User events:
    . Fix matching of dynamic events to their user events
      When user writes to dynamic_events file, a lookup of the
      registered dynamic events are made, but there were some cases
      that a match could be incorrectly made.
 
    . Add auto cleanup of user events
      Have the user events automatically get removed when the last
      reference (file descriptor) is closed. This was asked for to
      prevent leaks of user events hanging around needing admins
      to clean them up.
 
    . Add persistent logic (but not let user space use it yet)
      In some cases, having a persistent user event (one that does not
      get cleaned up automatically) is useful. But there's still
      debates about how to expose this to user space. The infrastructure
      is added, but the API is not.
 
    . Update the selftests
      Update the user event selftests to reflect the above changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZJGrABQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgRDAQDvF8ktlcn+gqyUxt2OcTlbBh0jqS0b
 FKXYdq6FTgfWYQD/ctunFbPdzn4D6Kc/lG8p4QxpMmtA19BUOPwEt3CkAwM=
 =CDzr
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix MAINTAINERS file to point to proper mailing list for rtla and rv

   The mailing list pointed to linux-trace-devel instead of
   linux-trace-kernel. The former is for the tracing libraries and the
   latter is for anything in the Linux kernel tree. The wrong mailing
   list was used because linux-trace-kernel did not exist when rtla and
   rv were created.

 - User events:

    - Fix matching of dynamic events to their user events

      When user writes to dynamic_events file, a lookup of the
      registered dynamic events is made, but there were some cases that
      a match could be incorrectly made.

    - Add auto cleanup of user events

      Have the user events automatically get removed when the last
      reference (file descriptor) is closed. This was asked for to
      prevent leaks of user events hanging around needing admins to
      clean them up.

    - Add persistent logic (but not let user space use it yet)

      In some cases, having a persistent user event (one that does not
      get cleaned up automatically) is useful. But there's still debates
      about how to expose this to user space. The infrastructure is
      added, but the API is not.

    - Update the selftests

      Update the user event selftests to reflect the above changes"

* tag 'trace-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/user_events: Document auto-cleanup and remove dyn_event refs
  selftests/user_events: Adapt dyn_test to non-persist events
  selftests/user_events: Ensure auto cleanup works as expected
  tracing/user_events: Add auto cleanup and future persist flag
  tracing/user_events: Track refcount consistently via put/get
  tracing/user_events: Store register flags on events
  tracing/user_events: Remove user_ns walk for groups
  selftests/user_events: Add perf self-test for empty arguments events
  selftests/user_events: Clear the events after perf self-test
  selftests/user_events: Add ftrace self-test for empty arguments events
  tracing/user_events: Fix the incorrect trace record for empty arguments events
  tracing: Modify print_fields() for fields output order
  tracing/user_events: Handle matching arguments that is null from dyn_events
  tracing/user_events: Prevent same name but different args event
  tracing/rv/rtla: Update MAINTAINERS file to point to proper mailing list
2023-06-20 15:01:08 -07:00
Petr Mladek 47f4cb4339 watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
The HAVE_ prefix means that the code could be enabled. Add another
variable for HAVE_HARDLOCKUP_DETECTOR_SPARC64 without this prefix.
It will be set when it should be built. It will make it compatible
with the other hardlockup detectors.

Before, it is far from obvious that the SPARC64 variant is actually used:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y

After, it is more clear:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y

Link: https://lkml.kernel.org/r/20230616150618.6073-6-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:29 -07:00
Petr Mladek a5fcc2367e watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
There are several hardlockup detector implementations and several Kconfig
values which allow selection and build of the preferred one.

CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c
("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36.
It was a preparation step for introducing the new generic perf hardlockup
detector.

The existing arch-specific variants did not support the to-be-created
generic build configurations, sysctl interface, etc. This distinction
was made explicit by the commit 4a7863cc2e ("x86, nmi_watchdog:
Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR")
in v2.6.38.

CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c69
("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced
the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used
by three architectures, namely blackfin, mn10300, and sparc.

The support for blackfin and mn10300 architectures has been completely
dropped some time ago. And sparc is the only architecture with the historic
NMI watchdog at the moment.

And the old sparc implementation is really special. It is always built on
sparc64. It used to be always enabled until the commit 7a5c8b57ce
("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added
in v4.10-rc1.

There are only few locations where the sparc64 NMI watchdog interacts
with the generic hardlockup detectors code:

  + implements arch_touch_nmi_watchdog() which is called from the generic
    touch_nmi_watchdog()

  + implements watchdog_hardlockup_enable()/disable() to support
    /proc/sys/kernel/nmi_watchdog

  + is always preferred over other generic watchdogs, see
    CONFIG_HARDLOCKUP_DETECTOR

  + includes asm/nmi.h into linux/nmi.h because some sparc-specific
    functions are needed in sparc-specific code which includes
    only linux/nmi.h.

The situation became more complicated after the commit 05a4a95279
("kernel/watchdog: split up config options") and commit 2104180a53
("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1.
They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc
specific hardlockup detector. It was compatible with the perf one
regarding the general boot, sysctl, and programming interfaces.

HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of
HAVE_NMI_WATCHDOG. It made some sense because all arch-specific
detectors had some common requirements, namely:

  + implemented arch_touch_nmi_watchdog()
  + included asm/nmi.h into linux/nmi.h
  + defined the default value for /proc/sys/kernel/nmi_watchdog

But it actually has made things pretty complicated when the generic
buddy hardlockup detector was added. Before the generic perf detector
was newer supported together with an arch-specific one. But the buddy
detector could work on any SMP system. It means that an architecture
could support both the arch-specific and buddy detector.

As a result, there are few tricky dependencies. For example,
CONFIG_HARDLOCKUP_DETECTOR depends on:

  ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH

The problem is that the very special sparc implementation is defined as:

  HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH

Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear
without reading understanding the history.

Make the logic less tricky and more self-explanatory by making
HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to
HAVE_HARDLOCKUP_DETECTOR_SPARC64.

Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF,
and HARDLOCKUP_DETECTOR_BUDDY may conflict only with
HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR
and it is not longer enabled when HAVE_NMI_WATCHDOG is set.

Link: https://lkml.kernel.org/r/20230616150618.6073-5-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:29 -07:00
Douglas Anderson 28168eca32 watchdog/hardlockup: move SMP barriers from common code to buddy code
It's been suggested that since the SMP barriers are only potentially
useful for the buddy hardlockup detector, not the perf hardlockup
detector, that the barriers belong in the buddy code.  Let's move them and
add clearer comments about why they're needed.

Link: https://lkml.kernel.org/r/20230526184139.9.I5ab0a0eeb0bd52fb23f901d298c72fa5c396e22b@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:28 -07:00
Douglas Anderson 813efda239 watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
There's no reason to make a copy of the "watchdog_cpus" locally in
watchdog_next_cpu().  Making a copy wouldn't make things any more race
free and we're just reading the value so there's no need for a copy.

Link: https://lkml.kernel.org/r/20230526184139.7.If466f9a2b50884cbf6a1d8ad05525a2c17069407@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:28 -07:00
Douglas Anderson d3b62ace0f watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
In the patch ("watchdog/hardlockup: detect hard lockups using secondary
(buddy) CPUs"), we added a call from the common watchdog.c file into the
buddy.  That call could be done more cleanly.  Specifically:

1. If we move the call into watchdog_hardlockup_kick() then it keeps
   watchdog_timer_fn() simpler.
2. We don't need to pass an "unsigned long" to the buddy for the timer
   count. In the patch ("watchdog/hardlockup: add a "cpu" param to
   watchdog_hardlockup_check()") the count was changed to "atomic_t"
   which is backed by an int, so we should match types.

Link: https://lkml.kernel.org/r/20230526184139.6.I006c7d958a1ea5c4e1e4dc44a25596d9bb5fd3ba@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:27 -07:00
Douglas Anderson 7a71d8e650 watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
In the patch ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") we started using a cpumask to keep track of
which CPUs to backtrace.  When setting up this cpumask, it's better to use
cpumask_copy() than to just copy the structure directly.  Fix this.

Link: https://lkml.kernel.org/r/20230526184139.4.Iccee2d1ea19114dafb6553a854ea4d8ab2a3f25b@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:27 -07:00
Douglas Anderson 2711e4adef watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
In the patch ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") there was no reason to use raw_cpu_ptr(). 
Using this_cpu_ptr() works fine.

Link: https://lkml.kernel.org/r/20230526184139.3.I660e103077dcc23bb29aaf2be09cb234e0495b2d@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:27 -07:00
Douglas Anderson 6426e8d1f2 watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
Right now there is one arch (sparc64) that selects HAVE_NMI_WATCHDOG
without selecting HAVE_HARDLOCKUP_DETECTOR_ARCH.  Because of that one
architecture, we have some special case code in the watchdog core to
handle the fact that watchdog_hardlockup_probe() isn't implemented.

Let's implement watchdog_hardlockup_probe() for sparc64 and get rid of the
special case.

As a side effect of doing this, code inspection tells us that we could fix
a minor bug where the system won't properly realize that NMI watchdogs are
disabled.  Specifically, on powerpc if CONFIG_PPC_WATCHDOG is turned off
the arch might still select CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH which
selects CONFIG_HAVE_NMI_WATCHDOG.  Since CONFIG_PPC_WATCHDOG was off then
nothing will override the "weak" watchdog_hardlockup_probe() and we'll
fallback to looking at CONFIG_HAVE_NMI_WATCHDOG.

Link: https://lkml.kernel.org/r/20230526184139.2.Ic6ebbf307ca0efe91f08ce2c1eb4a037ba6b0700@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:26 -07:00
Douglas Anderson 9ec272c586 watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
Patch series "watchdog: Cleanup / fixes after buddy series v5 reviews".

This patch series attempts to finish resolving the feedback received
from Petr Mladek on the v5 series I posted.

Probably the only thing that wasn't fully as clean as Petr requested was
the Kconfig stuff.  I couldn't find a better way to express it without a
more major overhaul.  In the very least, I renamed "NON_ARCH" to
"PERF_OR_BUDDY" in the hopes that will make it marginally better.

Nothing in this series is terribly critical and even the bugfixes are
small.  However, it does cleanup a few things that were pointed out in
review.


This patch (of 10):

The permissions for the kernel.nmi_watchdog sysctl have always been set at
compile time despite the fact that a watchdog can fail to probe.  Let's
fix this and set the permissions based on whether the hardlockup detector
actually probed.

Link: https://lkml.kernel.org/r/20230527014153.2793931-1-dianders@chromium.org
Link: https://lkml.kernel.org/r/20230526184139.1.I0d75971cc52a7283f495aac0bd5c3041aadc734e@changeid
Fixes: a994a3147e ("watchdog/hardlockup/perf: Implement init time detection of perf")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reported-by: Petr Mladek <pmladek@suse.com>
Closes: https://lore.kernel.org/r/ZHCn4hNxFpY5-9Ki@alley
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:25:26 -07:00
YueHaibing 3efd33b753 kernel: pid_namespace: remove unused set_memfd_noexec_scope()
This inline function is unused, remove it.

Link: https://lkml.kernel.org/r/20230610102858.31488-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:28 -07:00
Ryan Roberts c33c794828 mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper.  This means that by default, the accesses change from a
C dereference to a READ_ONCE().  This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte.  Arch code
is deliberately not converted, as the arch code knows best.  It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
//          COCCI=ptepget.cocci \
//          SPFLAGS="--include-headers" \
//          MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so.  This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot.  The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get().  HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined.  Fix by continuing to do a direct dereference
when MMU=n.  This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:25 -07:00
Catalin Marinas 370645f41e dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned
For direct DMA, if the size is small enough to have originated from a
kmalloc() cache below ARCH_DMA_MINALIGN, check its alignment against
dma_get_cache_alignment() and bounce if necessary.  For larger sizes, it
is the responsibility of the DMA API caller to ensure proper alignment.

At this point, the kmalloc() caches are properly aligned but this will
change in a subsequent patch.

Architectures can opt in by selecting DMA_BOUNCE_UNALIGNED_KMALLOC.

Link: https://lkml.kernel.org/r/20230612153201.554742-15-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:23 -07:00
Robin Murphy cb147bbe22 dma-mapping: name SG DMA flag helpers consistently
sg_is_dma_bus_address() is inconsistent with the naming pattern of its
corresponding setters and its own kerneldoc, so take the majority vote and
rename it sg_dma_is_bus_address() (and fix up the missing underscores in
the kerneldoc too).  This gives us a nice clear pattern where SG DMA flags
are SG_DMA_<NAME>, and the helpers for acting on them are
sg_dma_<action>_<name>().

Link: https://lkml.kernel.org/r/20230612153201.554742-14-catalin.marinas@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
  Link: https://lore.kernel.org/r/fa2eca2862c7ffc41b50337abffb2dfd2864d3ea.1685036694.git.robin.murphy@arm.com
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:22 -07:00
Robin Murphy af2880ec44 scatterlist: add dedicated config for DMA flags
The DMA flags field will be useful for users beyond PCI P2P, so upgrade to
its own dedicated config option.

[catalin.marinas@arm.com: use #ifdef CONFIG_NEED_SG_DMA_FLAGS in scatterlist.h]
[catalin.marinas@arm.com: update PCI_P2PDMA dma_flags comment in scatterlist.h]
Link: https://lkml.kernel.org/r/20230612153201.554742-13-catalin.marinas@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:22 -07:00
Hugh Dickins a92cbb82c8 perf/core: allow pte_offset_map() to fail
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offet_map_lock() may not find a page table: handle appropriately.

[hughd@google.com: __wp_page_copy_user(): don't call update_mmu_tlb() with NULL]
  Link: https://lkml.kernel.org/r/1a4db221-7872-3594-57ce-42369945ec8d@google.com
Link: https://lkml.kernel.org/r/a194441b-63f3-adb6-5964-7ca3171ae7c2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:19 -07:00
Andrii Nakryiko 7f6719f7a8 bpf: Keep BPF_PROG_LOAD permission checks clear of validations
Move out flags validation and license checks out of the permission
checks. They were intermingled, which makes subsequent changes harder.
Clean this up: perform straightforward flag validation upfront, and
fetch and check license later, right where we use it. Also consolidate
capabilities check in one block, right after basic attribute sanity
checks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-5-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko 6c3eba1c5e bpf: Centralize permissions checks for all BPF map types
This allows to do more centralized decisions later on, and generally
makes it very explicit which maps are privileged and which are not
(e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
and easy to verify).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko 22db41226b bpf: Inline map creation logic in map_create() function
Currently find_and_alloc_map() performs two separate functions: some
argument sanity checking and partial map creation workflow hanling.
Neither of those functions are self-sufficient and are augmented by
further checks and initialization logic in the caller (map_create()
function). So unify all the sanity checks, permission checks, and
creation and initialization logic in one linear piece of code in
map_create() instead. This also make it easier to further enhance
permission checks and keep them located in one place.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-3-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko 1d28635abc bpf: Move unprivileged checks into map_create() and bpf_prog_load()
Make each bpf() syscall command a bit more self-contained, making it
easier to further enhance it. We move sysctl_unprivileged_bpf_disabled
handling down to map_create() and bpf_prog_load(), two special commands
in this regard.

Also swap the order of checks, calling bpf_capable() only if
sysctl_unprivileged_bpf_disabled is true, avoiding unnecessary audit
messages.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-2-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Wen Yang a7e282c777 tick/rcu: Fix bogus ratelimit condition
The ratelimit logic in report_idle_softirq() is broken because the
exit condition is always true:

	static int ratelimit;

	if (ratelimit < 10)
		return false;  ---> always returns here

	ratelimit++;           ---> no chance to run

Make it check for >= 10 instead.

Fixes: 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/tencent_5AAA3EEAB42095C9B7740BE62FBF9A67E007@qq.com
2023-06-18 22:41:53 +02:00
Li zeming fc4b4d96f7 alarmtimer: Remove unnecessary (void *) cast
Pointers of type void * do not require a type cast when they are assigned
to a real pointer.

Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230609182059.4509-1-zeming@nfschina.com
2023-06-18 22:41:53 +02:00
Li zeming 986af8dc5a alarmtimer: Remove unnecessary initialization of variable 'ret'
ret is assigned before checked, so it does not need to initialize the
variable

Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230609182856.4660-1-zeming@nfschina.com
2023-06-18 22:41:53 +02:00
Lukas Bulwahn b9a40f24d8 posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS
Commit c78f261e5dcb ("posix-timers: Clarify posix_timer_fn() comments")
turns an ifdef CONFIG_HIGH_RES_TIMERS into an conditional on
"IS_ENABLED(CONFIG_HIGHRES_TIMERS)"; note that the new conditional refers
to "HIGHRES_TIMERS" not "HIGH_RES_TIMERS" as before.

Fix this typo introduced in that refactoring.

Fixes: c78f261e5dcb ("posix-timers: Clarify posix_timer_fn() comments")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230609094643.26253-1-lukas.bulwahn@gmail.com
2023-06-18 22:41:53 +02:00
Thomas Gleixner b96ce4931f posix-timers: Polish coding style in a few places
Make it consistent with the TIP tree documentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.888493625@linutronix.de
2023-06-18 22:41:53 +02:00
Thomas Gleixner 200dbd6d14 posix-timers: Remove pointless comments
Documenting the obvious is just consuming space for no value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.832240451@linutronix.de
2023-06-18 22:41:52 +02:00
Thomas Gleixner 84999b8bdb posix-timers: Clarify posix_timer_fn() comments
Make the issues vs. SIG_IGN understandable and remove the 15 years old
promise that a proper solution is already on the horizon.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/874jnrdmrq.ffs@tglx
2023-06-18 22:41:52 +02:00
Thomas Gleixner 02972d7955 posix-timers: Clarify posix_timer_rearm() comment
Yet another incomprehensible piece of art.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.724863461@linutronix.de
2023-06-18 22:41:52 +02:00
Thomas Gleixner c575689d66 posix-timers: Comment SIGEV_THREAD_ID properly
Replace the word salad.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.672220780@linutronix.de
2023-06-18 22:41:51 +02:00
Thomas Gleixner 52f090b164 posix-timers: Add proper comments in do_timer_create()
The comment about timer lifetime at the end of the function is misplaced
and uncomprehensible.

Make it understandable and put it at the right place. Add a new comment
about the visibility of the new timer ID to user space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.619897296@linutronix.de
2023-06-18 22:41:51 +02:00
Thomas Gleixner 640fe745d7 posix-timers: Document nanosleep() details
The descriptions for common_nsleep() is wrong and common_nsleep_timens()
lacks any form of comment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.567072835@linutronix.de
2023-06-18 22:41:51 +02:00
Thomas Gleixner 3561fcb402 posix-timers: Document sys_clock_settime() permissions in place
The documentation of sys_clock_settime() permissions is at a random place
and mostly word salad.

Remove it and add a concise comment into sys_clock_settime().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.514700292@linutronix.de
2023-06-18 22:41:51 +02:00
Thomas Gleixner 65cade468d posix-timers: Document sys_clock_getoverrun()
Document the syscall in detail and with coherent sentences.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.462051641@linutronix.de
2023-06-18 22:41:50 +02:00
Thomas Gleixner a86e928433 posix-timers: Document common_clock_get() correctly
Replace another confusing and inaccurate set of comments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.409169321@linutronix.de
2023-06-18 22:41:50 +02:00
Thomas Gleixner 01679b5db7 posix-timers: Document sys_clock_getres() correctly
The decades old comment about Posix clock resolution is confusing at best.

Remove it and add a proper explanation to sys_clock_getres().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.356427330@linutronix.de
2023-06-18 22:41:50 +02:00
Thomas Gleixner 8cc96ca2c7 posix-timers: Split release_posix_timers()
release_posix_timers() is called for cleaning up both hashed and unhashed
timers. The cases are differentiated by an argument and the usage is
hideous.

Seperate the actual free path out and use it for unhashed timers. Provide a
function for hashed timers.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.301432503@linutronix.de
2023-06-18 22:41:50 +02:00
Thomas Gleixner 11fbe6cd41 posix-timers: Remove pointless irqsafe from hash_lock
All usage of hash_lock is in thread context. No point in using
spin_lock_irqsave()/irqrestore() for a single usage site.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.249063953@linutronix.de
2023-06-18 22:41:49 +02:00
Thomas Gleixner 72786ff23d posix-timers: Set k_itimer:: It_signal to NULL on exit()
Technically it's not required to set k_itimer::it_signal to NULL on exit()
because there is no other thread anymore which could lookup the timer
concurrently.

Set it to NULL for consistency sake and add a comment to that effect.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.196462644@linutronix.de
2023-06-18 22:41:49 +02:00
Thomas Gleixner 028cf5eaa1 posix-timers: Annotate concurrent access to k_itimer:: It_signal
k_itimer::it_signal is read lockless in the RCU protected hash lookup, but
it can be written concurrently in the timer_create() and timer_delete()
path. Annotate these places with READ_ONCE() and WRITE_ONCE()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.143596887@linutronix.de
2023-06-18 22:41:49 +02:00
Thomas Gleixner ae88967d71 posix-timers: Add comments about timer lookup
Document how the timer ID validation in the hash table works.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.091081515@linutronix.de
2023-06-18 22:41:49 +02:00
Thomas Gleixner 8d44b958a1 posix-timers: Cleanup comments about timer ID tracking
Describe the hash table properly and remove the IDR leftover comments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183313.038444551@linutronix.de
2023-06-18 22:41:48 +02:00
Thomas Gleixner 7d99090266 posix-timers: Clarify timer_wait_running() comment
Explain it better and add the CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y aspect
for completeness.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230425183312.985681995@linutronix.de
2023-06-18 22:41:48 +02:00
Thomas Gleixner 8ce8849dd1 posix-timers: Ensure timer ID search-loop limit is valid
posix_timer_add() tries to allocate a posix timer ID by starting from the
cached ID which was stored by the last successful allocation.

This is done in a loop searching the ID space for a free slot one by
one. The loop has to terminate when the search wrapped around to the
starting point.

But that's racy vs. establishing the starting point. That is read out
lockless, which leads to the following problem:

CPU0	  	      	     	   CPU1
posix_timer_add()
  start = sig->posix_timer_id;
  lock(hash_lock);
  ...				   posix_timer_add()
  if (++sig->posix_timer_id < 0)
      			             start = sig->posix_timer_id;
     sig->posix_timer_id = 0;

So CPU1 can observe a negative start value, i.e. -1, and the loop break
never happens because the condition can never be true:

  if (sig->posix_timer_id == start)
     break;

While this is unlikely to ever turn into an endless loop as the ID space is
huge (INT_MAX), the racy read of the start value caught the attention of
KCSAN and Dmitry unearthed that incorrectness.

Rewrite it so that all id operations are under the hash lock.

Reported-by: syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx
2023-06-18 22:41:48 +02:00
Thomas Gleixner 9d9e522010 posix-timers: Prevent RT livelock in itimer_delete()
itimer_delete() has a retry loop when the timer is concurrently expired. On
non-RT kernels this just spin-waits until the timer callback has completed,
except for posix CPU timers which have HAVE_POSIX_CPU_TIMERS_TASK_WORK
enabled.

In that case and on RT kernels the existing task could live lock when
preempting the task which does the timer delivery.

Replace spin_unlock() with an invocation of timer_wait_running() to handle
it the same way as the other retry loops in the posix timer code.

Fixes: ec8f954a40 ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87v8g7c50d.ffs@tglx
2023-06-18 22:40:42 +02:00
Arnd Bergmann 8091f56ee4 irqdomain: Include internals.h for function prototypes
irq_domain_debugfs_init() is defined in irqdomain.c, but the
declaration is in a header that is not included here:

kernel/irq/irqdomain.c:1965:13: error: no previous prototype for 'irq_domain_debugfs_init' [-Werror=missing-prototypes]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230516200432.554240-1-arnd@kernel.org
2023-06-17 07:20:45 +01:00
Hao Jia ebb83d84e4 sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
After commit 8ad075c2eb ("sched: Async unthrottling for cfs
bandwidth"), we may update the rq clock multiple times in the loop of
__cfsb_csd_unthrottle().

A prior (although less common) instance of this problem exists in
unthrottle_offline_cfs_rqs().

Cure both by ensuring update_rq_clock() is called before the loop and
setting RQCF_ACT_SKIP during the loop, to supress further updates.
The alternative would be pulling update_rq_clock() out of
unthrottle_cfs_rq(), but that gives an even bigger mess.

Fixes: 8ad075c2eb ("sched: Async unthrottling for cfs bandwidth")
Reviewed-By: Ben Segall <bsegall@google.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-4-jiahao.os@bytedance.com
2023-06-16 22:08:13 +02:00
Hao Jia 96500560f0 sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
There is a double update_rq_clock() invocation:

  __balance_push_cpu_stop()
    update_rq_clock()
    __migrate_task()
      update_rq_clock()

Sadly select_fallback_rq() also needs update_rq_clock() for
__do_set_cpus_allowed(), it is not possible to remove the update from
__balance_push_cpu_stop(). So remove it from __migrate_task() and
ensure all callers of this function call update_rq_clock() prior to
calling it.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com
2023-06-16 22:08:12 +02:00
Hao Jia cab3ecaed5 sched/core: Fixed missing rq clock update before calling set_rq_offline()
When using a cpufreq governor that uses
cpufreq_add_update_util_hook(), it is possible to trigger a missing
update_rq_clock() warning for the CPU hotplug path:

  rq_attach_root()
    set_rq_offline()
      rq_offline_rt()
	__disable_runtime()
	  sched_rt_rq_enqueue()
	    enqueue_top_rt_rq()
	      cpufreq_update_util()
		data->func(data, rq_clock(rq), flags)

Move update_rq_clock() from sched_cpu_deactivate() (one of it's
callers) into set_rq_offline() such that it covers all
set_rq_offline() usage.

Additionally change rq_attach_root() to use rq_lock_irqsave() so that
it will properly manage the runqueue clock flags.

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com
2023-06-16 22:08:12 +02:00
Vineeth Pillai 6a9d623aad sched/deadline: Fix bandwidth reclaim equation in GRUB
According to the GRUB[1] rule, the runtime is depreciated as:
  "dq = -max{u, (1 - Uinact - Uextra)} dt" (1)

To guarantee that deadline tasks doesn't starve lower class tasks,
we do not allocate the full bandwidth of the cpu to deadline tasks.
Maximum bandwidth usable by deadline tasks is denoted by "Umax".
Considering Umax, equation (1) becomes:
  "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2)

Current implementation has a minor bug in equation (2), which this
patch fixes.

The reclamation logic is verified by a sample program which creates
multiple deadline threads and observing their utilization. The tests
were run on an isolated cpu(isolcpus=3) on a 4 cpu system.

Tests on 6.3.0
==============

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69

RUN 3: 2 tasks
  Task 1: runtime=1ms, deadline=period=10ms
  Task 2: runtime=1ms, deadline=period=100ms
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23

As seen above, the reclamation doesn't reclaim the maximum allowed
bandwidth and as the bandwidth of tasks gets smaller, the reclaimed
bandwidth also comes down.

Tests with this patch applied
=============================

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.19
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.16

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.27
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.21

RUN 3: 2 tasks
  Task 1: runtime=1ms, deadline=period=10ms
  Task 2: runtime=1ms, deadline=period=100ms
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.64
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.66
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.45
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.73

Running tasks on all cpus allowing for migration also showed that
the utilization is reclaimed to the maximum. Running 10 tasks on
3 cpus SCHED_FLAG_RECLAIM - top shows:
%Cpu0  : 94.6 us,  0.0 sy,  0.0 ni,  5.4 id,  0.0 wa
%Cpu1  : 95.2 us,  0.0 sy,  0.0 ni,  4.8 id,  0.0 wa
%Cpu2  : 95.8 us,  0.0 sy,  0.0 ni,  4.2 id,  0.0 wa

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
     (2015). Parallel and sequential reclaiming in multicore
     real-time global scheduling.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230530135526.2385378-1-vineeth@bitbyteword.org
2023-06-16 22:08:11 +02:00
Leonardo Bras bf5a8c26ad trace,smp: Add tracepoints for scheduling remotelly called functions
Add a tracepoint for when a CSD is queued to a remote CPU's
call_single_queue. This allows finding exactly which CPU queued a given CSD
when looking at a csd_function_{entry,exit} event, and also enables us to
accurately measure IPI delivery time with e.g. a synthetic event:

  $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs' >\
      /sys/kernel/tracing/events/smp/csd_queue_cpu/trigger
  $ echo 'csd_latency unsigned int dst_cpu; unsigned long csd; u64 time' >\
      /sys/kernel/tracing/synthetic_events
  $ echo \
  'hist:keys=common_cpu,csd.hex:'\
  'time=common_timestamp.usecs-$ts:'\
  'onmatch(smp.csd_queue_cpu).trace(csd_latency,common_cpu,csd,$time)' >\
      /sys/kernel/tracing/events/smp/csd_function_entry/trigger

  $ trace-cmd record -e 'synthetic:csd_latency' hackbench
  $ trace-cmd report
  <...>-467   [001]    21.824263: csd_queue_cpu:        cpu=0 callsite=try_to_wake_up+0x2ea func=sched_ttwu_pending csd=0xffff8880076148b8
  <...>-467   [001]    21.824280: ipi_send_cpu:         cpu=0 callsite=try_to_wake_up+0x2ea callback=generic_smp_call_function_single_interrupt+0x0
  <...>-489   [000]    21.824299: csd_function_entry:   func=sched_ttwu_pending csd=0xffff8880076148b8
  <...>-489   [000]    21.824320: csd_latency:          dst_cpu=0, csd=18446612682193848504, time=36

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230615065944.188876-7-leobras@redhat.com
2023-06-16 22:08:09 +02:00
Leonardo Bras 949fa3f11c trace,smp: Add tracepoints around remotelly called functions
The recently added ipi_send_{cpu,cpumask} tracepoints allow finding sources
of IPIs targeting CPUs running latency-sensitive applications.

For NOHZ_FULL CPUs, all IPIs are interference, and those tracepoints are
sufficient to find them and work on getting rid of them. In some setups
however, not *all* IPIs are to be suppressed, but long-running IPI
callbacks can still be problematic.

Add a pair of tracepoints to mark the start and end of processing a CSD IPI
callback, similar to what exists for softirq, workqueue or timer callbacks.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230615065944.188876-5-leobras@redhat.com
2023-06-16 22:08:09 +02:00
Thomas Gleixner 13bb06f8dd tick/common: Align tick period during sched_timer setup
The tick period is aligned very early while the first clock_event_device is
registered. At that point the system runs in periodic mode and switches
later to one-shot mode if possible.

The next wake-up event is programmed based on the aligned value
(tick_next_period) but the delta value, that is used to program the
clock_event_device, is computed based on ktime_get().

With the subtracted offset, the device fires earlier than the exact time
frame. With a large enough offset the system programs the timer for the
next wake-up and the remaining time left is too small to make any boot
progress. The system hangs.

Move the alignment later to the setup of tick_sched timer. At this point
the system switches to oneshot mode and a high resolution clocksource is
available. At this point it is safe to align tick_next_period because
ktime_get() will now return accurate (not jiffies based) time.

[bigeasy: Patch description + testing].

Fixes: e9523a0d81 ("tick/common: Align tick period with the HZ tick.")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com>
Suggested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net
Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com
Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de
2023-06-16 20:45:28 +02:00
Sebastian Andrzej Siewior ab5d47bd41 bpf: Remove in_atomic() from bpf_link_put().
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
Everyone else should use workqueue by default to remain safe in the
future (while auditing the code, every caller was preemptible except for
the RCU case).

Let bpf_link_put() use the worker unconditionally. Add
bpf_link_put_direct() which will directly free the resources and is used
by close() and from within __sys_bpf().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230614083430.oENawF8f@linutronix.de
2023-06-16 09:28:02 -07:00
Arve Hjønnevåg ef73d6a4ef sched/wait: Fix a kthread_park race with wait_woken()
kthread_park and wait_woken have a similar race that
kthread_stop and wait_woken used to have before it was fixed in
commit cb6538e740 ("sched/wait: Fix a kthread race with
wait_woken()"). Extend that fix to also cover kthread_park.

[jstultz: Made changes suggested by Peter to optimize
 memory loads]

Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230602212350.535358-1-jstultz@google.com
2023-06-16 17:08:01 +02:00
Miaohe Lin 0cce0fde49 sched/topology: Mark set_sched_topology() __init
All callers of set_sched_topology() are within __init section. Mark
it __init too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
2023-06-16 17:08:01 +02:00
Tom Rix a707df30c9 sched/fair: Rename variable cpu_util eff_util
cppcheck reports
kernel/sched/fair.c:7436:17: style: Local variable 'cpu_util' shadows outer function [shadowFunction]
  unsigned long cpu_util;
                ^

Clean this up by renaming the variable to eff_util

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230611122535.183654-1-trix@redhat.com
2023-06-16 17:08:01 +02:00
James Gowans 9c15eeb536 genirq: Allow fasteoi handler to resend interrupts on concurrent handling
There is a class of interrupt controllers out there that, once they
have signalled a given interrupt number, will still signal incoming
instances of the *same* interrupt despite the original interrupt
not having been EOIed yet.

As long as the new interrupt reaches the *same* CPU, nothing bad
happens, as that CPU still has its interrupts globally disabled,
and we will only take the new interrupt once the interrupt has
been EOIed.

However, things become more "interesting" if an affinity change comes
in while the interrupt is being handled. More specifically, while
the per-irq lock is being dropped. This results in the affinity change
taking place immediately. At this point, there is nothing that prevents
the interrupt from firing on the new target CPU. We end-up with the
interrupt running concurrently on two CPUs, which isn't a good thing.

And that's where things become worse: the new CPU notices that the
interrupt handling is in progress (irq_may_run() return false), and
*drops the interrupt on the floor*.

The whole race looks like this:

           CPU 0             |          CPU 1
-----------------------------|-----------------------------
interrupt start              |
  handle_fasteoi_irq         | set_affinity(CPU 1)
    handler                  |
    ...                      | interrupt start
    ...                      |   handle_fasteoi_irq -> early out
  handle_fasteoi_irq return  | interrupt end
interrupt end                |

If the interrupt was an edge, too bad. The interrupt is lost, and
the system will eventually die one way or another. Not great.

A way to avoid this situation is to detect this problem at the point
we handle the interrupt on the new target. Instead of dropping the
interrupt, use the resend mechanism to force it to be replayed.

Also, in order to limit the impact of this workaround to the pathetic
architectures that require it, gate it behind a new irq flag aptly
named IRQD_RESEND_WHEN_IN_PROGRESS.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: KarimAllah Raslan <karahmed@amazon.com>
Cc: Yipeng Zou <zouyipeng@huawei.com>
Cc: Zhang Jianhua <chris.zjh@huawei.com>
[maz: reworded commit mesage]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230608120021.3273400-3-jgowans@amazon.com
2023-06-16 12:22:35 +01:00
James Gowans 7cc148a32f genirq: Expand doc for PENDING and REPLAY flags
Adding a bit more info about what the flags are used for may help future
code readers.

Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230608120021.3273400-2-jgowans@amazon.com
2023-06-16 12:22:05 +01:00
Jakub Kicinski 173780ff18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/mlx5/driver.h
  617f5db1a6 ("RDMA/mlx5: Fix affinity assignment")
  dc13180824 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  47867f0a7e ("selftests: mptcp: join: skip check if MIB counter not supported")
  425ba80312 ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
  45b1a1227a ("mptcp: introduces more address related mibs")
  0639fa230a ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:19:41 -07:00
Azeem Shaikh 33457938a0 kallsyms: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230614010354.1026096-1-azeemshaikh38@gmail.com
2023-06-14 12:27:38 -07:00
Beau Belgrave a65442edb4 tracing/user_events: Add auto cleanup and future persist flag
Currently user events need to be manually deleted via the delete IOCTL
call or via the dynamic_events file. Most operators and processes wish
to have these events auto cleanup when they are no longer used by
anything to prevent them piling without manual maintenance. However,
some operators may not want this, such as pre-registering events via the
dynamic_events tracefs file.

Update user_event_put() to attempt an auto delete of the event if it's
the last reference. The auto delete must run in a work queue to ensure
proper behavior of class->reg() invocations that don't expect the call
to go away from underneath them during the unregister. Add work_struct
to user_event struct to ensure we can do this reliably.

Add a persist flag, that is not yet exposed, to ensure we can toggle
between auto-cleanup and leaving the events existing in the future. When
a non-zero flag is seen during register, return -EINVAL to ensure ABI
is clear for the user processes while we work out the best approach for
persistent events.

Link: https://lkml.kernel.org/r/20230614163336.5797-4-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 13:43:27 -04:00
Beau Belgrave f0dbf6fd0b tracing/user_events: Track refcount consistently via put/get
Various parts of the code today track user_event's refcnt field directly
via a refcount_add/dec. This makes it hard to modify the behavior of the
last reference decrement in all code paths consistently. For example, in
the future we will auto-delete events upon the last reference going
away. This last reference could happen in many places, but we want it to
be consistently handled.

Add user_event_get() and user_event_put() for the add/dec. Update all
places where direct refcounts are being used to utilize these new
functions. In each location pass if event_mutex is locked or not. This
allows us to drop events automatically in future patches clearly. Ensure
when caller states the lock is held, it really is (or is not) held.

Link: https://lkml.kernel.org/r/20230614163336.5797-3-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 13:43:26 -04:00
Beau Belgrave b08d725805 tracing/user_events: Store register flags on events
Currently we don't have any available flags for user processes to use to
indicate options for user_events. We will soon have a flag to indicate
the event should or should not auto-delete once it's not being used by
anyone.

Add a reg_flags field to user_events and parameters to existing
functions to allow for this in future patches.

Link: https://lkml.kernel.org/r/20230614163336.5797-2-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 13:43:26 -04:00
Beau Belgrave ed0e0ae0c9 tracing/user_events: Remove user_ns walk for groups
During discussions it was suggested that user_ns is not a good place to
try to attach a tracing namespace. The current code has stubs to enable
that work that are very likely to change and incur a performance cost.

Remove the user_ns walk when creating a group and determining the system
name to use, since it's unlikely user_ns will be used in the future.

Link: https://lore.kernel.org/all/20230601-urenkel-holzofen-cd9403b9cadd@brauner/
Link: https://lore.kernel.org/linux-trace-kernel/20230601224928.301-1-beaub@linux.microsoft.com

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 12:41:53 -04:00
sunliming 6f05dcabe5 tracing/user_events: Fix the incorrect trace record for empty arguments events
The user_events support events that has empty arguments. But the trace event
is discarded and not really committed when the arguments is empty. Fix this
by not attempting to copy in zero-length data.

Link: https://lkml.kernel.org/r/20230606062027.1008398-2-sunliming@kylinos.cn

Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 12:41:52 -04:00
sunliming e70bb54d7a tracing: Modify print_fields() for fields output order
Now the print_fields() print trace event fields in reverse order. Modify
it to the positive sequence.

Example outputs for a user event:
	test0 u32 count1; u32 count2

Output before:
	example-2547    [000] .....   325.666387: test0: count2=0x2 (2) count1=0x1 (1)

Output after:
	example-2742    [002] .....   429.769370: test0: count1=0x1 (1) count2=0x2 (2)

Link: https://lore.kernel.org/linux-trace-kernel/20230525085232.5096-1-sunliming@kylinos.cn

Fixes: 80a76994b2 ("tracing: Add "fields" option to show raw trace event fields")
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 12:41:52 -04:00
sunliming cfac4ed727 tracing/user_events: Handle matching arguments that is null from dyn_events
When A registering user event from dyn_events has no argments, it will pass the
matching check, regardless of whether there is a user event with the same name
and arguments. Add the matching check when the arguments of registering user
event is null.

Link: https://lore.kernel.org/linux-trace-kernel/20230529065110.303440-1-sunliming@kylinos.cn

Signed-off-by: sunliming <sunliming@kylinos.cn>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 12:41:52 -04:00
sunliming ba470eebc2 tracing/user_events: Prevent same name but different args event
User processes register name_args for events. If the same name but different
args event are registered. The trace outputs of second event are printed
as the first event. This is incorrect.

Return EADDRINUSE back to the user process if the same name but different args
event has being registered.

Link: https://lore.kernel.org/linux-trace-kernel/20230529032100.286534-1-sunliming@kylinos.cn

Signed-off-by: sunliming <sunliming@kylinos.cn>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-14 12:41:52 -04:00
Eduard Zingerman 1ffc85d929 bpf: Verify scalar ids mapping in regsafe() using check_ids()
Make sure that the following unsafe example is rejected by verifier:

1: r9 = ... some pointer with range X ...
2: r6 = ... unbound scalar ID=a ...
3: r7 = ... unbound scalar ID=b ...
4: if (r6 > r7) goto +1
5: r6 = r7
6: if (r6 > X) goto ...
--- checkpoint ---
7: r9 += r7
8: *(u64 *)r9 = Y

This example is unsafe because not all execution paths verify r7 range.
Because of the jump at (4) the verifier would arrive at (6) in two states:
I.  r6{.id=b}, r7{.id=b} via path 1-6;
II. r6{.id=a}, r7{.id=b} via path 1-4, 6.

Currently regsafe() does not call check_ids() for scalar registers,
thus from POV of regsafe() states (I) and (II) are identical. If the
path 1-6 is taken by verifier first, and checkpoint is created at (6)
the path [1-4, 6] would be considered safe.

Changes in this commit:
- check_ids() is modified to disallow mapping multiple old_id to the
  same cur_id.
- check_scalar_ids() is added, unlike check_ids() it treats ID zero as
  a unique scalar ID.
- check_scalar_ids() needs to generate temporary unique IDs, field
  'tmp_id_gen' is added to bpf_verifier_env::idmap_scratch to
  facilitate this.
- regsafe() is updated to:
  - use check_scalar_ids() for precise scalar registers.
  - compare scalar registers using memcmp only for explore_alu_limits
    branch. This simplifies control flow for scalar case, and has no
    measurable performance impact.
- check_alu_op() is updated to avoid generating bpf_reg_state::id for
  constant scalar values when processing BPF_MOV. ID is needed to
  propagate range information for identical values, but there is
  nothing to propagate for constants.

Fixes: 75748837b7 ("bpf: Propagate scalar ranges through register assignments.")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-4-eddyz87@gmail.com
2023-06-13 15:15:08 -07:00
Eduard Zingerman 904e6ddf41 bpf: Use scalar ids in mark_chain_precision()
Change mark_chain_precision() to track precision in situations
like below:

    r2 = unknown value
    ...
  --- state #0 ---
    ...
    r1 = r2                 // r1 and r2 now share the same ID
    ...
  --- state #1 {r1.id = A, r2.id = A} ---
    ...
    if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1
    ...
  --- state #2 {r1.id = A, r2.id = A} ---
    r3 = r10
    r3 += r1                // need to mark both r1 and r2

At the beginning of the processing of each state, ensure that if a
register with a scalar ID is marked as precise, all registers sharing
this ID are also marked as precise.

This property would be used by a follow-up change in regsafe().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-2-eddyz87@gmail.com
2023-06-13 15:14:27 -07:00
Krister Johansen 0108a4e9f3 bpf: ensure main program has an extable
When subprograms are in use, the main program is not jit'd after the
subprograms because jit_subprogs sets a value for prog->bpf_func upon
success.  Subsequent calls to the JIT are bypassed when this value is
non-NULL.  This leads to a situation where the main program and its
func[0] counterpart are both in the bpf kallsyms tree, but only func[0]
has an extable.  Extables are only created during JIT.  Now there are
two nearly identical program ksym entries in the tree, but only one has
an extable.  Depending upon how the entries are placed, there's a chance
that a fault will call search_extable on the aux with the NULL entry.

Since jit_subprogs already copies state from func[0] to the main
program, include the extable pointer in this state duplication.
Additionally, ensure that the copy of the main program in func[0] is not
added to the bpf_prog_kallsyms table. Instead, let the main program get
added later in bpf_prog_load().  This ensures there is only a single
copy of the main program in the kallsyms table, and that its tag matches
the tag observed by tooling like bpftool.

Cc: stable@vger.kernel.org
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-13 15:13:52 -07:00
Linus Torvalds fb054096ae 19 hotfixes. 14 are cc:stable and the remainder address issues which were
introduced during this -rc cycle or which were considered inappropriate
 for a backport.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA
 jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT
 iLDMosnuh+1KLTaKNGL5iak3NAxnxQw=
 =mTdI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. 14 are cc:stable and the remainder address issues which
  were introduced during this development cycle or which were considered
  inappropriate for a backport"

* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zswap: do not shrink if cgroup may not zswap
  page cache: fix page_cache_next/prev_miss off by one
  ocfs2: check new file size on fallocate call
  mailmap: add entry for John Keeping
  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
  epoll: ep_autoremove_wake_function should use list_del_init_careful
  mm/gup_test: fix ioctl fail for compat task
  nilfs2: reject devices with insufficient block count
  ocfs2: fix use-after-free when unmounting read-only filesystem
  lib/test_vmalloc.c: avoid garbage in page array
  nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
  riscv/purgatory: remove PGO flags
  powerpc/purgatory: remove PGO flags
  x86/purgatory: remove PGO flags
  kexec: support purgatories with .text.hot sections
  mm/uffd: allow vma to merge as much as possible
  mm/uffd: fix vma operation where start addr cuts part of vma
  radix-tree: move declarations to header
  nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12 16:14:34 -07:00
David Vernet f983be9173 bpf: Replace bpf_cpumask_any* with bpf_cpumask_any_distribute*
We currently export the bpf_cpumask_any() and bpf_cpumask_any_and()
kfuncs. Intuitively, one would expect these to choose any CPU in the
cpumask, but what they actually do is alias to cpumask_first() and
cpmkas_first_and().

This is useless given that we already export bpf_cpumask_first() and
bpf_cpumask_first_and(), so this patch replaces them with kfuncs that
call cpumask_any_distribute() and cpumask_any_and_distribute(), which
actually choose any CPU from the cpumask (or the AND of two cpumasks for
the latter).

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230610035053.117605-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12 15:09:11 -07:00
David Vernet 5ba3a7a851 bpf: Add bpf_cpumask_first_and() kfunc
We currently provide bpf_cpumask_first(), bpf_cpumask_any(), and
bpf_cpumask_any_and() kfuncs. bpf_cpumask_any() and
bpf_cpumask_any_and() are confusing misnomers in that they actually just
call cpumask_first() and cpumask_first_and() respectively.

We'll replace them with bpf_cpumask_any_distribute() and
bpf_cpumask_any_distribute_and() kfuncs in a subsequent patch, so let's
ensure feature parity by adding a bpf_cpumask_first_and() kfunc to
account for bpf_cpumask_any_and() being removed.

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230610035053.117605-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12 15:09:11 -07:00
Ricardo Ribalda 8652d44f46 kexec: support purgatories with .text.hot sections
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7.

When upreving llvm I realised that kexec stopped working on my test
platform.

The reason seems to be that due to PGO there are multiple .text sections
on the purgatory, and kexec does not supports that.


This patch (of 4):

Clang16 links the purgatory text in two sections when PGO is in use:

  [ 1] .text             PROGBITS         0000000000000000  00000040
       00000000000011a1  0000000000000000  AX       0     0     16
  [ 2] .rela.text        RELA             0000000000000000  00003498
       0000000000000648  0000000000000018   I      24     1     8
  ...
  [17] .text.hot.        PROGBITS         0000000000000000  00003220
       000000000000020b  0000000000000000  AX       0     0     1
  [18] .rela.text.hot.   RELA             0000000000000000  00004428
       0000000000000078  0000000000000018   I      24    17     8

And both of them have their range [sh_addr ... sh_addr+sh_size] on the
area pointed by `e_entry`.

This causes that image->start is calculated twice, once for .text and
another time for .text.hot. The second calculation leaves image->start
in a random location.

Because of this, the system crashes immediately after:

kexec_core: Starting new kernel

Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org
Fixes: 930457057a ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Philipp Rudo <prudo@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Mario Limonciello cdb8c100d8 include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume
All uses in the kernel are currently already oriented around
suspend/resume. As some other parts of the kernel may also use these
messages in functions that could also be used outside of
suspend/resume, only enable in suspend/resume path.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-06-12 19:56:08 +02:00
Xiu Jianfeng 6f363f5aa8 cgroup: Do not corrupt task iteration when rebinding subsystem
We found a refcount UAF bug as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
 refcount_warn_saturate+0xa0/0x148
 __refcount_add.constprop.0+0x5c/0x80
 css_task_iter_advance_css_set+0xd8/0x210
 css_task_iter_advance+0xa8/0x120
 css_task_iter_next+0x94/0x158
 update_tasks_root_domain+0x58/0x98
 rebuild_root_domains+0xa0/0x1b0
 rebuild_sched_domains_locked+0x144/0x188
 cpuset_hotplug_workfn+0x138/0x5a0
 process_one_work+0x1e8/0x448
 worker_thread+0x228/0x3e0
 kthread+0xe0/0xf0
 ret_from_fork+0x10/0x20

then a kernel panic will be triggered as below:

Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
 cgroup_apply_control_disable+0xa4/0x16c
 rebind_subsystems+0x224/0x590
 cgroup_destroy_root+0x64/0x2e0
 css_free_rwork_fn+0x198/0x2a0
 process_one_work+0x1d4/0x4bc
 worker_thread+0x158/0x410
 kthread+0x108/0x13c
 ret_from_fork+0x10/0x18

The race that cause this bug can be shown as below:

(hotplug cpu)                | (umount cpuset)
mutex_lock(&cpuset_mutex)    | mutex_lock(&cgroup_mutex)
cpuset_hotplug_workfn        |
 rebuild_root_domains        |  rebind_subsystems
  update_tasks_root_domain   |   spin_lock_irq(&css_set_lock)
   css_task_iter_start       |    list_move_tail(&cset->e_cset_node[ss->id]
   while(css_task_iter_next) |                  &dcgrp->e_csets[ss->id]);
   css_task_iter_end         |   spin_unlock_irq(&css_set_lock)
mutex_unlock(&cpuset_mutex)  | mutex_unlock(&cgroup_mutex)

Inside css_task_iter_start/next/end, css_set_lock is hold and then
released, so when iterating task(left side), the css_set may be moved to
another list(right side), then it->cset_head points to the old list head
and it->cset_pos->next points to the head node of new list, which can't
be used as struct css_set.

To fix this issue, switch from all css_sets to only scgrp's css_sets to
patch in-flight iterators to preserve correct iteration, and then
update it->cset_head as well.

Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://www.spinics.net/lists/cgroups/msg37935.html
Suggested-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/all/20230526114139.70274-1-xiujianfeng@huaweicloud.com/
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Fixes: 2d8f243a5e ("cgroup: implement cgroup->e_csets[]")
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12 07:21:57 -10:00
Miaohe Lin d16b3af466 cgroup: remove unused task_cgroup_path()
task_cgroup_path() is not used anymore. So remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12 07:05:56 -10:00
Arnd Bergmann ba49f97688 bpf: Hide unused bpf_patch_call_args
This function is only used when CONFIG_BPF_JIT_ALWAYS_ON is disabled, but
CONFIG_BPF_SYSCALL is enabled. When both are turned off, the prototype is
missing but the unused function is still compiled, as seen from this W=1
warning:

  [...]
  kernel/bpf/core.c:2075:6: error: no previous prototype for 'bpf_patch_call_args' [-Werror=missing-prototypes]
  [...]

Add a matching #ifdef for the definition to leave it out.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230602135128.1498362-1-arnd@kernel.org
2023-06-12 19:00:08 +02:00
Miaohe Lin 5273ee254e cgroup/cpuset: remove unneeded header files
Remove some unnecessary header files. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12 06:52:04 -10:00
Tetsuo Handa f0cc749254 cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
syzbot is again reporting circular locking dependency between
cpu_hotplug_lock and freezer_mutex. Do like what we did with
commit 57dcd64c7e ("cgroup,freezer: hold cpu_hotplug_lock
before freezer_mutex").

Reported-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=2ab700fe1829880a2ec6
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com>
Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12 06:39:48 -10:00
Christoph Hellwig 05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig 2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig c889d0793d swsusp: don't pass a stack address to blkdev_get_by_path
holder is just an on-stack pointer that can easily be reused by other calls,
replace it with a static variable that doesn't change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Arnd Bergmann a94181ec06 syscalls: add sys_ni_posix_timers prototype
The sys_ni_posix_timers() definition causes a warning when the declaration
is missing, so this needs to be added along with the normal syscalls,
outside of the #ifdef.

kernel/time/posix-stubs.c:26:17: error: no previous prototype for 'sys_ni_posix_timers' [-Werror=missing-prototypes]

Link: https://lkml.kernel.org/r/20230607142925.3126422-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:25 -07:00
Zhen Lei 16c6006af4 kexec: enable kexec_crash_size to support two crash kernel regions
The crashk_low_res should be considered by /sys/kernel/kexec_crash_size
to support two crash kernel regions shrinking if existing.

While doing it, crashk_low_res will only be shrunk when the entire
crashk_res is empty; and if the crashk_res is empty and crahk_low_res
is not, change crashk_low_res to be crashk_res.

[bhe@redhat.com: redo changelog]
Link: https://lkml.kernel.org/r/20230527123439.772-7-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:24 -07:00
Zhen Lei 5b7bfb32cb kexec: add helper __crash_shrink_memory()
No functional change, in preparation for the next patch so that it is
easier to review.

[akpm@linux-foundation.org: make  __crash_shrink_memory() static]
  Link: https://lore.kernel.org/oe-kbuild-all/202305280717.Pw06aLkz-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230527123439.772-6-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:24 -07:00
Zhen Lei 8a7db7790a kexec: improve the readability of crash_shrink_memory()
The major adjustments are:
1. end = start + new_size.
   The 'end' here is not an accurate representation, because it is not the
   new end of crashk_res, but the start of ram_res, difference 1. So
   eliminate it and replace it with ram_res->start.
2. Use 'ram_res->start' and 'ram_res->end' as arguments to
   crash_free_reserved_phys_range() to indicate that the memory covered by
   'ram_res' is released from the crashk. And keep it close to
   insert_resource().
3. Replace 'if (start == end)' with 'if (!new_size)', clear indication that
   all crashk memory will be shrunken.

No functional change.

Link: https://lkml.kernel.org/r/20230527123439.772-5-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:24 -07:00
Zhen Lei f7f567b95b kexec: clear crashk_res if all its memory has been released
If the resource of crashk_res has been released, it is better to clear
crashk_res.start and crashk_res.end.  Because 'end = start - 1' is not
reasonable, and in some places the test is based on crashk_res.end, not
resource_size(&crashk_res).

Link: https://lkml.kernel.org/r/20230527123439.772-4-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:23 -07:00
Zhen Lei 6f22a744f4 kexec: delete a useless check in crash_shrink_memory()
The check '(crashk_res.parent != NULL)' is added by commit e05bd3367b
("kexec: fix Oops in crash_shrink_memory()"), but it's stale now.  Because
if 'crashk_res' is not reserved, it will be zero in size and will be
intercepted by the above 'if (new_size >= old_size)'.

Ago:
	if (new_size >= end - start + 1)

Now:
	old_size = (end == 0) ? 0 : end - start + 1;
	if (new_size >= old_size)

Link: https://lkml.kernel.org/r/20230527123439.772-3-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:23 -07:00
Zhen Lei 1cba6c4309 kexec: fix a memory leak in crash_shrink_memory()
Patch series "kexec: enable kexec_crash_size to support two crash kernel
regions".

When crashkernel=X fails to reserve region under 4G, it will fall back to
reserve region above 4G and a region of the default size will also be
reserved under 4G.  Unfortunately, /sys/kernel/kexec_crash_size only
supports one crash kernel region now, the user cannot sense the low memory
reserved by reading /sys/kernel/kexec_crash_size.  Also, low memory cannot
be freed by writing this file.

For example:
resource_size(crashk_res) = 512M
resource_size(crashk_low_res) = 256M

The result of 'cat /sys/kernel/kexec_crash_size' is 512M, but it should be
768M.  When we execute 'echo 0 > /sys/kernel/kexec_crash_size', the size
of crashk_res becomes 0 and resource_size(crashk_low_res) is still 256 MB,
which is incorrect.

Since crashk_res manages the memory with high address and crashk_low_res
manages the memory with low address, crashk_low_res is shrunken only when
all crashk_res is shrunken.  And because when there is only one crash
kernel region, crashk_res is always used.  Therefore, if all crashk_res is
shrunken and crashk_low_res still exists, swap them.


This patch (of 6):

If the value of parameter 'new_size' is in the semi-open and semi-closed
interval (crashk_res.end - KEXEC_CRASH_MEM_ALIGN + 1, crashk_res.end], the
calculation result of ram_res is:

	ram_res->start = crashk_res.end + 1
	ram_res->end   = crashk_res.end

The operation of insert_resource() fails, and ram_res is not added to
iomem_resource.  As a result, the memory of the control block ram_res is
leaked.

In fact, on all architectures, the start address and size of crashk_res
are already aligned by KEXEC_CRASH_MEM_ALIGN.  Therefore, we do not need
to round up crashk_res.start again.  Instead, we should round up
'new_size' in advance.

Link: https://lkml.kernel.org/r/20230527123439.772-1-thunder.leizhen@huawei.com
Link: https://lkml.kernel.org/r/20230527123439.772-2-thunder.leizhen@huawei.com
Fixes: 6480e5a092 ("kdump: add missing RAM resource in crash_shrink_memory()")
Fixes: 06a7f71124 ("kexec: premit reduction of the reserved memory size")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:23 -07:00
Simon Horman 4df3504e2f kexec: avoid calculating array size twice
Avoid calculating array size twice in kexec_purgatory_setup_sechdrs().
Once using array_size(), and once open-coded.

Flagged by Coccinelle:

  .../kexec_file.c:881:8-25: WARNING: array_size is already used (line 877) to compute the same size

No functional change intended.
Compile tested only.

Link: https://lkml.kernel.org/r/20230525-kexec-array_size-v1-1-8b4bf4f7500a@kernel.org
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:22 -07:00
Lecopzer Chen 930d8f8dba watchdog/perf: adapt the watchdog_perf interface for async model
When lockup_detector_init()->watchdog_hardlockup_probe(), PMU may be not
ready yet.  E.g.  on arm64, PMU is not ready until
device_initcall(armv8_pmu_driver_init).  And it is deeply integrated with
the driver model and cpuhp.  Hence it is hard to push this initialization
before smp_init().

But it is easy to take an opposite approach and try to initialize the
watchdog once again later.  The delayed probe is called using workqueues. 
It need to allocate memory and must be proceed in a normal context.  The
delayed probe is able to use if watchdog_hardlockup_probe() returns
non-zero which means the return code returned when PMU is not ready yet.

Provide an API - lockup_detector_retry_init() for anyone who needs to
delayed init lockup detector if they had ever failed at
lockup_detector_init().

The original assumption is: nobody should use delayed probe after
lockup_detector_check() which has __init attribute.  That is, anyone uses
this API must call between lockup_detector_init() and
lockup_detector_check(), and the caller must have __init attribute

Link: https://lkml.kernel.org/r/20230519101840.v5.16.If4ad5dd5d09fb1309cebf8bcead4b6a5a7758ca7@changeid
Reviewed-by: Petr Mladek <pmladek@suse.com>
Co-developed-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:21 -07:00
Douglas Anderson b17aa95933 watchdog/perf: add a weak function for an arch to detect if perf can use NMIs
On arm64, NMI support needs to be detected at runtime.  Add a weak
function to the perf hardlockup detector so that an architecture can
implement it to detect whether NMIs are available.

Link: https://lkml.kernel.org/r/20230519101840.v5.15.Ic55cb6f90ef5967d8aaa2b503a4e67c753f64d3a@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:21 -07:00
Douglas Anderson 1f423c905a watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
Implement a hardlockup detector that doesn't doesn't need any extra
arch-specific support code to detect lockups.  Instead of using something
arch-specific we will use the buddy system, where each CPU watches out for
another one.  Specifically, each CPU will use its softlockup hrtimer to
check that the next CPU is processing hrtimer interrupts by verifying that
a counter is increasing.

NOTE: unlike the other hard lockup detectors, the buddy one can't easily
show what's happening on the CPU that locked up just by doing a simple
backtrace.  It relies on some other mechanism in the system to get
information about the locked up CPUs.  This could be support for NMI
backtraces like [1], it could be a mechanism for printing the PC of locked
CPUs at panic time like [2] / [3], or it could be something else.  Even
though that means we still rely on arch-specific code, this arch-specific
code seems to often be implemented even on architectures that don't have a
hardlockup detector.

This style of hardlockup detector originated in some downstream Android
trees and has been rebased on / carried in ChromeOS trees for quite a long
time for use on arm and arm64 boards.  Historically on these boards we've
leveraged mechanism [2] / [3] to get information about hung CPUs, but we
could move to [1].

Although the original motivation for the buddy system was for use on
systems without an arch-specific hardlockup detector, it can still be
useful to use even on systems that _do_ have an arch-specific hardlockup
detector.  On x86, for instance, there is a 24-part patch series [4] in
progress switching the arch-specific hard lockup detector from a scarce
perf counter to a less-scarce hardware resource.  Potentially the buddy
system could be a simpler alternative to free up the perf counter but
still get hard lockup detection.

Overall, pros (+) and cons (-) of the buddy system compared to an
arch-specific hardlockup detector (which might be implemented using
perf):
+ The buddy system is usable on systems that don't have an
  arch-specific hardlockup detector, like arm32 and arm64 (though it's
  being worked on for arm64 [5]).
+ The buddy system may free up scarce hardware resources.
+ If a CPU totally goes out to lunch (can't process NMIs) the buddy
  system could still detect the problem (though it would be unlikely
  to be able to get a stack trace).
+ The buddy system uses the same timer function to pet the hardlockup
  detector on the running CPU as it uses to detect hardlockups on
  other CPUs. Compared to other hardlockup detectors, this means it
  generates fewer interrupts and thus is likely better able to let
  CPUs stay idle longer.
- If all CPUs are hard locked up at the same time the buddy system
  can't detect it.
- If we don't have SMP we can't use the buddy system.
- The buddy system needs an arch-specific mechanism (possibly NMI
  backtrace) to get info about the locked up CPU.

[1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org
[2] https://issuetracker.google.com/172213129
[3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
[4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/
[5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/

Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ian Rogers <irogers@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:21 -07:00
Douglas Anderson d9b3629ade watchdog/hardlockup: have the perf hardlockup use __weak functions more cleanly
The fact that there watchdog_hardlockup_enable(),
watchdog_hardlockup_disable(), and watchdog_hardlockup_probe() are
declared __weak means that the configured hardlockup detector can define
non-weak versions of those functions if it needs to.  Instead of doing
this, the perf hardlockup detector hooked itself into the default __weak
implementation, which was a bit awkward.  Clean this up.

From comments, it looks as if the original design was done because the
__weak function were expected to implemented by the architecture and not
by the configured hardlockup detector.  This got awkward when we tried to
add the buddy lockup detector which was not arch-specific but wanted to
hook into those same functions.

This is not expected to have any functional impact.

Link: https://lkml.kernel.org/r/20230519101840.v5.13.I847d9ec852449350997ba00401d2462a9cb4302b@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:21 -07:00
Douglas Anderson df95d3085c watchdog/hardlockup: rename some "NMI watchdog" constants/function
Do a search and replace of:
- NMI_WATCHDOG_ENABLED => WATCHDOG_HARDLOCKUP_ENABLED
- SOFT_WATCHDOG_ENABLED => WATCHDOG_SOFTOCKUP_ENABLED
- watchdog_nmi_ => watchdog_hardlockup_
- nmi_watchdog_available => watchdog_hardlockup_available
- nmi_watchdog_user_enabled => watchdog_hardlockup_user_enabled
- soft_watchdog_user_enabled => watchdog_softlockup_user_enabled
- NMI_WATCHDOG_DEFAULT => WATCHDOG_HARDLOCKUP_DEFAULT

Then update a few comments near where names were changed.

This is specifically to make it less confusing when we want to introduce
the buddy hardlockup detector, which isn't using NMIs.  As part of this,
we sanitized a few names for consistency.

[trix@redhat.com: make variables static]
  Link: https://lkml.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid
Link: https://lkml.kernel.org/r/20230519101840.v5.12.I91f7277bab4bf8c0cb238732ed92e7ce7bbd71a6@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:20 -07:00
Douglas Anderson ed92e1ef52 watchdog/hardlockup: move perf hardlockup watchdog petting to watchdog.c
In preparation for the buddy hardlockup detector, which wants the same
petting logic as the current perf hardlockup detector, move the code to
watchdog.c.  While doing this, rename the global variable to match others
nearby.  As part of this change we have to change the code to account for
the fact that the CPU we're running on might be different than the one
we're checking.

Currently the code in watchdog.c is guarded by
CONFIG_HARDLOCKUP_DETECTOR_PERF, which makes this change seem silly. 
However, a future patch will change this.

Link: https://lkml.kernel.org/r/20230519101840.v5.11.I00dfd6386ee00da25bf26d140559a41339b53e57@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:20 -07:00
Douglas Anderson 77c12fc959 watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()
In preparation for the buddy hardlockup detector where the CPU checking
for lockup might not be the currently running CPU, add a "cpu" parameter
to watchdog_hardlockup_check().

As part of this change, make hrtimer_interrupts an atomic_t since now the
CPU incrementing the value and the CPU reading the value might be
different.  Technially this could also be done with just READ_ONCE and
WRITE_ONCE, but atomic_t feels a little cleaner in this case.

While hrtimer_interrupts is made atomic_t, we change
hrtimer_interrupts_saved from "unsigned long" to "int".  The "int" is
needed to match the data type backing atomic_t for hrtimer_interrupts. 
Even if this changes us from 64-bits to 32-bits (which I don't think is
true for most compilers), it doesn't really matter.  All we ever do is
increment it every few seconds and compare it to an old value so 32-bits
is fine (even 16-bits would be).  The "signed" vs "unsigned" also doesn't
matter for simple equality comparisons.

hrtimer_interrupts_saved is _not_ switched to atomic_t nor even accessed
with READ_ONCE / WRITE_ONCE.  The hrtimer_interrupts_saved is always
consistently accessed with the same CPU.  NOTE: with the upcoming "buddy"
detector there is one special case.  When a CPU goes offline/online then
we can change which CPU is the one to consistently access a given instance
of hrtimer_interrupts_saved.  We still can't end up with a partially
updated hrtimer_interrupts_saved, however, because we end up petting all
affected CPUs to make sure the new and old CPU can't end up somehow
read/write hrtimer_interrupts_saved at the same time.

Link: https://lkml.kernel.org/r/20230519101840.v5.10.I3a7d4dd8c23ac30ee0b607d77feb6646b64825c0@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:20 -07:00
Douglas Anderson 1610611aad watchdog/hardlockup: style changes to watchdog_hardlockup_check() / is_hardlockup()
These are tiny style changes:
- Add a blank line before a "return".
- Renames two globals to use the "watchdog_hardlockup" prefix.
- Store processor id in "unsigned int" rather than "int".
- Minor comment rewording.
- Use "else" rather than extra returns since it seemed more symmetric.

Link: https://lkml.kernel.org/r/20230519101840.v5.9.I818492c326b632560b09f20d2608455ecf9d3650@changeid
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:20 -07:00
Douglas Anderson 81972551df watchdog/hardlockup: move perf hardlockup checking/panic to common watchdog.c
The perf hardlockup detector works by looking at interrupt counts and
seeing if they change from run to run.  The interrupt counts are managed
by the common watchdog code via its watchdog_timer_fn().

Currently the API between the perf detector and the common code is a
function: is_hardlockup().  When the hard lockup detector sees that
function return true then it handles printing out debug info and inducing
a panic if necessary.

Let's change the API a little bit in preparation for the buddy hardlockup
detector.  The buddy hardlockup detector wants to print nearly the same
debug info and have nearly the same panic behavior.  That means we want to
move all that code to the common file.  For now, the code in the common
file will only be there if the perf hardlockup detector is enabled, but
eventually it will be selected by a common config.

Right now, this _just_ moves the code from the perf detector file to the
common file and changes the names.  It doesn't make the changes that the
buddy hardlockup detector will need and doesn't do any style cleanups.  A
future patch will do cleanup to make it more obvious what changed.

With the above, we no longer have any callers of is_hardlockup() outside
of the "watchdog.c" file, so we can remove it from the header, make it
static, and move it to the same "#ifdef" block as our new
watchdog_hardlockup_check().  While doing this, it can be noted that even
if no hardlockup detectors were configured the existing code used to still
have the code for counting/checking "hrtimer_interrupts" even if the perf
hardlockup detector wasn't configured.  We didn't need to do that, so move
all the "hrtimer_interrupts" counting to only be there if the perf
hardlockup detector is configured as well.

This change is expected to be a no-op.

Link: https://lkml.kernel.org/r/20230519101840.v5.8.Id4133d3183e798122dc3b6205e7852601f289071@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:19 -07:00
Douglas Anderson 6ea0d04211 watchdog/perf: rename watchdog_hld.c to watchdog_perf.c
The code currently in "watchdog_hld.c" is for detecting hardlockups using
perf, as evidenced by the line in the Makefile that only compiles this
file if CONFIG_HARDLOCKUP_DETECTOR_PERF is defined.  Rename the file to
prepare for the buddy hardlockup detector, which doesn't use perf.

It could be argued that the new name makes it less obvious that this is a
hardlockup detector.  While true, it's not hard to remember that the
"perf" detector is always a hardlockup detector and it's nice not to have
names that are too convoluted.

Link: https://lkml.kernel.org/r/20230519101840.v5.7.Ice803cb078d0e15fb2cbf49132f096ee2bd4199d@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:19 -07:00
Pingfan Liu 1fafaa7745 watchdog/perf: ensure CPU-bound context when creating hardlockup detector event
hardlockup_detector_event_create() should create perf_event on the current
CPU.  Preemption could not get disabled because
perf_event_create_kernel_counter() allocates memory.  Instead, the CPU
locality is achieved by processing the code in a per-CPU bound kthread.

Add a check to prevent mistakes when calling the code in another code
path.

Link: https://lkml.kernel.org/r/20230519101840.v5.5.I654063e53782b11d53e736a8ad4897ffd207406a@changeid
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Co-developed-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:19 -07:00
Lecopzer Chen 730211182e watchdog/hardlockup: change watchdog_nmi_enable() to void
Nobody cares about the return value of watchdog_nmi_enable(), changing its
prototype to void.

Link: https://lkml.kernel.org/r/20230519101840.v5.4.Ic3a19b592eb1ac4c6f6eade44ffd943e8637b6e5@changeid
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:18 -07:00
Lecopzer Chen 810b560e89 watchdog: remove WATCHDOG_DEFAULT
No reference to WATCHDOG_DEFAULT, remove it.

Link: https://lkml.kernel.org/r/20230519101840.v5.3.I6a729209a1320e0ad212176e250ff945b8f91b2a@changeid
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:18 -07:00
Douglas Anderson 4379e59fe5 watchdog/perf: more properly prevent false positives with turbo modes
Currently, in the watchdog_overflow_callback() we first check to see if
the watchdog had been touched and _then_ we handle the workaround for
turbo mode.  This order should be reversed.

Specifically, "touching" the hardlockup detector's watchdog should avoid
lockups being detected for one period that should be roughly the same
regardless of whether we're running turbo or not.  That means that we
should do the extra accounting for turbo _before_ we look at (and clear)
the global indicating that we've been touched.

NOTE: this fix is made based on code inspection.  I am not aware of any
reports where the old code would have generated false positives.  That
being said, this order seems more correct and also makes it easier down
the line to share code with the "buddy" hardlockup detector.

Link: https://lkml.kernel.org/r/20230519101840.v5.2.I843b0d1de3e096ba111a179f3adb16d576bef5c7@changeid
Fixes: 7edaeb6841 ("kernel/watchdog: Prevent false positives with turbo modes")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Colin Cross <ccross@android.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Tzung-Bi Shih <tzungbi@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:18 -07:00
Arnd Bergmann e0ddec73fd kcov: add prototypes for helper functions
A number of internal functions in kcov are only called from generated code
and don't technically need a declaration, but 'make W=1' warns about
global symbols without a prototype:

kernel/kcov.c:199:14: error: no previous prototype for '__sanitizer_cov_trace_pc' [-Werror=missing-prototypes]
kernel/kcov.c:264:14: error: no previous prototype for '__sanitizer_cov_trace_cmp1' [-Werror=missing-prototypes]
kernel/kcov.c:270:14: error: no previous prototype for '__sanitizer_cov_trace_cmp2' [-Werror=missing-prototypes]
kernel/kcov.c:276:14: error: no previous prototype for '__sanitizer_cov_trace_cmp4' [-Werror=missing-prototypes]
kernel/kcov.c:282:14: error: no previous prototype for '__sanitizer_cov_trace_cmp8' [-Werror=missing-prototypes]
kernel/kcov.c:288:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp1' [-Werror=missing-prototypes]
kernel/kcov.c:295:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp2' [-Werror=missing-prototypes]
kernel/kcov.c:302:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp4' [-Werror=missing-prototypes]
kernel/kcov.c:309:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp8' [-Werror=missing-prototypes]
kernel/kcov.c:316:14: error: no previous prototype for '__sanitizer_cov_trace_switch' [-Werror=missing-prototypes]

Adding prototypes for these in a header solves that problem, but now there
is a mismatch between the built-in type and the prototype on 64-bit
architectures because they expect some functions to take a 64-bit
'unsigned long' argument rather than an 'unsigned long long' u64 type:

include/linux/kcov.h:84:6: error: conflicting types for built-in function '__sanitizer_cov_trace_switch'; expected 'void(long long unsigned int,  void *)' [-Werror=builtin-declaration-mismatch]
   84 | void __sanitizer_cov_trace_switch(u64 val, u64 *cases);
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

Avoid this as well with a custom type definition.

Link: https://lkml.kernel.org/r/20230517124944.929997-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:17 -07:00
Arnd Bergmann 525bb813a9 panic: hide unused global functions
Building with W=1 shows warnings about two functions that have no
declaration or caller in certain configurations:

kernel/panic.c:688:6: error: no previous prototype for 'warn_slowpath_fmt' [-Werror=missing-prototypes]
kernel/panic.c:710:6: error: no previous prototype for '__warn_printk' [-Werror=missing-prototypes]

Enclose the definition in the same #ifdef check as the declaration.

Link: https://lkml.kernel.org/r/20230517131102.934196-8-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:15 -07:00
Arnd Bergmann ff7138813a locking: add lockevent_read() prototype
lockevent_read() has a __weak definition and the only caller in
kernel/locking/lock_events.c, plus a strong definition in qspinlock_stat.h
that overrides it, but no other declaration.  This causes a W=1 warning:

kernel/locking/lock_events.c:61:16: error: no previous prototype for 'lockevent_read' [-Werror=missing-prototypes]

Add shared prototype to avoid the warnings.

Link: https://lkml.kernel.org/r/20230517131102.934196-7-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:15 -07:00
Haifeng Xu 4e2f6342cc fork: optimize memcg_charge_kernel_stack() a bit
Since commit f1c1a9ee00 ("fork: Move memcg_charge_kernel_stack()
into CONFIG_VMAP_STACK"), memcg_charge_kernel_stack() has been moved
into CONFIG_VMAP_STACK block, so the CONFIG_VMAP_STACK check can be
removed.

Furthermore, memcg_charge_kernel_stack() is only invoked by
alloc_thread_stack_node() instead of dup_task_struct(). If
memcg_kmem_charge_page() fails, the uncharge process is handled in
memcg_charge_kernel_stack() itself instead of free_thread_stack(),
so remove the incorrect comments.

If memcg_charge_kernel_stack() fails to charge pages used by kernel
stack, only charged pages need to be uncharged. It's unnecessary to
uncharge those pages which memory cgroup pointer is NULL.

[akpm@linux-foundation.org: remove assertion that PAGE_SIZE is a multiple of 1k]
Link: https://lkml.kernel.org/r/20230508064458.32855-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:13 -07:00
Prathu Baronia 6a25212dc3 kthread: fix spelling typo and grammar in comments
- `If present` -> `If present,'
- `reuturn` -> `return`
- `function exit safely` -> `function to exit safely`

Link: https://lkml.kernel.org/r/20230502090242.3037194-1-quic_pbaronia@quicinc.com
Signed-off-by: Prathu Baronia <quic_pbaronia@quicinc.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:12 -07:00
Lorenzo Stoakes ca5e863233 mm/gup: remove vmas parameter from get_user_pages_remote()
The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-

- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
  remove it.

- __access_remote_vm() was already using vma_lookup() when the original
  lookup failed so by doing the lookup directly this also de-duplicates the
  code.

We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().

As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.

This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.

[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Lorenzo Stoakes 0b295316b3 mm/gup: remove unused vmas parameter from pin_user_pages_remote()
No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it.  This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.

Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:25 -07:00
Kefeng Wang e95d372c4c mm: page_alloc: move sysctls into it own fils
This moves all page alloc related sysctls to its own file, as part of the
kernel/sysctl.c spring cleaning, also move some functions declarations
from mm.h into internal.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00
Kefeng Wang 07f44ac3c9 mm: page_alloc: move pm_* function into power
pm_restrict_gfp_mask()/pm_restore_gfp_mask() only used in power, let's
move them out of page_alloc.c.

Adding a general gfp_has_io_fs() function which return true if gfp with
both __GFP_IO and __GFP_FS flags, then use it inside of
pm_suspended_storage(), also the pm_suspended_storage() is moved into
suspend.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-11-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00
Kefeng Wang 31a1b9d7fe mm: page_alloc: move mark_free_page() into snapshot.c
The mark_free_page() is only used in kernel/power/snapshot.c, move it out
to reduce a bit of page_alloc.c

Link: https://lkml.kernel.org/r/20230516063821.121844-10-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00
Nhat Pham cf264e1329 cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees.  There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case.  The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.

Some use cases where this information could come in handy:
  * Allowing database to decide whether to perform an index scan or
    direct table queries based on the in-memory cache state of the
    index.
  * Visibility into the writeback algorithm, for performance issues
    diagnostic.
  * Workload-aware writeback pacing: estimating IO fulfilled by page
    cache (and IO to be done) within a range of a file, allowing for
    more frequent syncing when and where there is IO capacity, and
    batching when there is not.
  * Computing memory usage of large files/directory trees, analogous to
    the du tool for disk usage.

More information about these use cases could be found in the following
thread:

https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/

This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc.  in a
given range.  Currently, the syscall is only wired in for x86
architecture.

NAME
    cachestat - query the page cache statistics of a file.

SYNOPSIS
    #include <sys/mman.h>

    struct cachestat_range {
        __u64 off;
        __u64 len;
    };

    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };

    int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
        struct cachestat *cstat, unsigned int flags);

DESCRIPTION
    cachestat() queries the number of cached pages, number of dirty
    pages, number of pages marked for writeback, number of evicted
    pages, number of recently evicted pages, in the bytes range given by
    `off` and `len`.

    An evicted page is a page that is previously in the page cache but
    has been evicted since. A page is recently evicted if its last
    eviction was recent enough that its reentry to the cache would
    indicate that it is actively being used by the system, and that
    there is memory pressure on the system.

    These values are returned in a cachestat struct, whose address is
    given by the `cstat` argument.

    The `off` and `len` arguments must be non-negative integers. If
    `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
    0, we will query in the range from `off` to the end of the file.

    The `flags` argument is unused for now, but is included for future
    extensibility. User should pass 0 (i.e no flag specified).

    Currently, hugetlbfs is not supported.

    Because the status of a page can change after cachestat() checks it
    but before it returns to the application, the returned values may
    contain stale information.

RETURN VALUE
    On success, cachestat returns 0. On error, -1 is returned, and errno
    is set to indicate the error.

ERRORS
    EFAULT cstat or cstat_args points to an invalid address.

    EINVAL invalid flags.

    EBADF  invalid file descriptor.

    EOPNOTSUPP file descriptor is of a hugetlbfs file

[nphamcs@gmail.com: replace rounddown logic with the existing helper]
  Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:16 -07:00
Yosry Ahmed 0a2dc6ac33 cgroup: remove cgroup_rstat_flush_atomic()
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). 
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:15 -07:00
Linus Torvalds dbfa18c5d7 virtio,vhost,vdpa: bugfixes
A bunch of fixes all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmSDUGkPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpw6gH+wbopniPKDrxNwTpDFx3jix3QDTzVMY4Bq4k
 QdwPfjAZ1aDZXYHV1CdFXeKTA+ZkWHIREZSr+E/2/jeI55Exc2AeFptZrUesSg29
 jMN1MPs00CCy8Qi9BiCZIQkFkIKHNA2PY8wIA0oIXhIaG7pBtYQ14CnAFqn41ev5
 II20h389KMthe0lwm4ni/qHVZzG/2qP/JXLKf35proDEnU5WWM1rQZ1666EFMaIR
 6QExqwbPubxfv44Kl3mMkanGj6MmtLtFa2XlMLbEfLrU5/Xz+CywqSFHTUerrh3I
 eTNyqz4Oyj6UpRq264rqQBJmpSn8LWFBZXQlJ6Y+ef/h8Mhdewk=
 =G8CT
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio bug fixes from Michael Tsirkin:
 "A bunch of fixes all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  tools/virtio: use canonical ftrace path
  vhost_vdpa: support PACKED when setting-getting vring_base
  vhost: support PACKED when setting-getting vring_base
  vhost: Fix worker hangs due to missed wake up calls
  vhost: Fix crash during early vhost_transport_send_pkt calls
  vhost_net: revert upend_idx only on retriable error
  vhost_vdpa: tell vqs about the negotiated
  vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister
  tools/virtio: Add .gitignore for ringtest
  tools/virtio: Fix arm64 ringtest compilation error
  vduse: avoid empty string for dev name
  vhost: use kzalloc() instead of kmalloc() followed by memset()
2023-06-09 11:04:08 -07:00
Christophe Leroy 353e7300a1 kcsan: Don't expect 64 bits atomic builtins from 32 bits architectures
Activating KCSAN on a 32 bits architecture leads to the following
link-time failure:

    LD      .tmp_vmlinux.kallsyms1
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_load':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_load_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_store':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_store_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_exchange':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_exchange_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_add':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_add_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_sub':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_sub_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_and':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_and_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_or':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_or_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_xor':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_xor_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_nand':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_nand_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_strong':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_weak':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8'
  powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_val':
  kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8'

32 bits architectures don't have 64 bits atomic builtins. Only
include DEFINE_TSAN_ATOMIC_OPS(64) on 64 bits architectures.

Fixes: 0f8ad5f2e9 ("kcsan: Add support for atomic builtins")
Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/d9c6afc28d0855240171a4e0ad9ffcdb9d07fceb.1683892665.git.christophe.leroy@csgroup.eu
2023-06-09 23:29:50 +10:00
Linus Torvalds 9cd6357f10 cgroup: Fixes for v6.4-rc5
* Fix css_set reference leaks on fork failures.
 
 * Fix CPU hotplug locking in cgroup_transfer_tasks() which is used by
   cgroup1 cpuset.
 
 * Doc update.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZIJ5bQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGc/CAQCmE2cKGBWN45xbzIA5S7+zq8QCv85BYlnAgqpR
 jgF8GQD/fFXdmKL0wTzjTf1YOvEi9UxJqhDvHSRtV53fzPedbg4=
 =s+u8
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix css_set reference leaks on fork failures

 - Fix CPU hotplug locking in cgroup_transfer_tasks() which is used by
   cgroup1 cpuset

 - Doc update

* tag 'cgroup-for-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Documentation: Clarify usage of memory limits
  cgroup: always put cset in cgroup_css_set_put_fork
  cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
2023-06-08 18:52:54 -07:00
Luis Chamberlain 28898e260a sysctl: move security keys sysctl registration to its own file
The security keys sysctls are already declared on its own file,
just move the sysctl registration to its own file to help avoid
merge conflicts on sysctls.c, and help with clearing up sysctl.c
further.

This creates a small penalty of 23 bytes:

./scripts/bloat-o-meter vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function                                     old     new   delta
init_security_keys_sysctls                     -      33     +33
__pfx_init_security_keys_sysctls               -      16     +16
sysctl_init_bases                             85      59     -26
Total: Before=21256937, After=21256960, chg +0.00%

But soon we'll be saving tons of bytes anyway, as we modify the
sysctl registrations to use ARRAY_SIZE and so we get rid of all the
empty array elements so let's just clean this up now.

Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-08 15:42:02 -07:00
Luis Chamberlain 861dc0b464 sysctl: move umh sysctl registration to its own file
Move the umh sysctl registration to its own file, the array is
already there. We do this to remove the clutter out of kernel/sysctl.c
to avoid merge conflicts.

This also lets the sysctls not be built at all now when CONFIG_SYSCTL
is not enabled.

This has a small penalty of 23 bytes but soon we'll be removing
all the empty entries on sysctl arrays so just do this cleanup
now:

./scripts/bloat-o-meter vmlinux.base vmlinux.1
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function                                     old     new   delta
init_umh_sysctls                               -      33     +33
__pfx_init_umh_sysctls                         -      16     +16
sysctl_init_bases                            111      85     -26
Total: Before=21256914, After=21256937, chg +0.00%

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-08 15:41:34 -07:00
Mike Christie 4b13cbef79 vhost: Fix worker hangs due to missed wake up calls
We can race where we have added work to the work_list, but
vhost_task_fn has passed that check but not yet set us into
TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and
just return.

This bug was intoduced in commit f9010dbdce ("fork, vhost: Use
CLONE_THREAD to fix freezer/ps regression") when I moved the setting
of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from
logging warnings about being in the wrong state. This moves the setting
of TASK_INTERRUPTIBLE back to before we test if we need to stop the
task to avoid a possible race there as well. We then have vhost_worker
set TASK_RUNNING if it finds work similar to before.

Fixes: f9010dbdce ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230607192338.6041-3-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-06-08 15:43:09 -04:00
Maninder Singh 0eeaf1eb40 kallsyms: make kallsyms_show_value() as generic function
This change makes function kallsyms_show_value() as
generic function without dependency on CONFIG_KALLSYMS.

Now module address will be displayed with lsmod and /proc/modules.

Earlier:
=======
/ # insmod  test.ko
/ # lsmod
test 12288 0 - Live 0x0000000000000000 (O)  // No Module Load address
/ #

With change:
==========
/ # insmod test.ko
/ # lsmod
test 12288 0 - Live 0xffff800000fc0000 (O)  // Module address
/ # cat /proc/modules
test 12288 0 - Live 0xffff800000fc0000 (O)

Co-developed-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-08 12:27:20 -07:00
Maninder Singh b06e9318bf kallsyms: move kallsyms_show_value() out of kallsyms.c
function kallsyms_show_value() is used by other parts
like modules_open(), kprobes_read() etc. which can work in case of
!KALLSYMS also.

e.g. as of now lsmod do not show module address if KALLSYMS is disabled.
since kallsyms_show_value() defination is not present, it returns false
in !KALLSYMS.

/ # lsmod
test 12288 0 - Live 0x0000000000000000 (O)

So kallsyms_show_value() can be made generic
without dependency on KALLSYMS.

Thus moving out function to a new file ksyms_common.c.

With this patch code is just moved to new file
and no functional change.

Co-developed-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-08 12:27:20 -07:00
Jakub Kicinski 449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Linus Torvalds 25041a4c02 Networking fixes for 6.4-rc6, including fixes from can, wifi, netfilter,
bluetooth and ebpf.
 
 Current release - regressions:
 
   - bpf: sockmap: avoid potential NULL dereference in sk_psock_verdict_data_ready()
 
   - wifi: iwlwifi: fix -Warray-bounds bug in iwl_mvm_wait_d3_notif()
 
   - phylink: actually fix ksettings_set() ethtool call
 
   - eth: dwmac-qcom-ethqos: fix a regression on EMAC < 3
 
 Current release - new code bugs:
 
   - wifi: mt76: fix possible NULL pointer dereference in mt7996_mac_write_txwi()
 
 Previous releases - regressions:
 
   - netfilter: fix NULL pointer dereference in nf_confirm_cthelper
 
   - wifi: rtw88/rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS
 
   - openvswitch: fix upcall counter access before allocation
 
   - bluetooth:
     - fix use-after-free in hci_remove_ltk/hci_remove_irk
     - fix l2cap_disconnect_req deadlock
 
   - nic: bnxt_en: prevent kernel panic when receiving unexpected PHC_UPDATE event
 
 Previous releases - always broken:
 
   - core: annotate rfs lockless accesses
 
   - sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values
 
   - netfilter: add null check for nla_nest_start_noflag() in nft_dump_basechain_hook()
 
   - bpf: fix UAF in task local storage
 
   - ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
 
   - ipv6: rpl: fix route of death.
 
   - tcp: gso: really support BIG TCP
 
   - mptcp: fixes for user-space PM address advertisement
 
   - smc: avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT
 
   - can: avoid possible use-after-free when j1939_can_rx_register fails
 
   - batman-adv: fix UaF while rescheduling delayed work
 
   - eth: qede: fix scheduling while atomic
 
   - eth: ice: make writes to /dev/gnssX synchronous
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmSBsv4SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkMXUP/jisT2xvTFRmtshX3h+xxPkBxZSo9ovx
 ujviqZkyCNep9fu7Njv+5WWp0V8cy3Ui6G6RiGNHDV24vBtISlX21yQt+VANOPjH
 7x8oqqnANxn3PXjL5hp6YZhNaxiwfAfQGJiU+TngVo1jTJopnWEt2x8Q3EhF/k0S
 id8VaHGh/ugC8lRZSJBK/b+FsJjWY0sxTcsoRSjp6gg1WHUVO8mJXlCfHFhNJcQQ
 /8ghieuskLUs4V6UX3TGg4smGxgl2HPdA79+ohvrVhcB1WoGCsWV83SfUTBWgHkU
 IZrIfM4BFCThcN88IgRgJioeX95D54SK0RzEZdCnJx+elmgTK1ZdUGlBh1Vybh+v
 iQel2dgJI+8zyIl/4lXYdhHogLwnONVrkszMrx+Ds2PzNecmnFWg4LUK01xLjW7J
 poAFsZGVBk0BuTkEqXtxv/8Cc7wU/PMOmy4ZVBrHkNIyGgOLbt5eM0T/pArYoKvr
 +34del2Us2vGVk6i89F/GgRuNCvevO0Y+HyAArOJr2XwpakwQYQHdBdj/77FGjFZ
 PyR/bVJZhxdUMv+J7BdKQK+mwt+ZFBVwIRfU2gvHcDa2XQJe2Eg8GXRtcJ1P7hpr
 Q2A+AgiHSoAn6GrgYNHNZVBhWywQFCsu2ZpH7J0uo4zOyTUl3+4O8jyfDrD7o56D
 BodtDJKZit3B
 =X6b2
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can, wifi, netfilter, bluetooth and ebpf.

  Current release - regressions:

   - bpf: sockmap: avoid potential NULL dereference in
     sk_psock_verdict_data_ready()

   - wifi: iwlwifi: fix -Warray-bounds bug in iwl_mvm_wait_d3_notif()

   - phylink: actually fix ksettings_set() ethtool call

   - eth: dwmac-qcom-ethqos: fix a regression on EMAC < 3

  Current release - new code bugs:

   - wifi: mt76: fix possible NULL pointer dereference in
     mt7996_mac_write_txwi()

  Previous releases - regressions:

   - netfilter: fix NULL pointer dereference in nf_confirm_cthelper

   - wifi: rtw88/rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS

   - openvswitch: fix upcall counter access before allocation

   - bluetooth:
      - fix use-after-free in hci_remove_ltk/hci_remove_irk
      - fix l2cap_disconnect_req deadlock

   - nic: bnxt_en: prevent kernel panic when receiving unexpected
     PHC_UPDATE event

  Previous releases - always broken:

   - core: annotate rfs lockless accesses

   - sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values

   - netfilter: add null check for nla_nest_start_noflag() in
     nft_dump_basechain_hook()

   - bpf: fix UAF in task local storage

   - ipv4: ping_group_range: allow GID from 2147483648 to 4294967294

   - ipv6: rpl: fix route of death.

   - tcp: gso: really support BIG TCP

   - mptcp: fixes for user-space PM address advertisement

   - smc: avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT

   - can: avoid possible use-after-free when j1939_can_rx_register fails

   - batman-adv: fix UaF while rescheduling delayed work

   - eth: qede: fix scheduling while atomic

   - eth: ice: make writes to /dev/gnssX synchronous"

* tag 'net-6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  bnxt_en: Implement .set_port / .unset_port UDP tunnel callbacks
  bnxt_en: Prevent kernel panic when receiving unexpected PHC_UPDATE event
  bnxt_en: Skip firmware fatal error recovery if chip is not accessible
  bnxt_en: Query default VLAN before VNIC setup on a VF
  bnxt_en: Don't issue AP reset during ethtool's reset operation
  bnxt_en: Fix bnxt_hwrm_update_rss_hash_cfg()
  net: bcmgenet: Fix EEE implementation
  eth: ixgbe: fix the wake condition
  eth: bnxt: fix the wake condition
  lib: cpu_rmap: Fix potential use-after-free in irq_cpu_rmap_release()
  bpf: Add extra path pointer check to d_path helper
  net: sched: fix possible refcount leak in tc_chain_tmplt_add()
  net: sched: act_police: fix sparse errors in tcf_police_dump()
  net: openvswitch: fix upcall counter access before allocation
  net: sched: move rtm_tca_policy declaration to include file
  ice: make writes to /dev/gnssX synchronous
  net: sched: add rcu annotations around qdisc->qdisc_sleeping
  rfs: annotate lockless accesses to RFS sock flow table
  rfs: annotate lockless accesses to sk->sk_rxhash
  virtio_net: use control_buf for coalesce params
  ...
2023-06-08 09:27:19 -07:00
Andy Chiu 1fd96a3e9d
riscv: Add prctl controls for userspace vector management
This patch add two riscv-specific prctls, to allow usespace control the
use of vector unit:

 * PR_RISCV_V_SET_CONTROL: control the permission to use Vector at next,
   or all following execve for a thread. Turning off a thread's Vector
   live is not possible since libraries may have registered ifunc that
   may execute Vector instructions.
 * PR_RISCV_V_GET_CONTROL: get the same permission setting for the
   current thread, and the setting for following execve(s).

Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Greentime Hu <greentime.hu@sifive.com>
Reviewed-by: Vincent Chen <vincent.chen@sifive.com>
Link: https://lore.kernel.org/r/20230605110724.21391-22-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-08 07:16:53 -07:00
Maxim Mikityanskiy 713274f1f2 bpf: Fix verifier id tracking of scalars on spill
The following scenario describes a bug in the verifier where it
incorrectly concludes about equivalent scalar IDs which could lead to
verifier bypass in privileged mode:

1. Prepare a 32-bit rogue number.
2. Put the rogue number into the upper half of a 64-bit register, and
   roll a random (unknown to the verifier) bit in the lower half. The
   rest of the bits should be zero (although variations are possible).
3. Assign an ID to the register by MOVing it to another arbitrary
   register.
4. Perform a 32-bit spill of the register, then perform a 32-bit fill to
   another register. Due to a bug in the verifier, the ID will be
   preserved, although the new register will contain only the lower 32
   bits, i.e. all zeros except one random bit.

At this point there are two registers with different values but the same
ID, which means the integrity of the verifier state has been corrupted.

5. Compare the new 32-bit register with 0. In the branch where it's
   equal to 0, the verifier will believe that the original 64-bit
   register is also 0, because it has the same ID, but its actual value
   still contains the rogue number in the upper half.
   Some optimizations of the verifier prevent the actual bypass, so
   extra care is needed: the comparison must be between two registers,
   and both branches must be reachable (this is why one random bit is
   needed). Both branches are still suitable for the bypass.
6. Right shift the original register by 32 bits to pop the rogue number.
7. Use the rogue number as an offset with any pointer. The verifier will
   believe that the offset is 0, while in reality it's the given number.

The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for
SCALAR_VALUE. If the spill is narrowing the actual register value, don't
keep the ID, make sure it's reset to 0.

Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # Checked veristat delta
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230607123951.558971-2-maxtram95@gmail.com
2023-06-08 10:27:43 +02:00
Jakub Kicinski c9d99cfa66 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZIDxUwAKCRDbK58LschI
 g5hDAQD7ukrniCvMRNIm2yUZIGSxE4RvGiXptO4a0NfLck5R/wEAsfN2KUsPcPhW
 HS37lVfx7VVXfj42+REf7lWLu4TXpwk=
 =6mS/
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-06-07

We've added 7 non-merge commits during the last 7 day(s) which contain
a total of 12 files changed, 112 insertions(+), 7 deletions(-).

The main changes are:

1) Fix a use-after-free in BPF's task local storage, from KP Singh.

2) Make struct path handling more robust in bpf_d_path, from Jiri Olsa.

3) Fix a syzbot NULL-pointer dereference in sockmap, from Eric Dumazet.

4) UAPI fix for BPF_NETFILTER before final kernel ships,
   from Florian Westphal.

5) Fix map-in-map array_map_gen_lookup code generation where elem_size was
   not being set for inner maps, from Rhys Rustad-Elliott.

6) Fix sockopt_sk selftest's NETLINK_LIST_MEMBERSHIPS assertion,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add extra path pointer check to d_path helper
  selftests/bpf: Fix sockopt_sk selftest
  bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
  selftests/bpf: Add access_inner_map selftest
  bpf: Fix elem_size not being set for inner maps
  bpf: Fix UAF in task local storage
  bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()
====================

Link: https://lore.kernel.org/r/20230607220514.29698-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-07 21:47:11 -07:00
Paul E. McKenney 2e31da752c Merge branches 'doc.2023.05.10a', 'fixes.2023.05.11a', 'kvfree.2023.05.10a', 'nocb.2023.05.11a', 'rcu-tasks.2023.05.10a', 'torture.2023.05.15a' and 'rcu-urgent.2023.06.06a' into HEAD
doc.2023.05.10a: Documentation updates
fixes.2023.05.11a: Miscellaneous fixes
kvfree.2023.05.10a: kvfree_rcu updates
nocb.2023.05.11a: Callback-offloading updates
rcu-tasks.2023.05.10a: Tasks RCU updates
torture.2023.05.15a: Torture-test updates
rcu-urgent.2023.06.06a: Urgent SRCU fix
2023-06-07 13:44:06 -07:00
Petr Tesarik 693405cf11 swiotlb: use the atomic counter of total used slabs if available
If DEBUG_FS is enabled, the cost of keeping an exact number of total
used slabs is already paid. In this case, there is no reason to use an
inexact number for statistics and kernel messages.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-07 15:06:40 +02:00
gaoxu 51ff97d54f dma-remap: use kvmalloc_array/kvfree for larger dma memory remap
If dma_direct_alloc() alloc memory in size of 64MB, the inner function
dma_common_contiguous_remap() will allocate 128KB memory by invoking
the function kmalloc_array(). and the kmalloc_array seems to fail to try to
allocate 128KB mem.

Call trace:
[14977.928623] qcrosvm: page allocation failure: order:5, mode:0x40cc0
[14977.928638] dump_backtrace.cfi_jt+0x0/0x8
[14977.928647] dump_stack_lvl+0x80/0xb8
[14977.928652] warn_alloc+0x164/0x200
[14977.928657] __alloc_pages_slowpath+0x9f0/0xb4c
[14977.928660] __alloc_pages+0x21c/0x39c
[14977.928662] kmalloc_order+0x48/0x108
[14977.928666] kmalloc_order_trace+0x34/0x154
[14977.928668] __kmalloc+0x548/0x7e4
[14977.928673] dma_direct_alloc+0x11c/0x4f8
[14977.928678] dma_alloc_attrs+0xf4/0x138
[14977.928680] gh_vm_ioctl_set_fw_name+0x3c4/0x610 [gunyah]
[14977.928698] gh_vm_ioctl+0x90/0x14c [gunyah]
[14977.928705] __arm64_sys_ioctl+0x184/0x210

work around by doing kvmalloc_array instead.

Signed-off-by: Gao Xu <gaoxu2@hihonor.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-07 15:06:28 +02:00
Sui Jingfeng 4d3af20eaf dma-mapping: fix a Kconfig typo
'thing' -> 'think'

Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-07 15:06:09 +02:00
Jiri Olsa f46fab0e36 bpf: Add extra path pointer check to d_path helper
Anastasios reported crash on stable 5.15 kernel with following
BPF attached to lsm hook:

  SEC("lsm.s/bprm_creds_for_exec")
  int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm)
  {
          struct path *path = &bprm->executable->f_path;
          char p[128] = { 0 };

          bpf_d_path(path, p, 128);
          return 0;
  }

But bprm->executable can be NULL, so bpf_d_path call will crash:

  BUG: kernel NULL pointer dereference, address: 0000000000000018
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
  ...
  RIP: 0010:d_path+0x22/0x280
  ...
  Call Trace:
   <TASK>
   bpf_d_path+0x21/0x60
   bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99
   bpf_trampoline_6442506293_0+0x55/0x1000
   bpf_lsm_bprm_creds_for_exec+0x5/0x10
   security_bprm_creds_for_exec+0x29/0x40
   bprm_execve+0x1c1/0x900
   do_execveat_common.isra.0+0x1af/0x260
   __x64_sys_execve+0x32/0x40

It's problem for all stable trees with bpf_d_path helper, which was
added in 5.9.

This issue is fixed in current bpf code, where we identify and mark
trusted pointers, so the above code would fail even to load.

For the sake of the stable trees and to workaround potentially broken
verifier in the future, adding the code that reads the path object from
the passed pointer and verifies it's valid in kernel space.

Fixes: 6e22ab9da7 ("bpf: Add d_path helper")
Reported-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230606181714.532998-1-jolsa@kernel.org
2023-06-07 15:03:43 +02:00
Hou Tao aa7881fcfe bpf: Factor out a common helper free_all()
Factor out a common helper free_all() to free all normal elements or
per-cpu elements on a lock-less list.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230606035310.4026145-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-06 13:40:07 -07:00
Masami Hiramatsu (Google) fd26290ec8 tracing/probes: Add BTF retval type support
Check the target function has non-void retval type and set the correct
fetch type if user doesn't specify it.
If the function returns void, $retval is rejected as below;

 # echo 'f unregister_kprobes%return $retval' >> dynamic_events
sh: write error: No such file or directory
 # cat error_log
[   37.488397] trace_fprobe: error: This function returns 'void' type
  Command: f unregister_kprobes%return $retval
                                       ^
Link: https://lore.kernel.org/all/168507476195.913472.16290308831790216609.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:56 +09:00
Masami Hiramatsu (Google) 18b1e870a4 tracing/probes: Add $arg* meta argument for all function args
Add the '$arg*' meta fetch argument for function-entry probe events. This
will be expanded to the all arguments of the function and the tracepoint
using BTF function argument information.

e.g.
 #  echo 'p vfs_read $arg*' >> dynamic_events
 #  echo 'f vfs_write $arg*' >> dynamic_events
 #  echo 't sched_overutilized_tp $arg*' >> dynamic_events
 # cat dynamic_events
p:kprobes/p_vfs_read_0 vfs_read file=file buf=buf count=count pos=pos
f:fprobes/vfs_write__entry vfs_write file=file buf=buf count=count pos=pos
t:tracepoints/sched_overutilized_tp sched_overutilized_tp rd=rd overutilized=overutilized

Also, single '$arg[0-9]*' will be converted to the BTF function argument.

NOTE: This seems like a wildcard, but a fake one at this moment. This
is just for telling user that this can be expanded to several arguments.
And it is not like other $-vars, you can not use this $arg* as a part of
fetch args, e.g. specifying name "foo=$arg*" and using it in dereferences
"+0($arg*)" will lead a parse error.

Link: https://lore.kernel.org/all/168507475126.913472.18329684401466211816.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:56 +09:00
Masami Hiramatsu (Google) b576e09701 tracing/probes: Support function parameters if BTF is available
Support function or tracepoint parameters by name if BTF support is enabled
and the event is for function entry (this feature can be used with kprobe-
events, fprobe-events and tracepoint probe events.)

Note that the BTF variable syntax does not require a prefix. If it starts
with an alphabetic character or an underscore ('_') without a prefix like
'$' and '%', it is considered as a BTF variable.
If you specify only the BTF variable name, the argument name will also
be the same name instead of 'arg*'.

 # echo 'p vfs_read count pos' >> dynamic_events
 # echo 'f vfs_write count pos' >> dynamic_events
 # echo 't sched_overutilized_tp rd overutilized' >> dynamic_events
 # cat dynamic_events
p:kprobes/p_vfs_read_0 vfs_read count=count pos=pos
f:fprobes/vfs_write__entry vfs_write count=count pos=pos
t:tracepoints/sched_overutilized_tp sched_overutilized_tp rd=rd overutilized=overutilized

Link: https://lore.kernel.org/all/168507474014.913472.16963996883278039183.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
2023-06-06 21:39:56 +09:00
Masami Hiramatsu (Google) 1b8b0cd754 tracing/probes: Move event parameter fetching code to common parser
Move trace event parameter fetching code to common parser in
trace_probe.c. This simplifies eprobe's trace-event variable fetching
code by introducing a parse context data structure.

Link: https://lore.kernel.org/all/168507472950.913472.2812253181558471278.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:56 +09:00
Masami Hiramatsu (Google) e2d0d7b2f4 tracing/probes: Add tracepoint support on fprobe_events
Allow fprobe_events to trace raw tracepoints so that user can trace
tracepoints which don't have traceevent wrappers. This new event is
always available if the fprobe_events is enabled (thus no kconfig),
because the fprobe_events depends on the trace-event and traceporint.

e.g.
 # echo 't sched_overutilized_tp' >> dynamic_events
 # echo 't 9p_client_req' >> dynamic_events
 # cat dynamic_events
t:tracepoints/sched_overutilized_tp sched_overutilized_tp
t:tracepoints/_9p_client_req 9p_client_req

The event name is based on the tracepoint name, but if it is started
with digit character, an underscore '_' will be added.

NOTE: to avoid further confusion, this renames TPARG_FL_TPOINT to
TPARG_FL_TEVENT because this flag is used for eprobe (trace-event probe).
And reuse TPARG_FL_TPOINT for this raw tracepoint probe.

Link: https://lore.kernel.org/all/168507471874.913472.17214624519622959593.stgit@mhiramat.roam.corp.google.com/

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305020453.afTJ3VVp-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:55 +09:00
Masami Hiramatsu (Google) 334e5519c3 tracing/probes: Add fprobe events for tracing function entry and exit.
Add fprobe events for tracing function entry and exit instead of kprobe
events. With this change, we can continue to trace function entry/exit
even if the CONFIG_KPROBES_ON_FTRACE is not available. Since
CONFIG_KPROBES_ON_FTRACE requires the CONFIG_DYNAMIC_FTRACE_WITH_REGS,
it is not available if the architecture only supports
CONFIG_DYNAMIC_FTRACE_WITH_ARGS. And that means kprobe events can not
probe function entry/exit effectively on such architecture.
But this can be solved if the dynamic events supports fprobe events.

The fprobe event is a new dynamic events which is only for the function
(symbol) entry and exit. This event accepts non register fetch arguments
so that user can trace the function arguments and return values.

The fprobe events syntax is here;

 f[:[GRP/][EVENT]] FUNCTION [FETCHARGS]
 f[MAXACTIVE][:[GRP/][EVENT]] FUNCTION%return [FETCHARGS]

E.g.

 # echo 'f vfs_read $arg1'  >> dynamic_events
 # echo 'f vfs_read%return $retval'  >> dynamic_events
 # cat dynamic_events
 f:fprobes/vfs_read__entry vfs_read arg1=$arg1
 f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
 # echo 1 > events/fprobes/enable
 # head -n 20 trace | tail
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
              sh-142     [005] ...1.   448.386420: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
              sh-142     [005] .....   448.386436: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
              sh-142     [005] ...1.   448.386451: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
              sh-142     [005] .....   448.386458: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
              sh-142     [005] ...1.   448.386469: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
              sh-142     [005] .....   448.386476: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
              sh-142     [005] ...1.   448.602073: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
              sh-142     [005] .....   448.602089: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1

Link: https://lore.kernel.org/all/168507469754.913472.6112857614708350210.stgit@mhiramat.roam.corp.google.com/

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/all/202302011530.7vm4O8Ro-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:55 +09:00
Masami Hiramatsu (Google) 30460c21ed tracing/probes: Avoid setting TPARG_FL_FENTRY and TPARG_FL_RETURN
When parsing a kprobe event, the return probe always sets both
TPARG_FL_RETURN and TPARG_FL_FENTRY, but this is not useful because
some fetchargs are only for return probe and some others only for
function entry. Make it obviously mutual exclusive.

Link: https://lore.kernel.org/all/168507468731.913472.11354553441385410734.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:55 +09:00
Masami Hiramatsu (Google) cb16330d12 fprobe: Pass return address to the handlers
Pass return address as 'ret_ip' to the fprobe entry and return handlers
so that the fprobe user handler can get the reutrn address without
analyzing arch-dependent pt_regs.

Link: https://lore.kernel.org/all/168507467664.913472.11642316698862778600.stgit@mhiramat.roam.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-06-06 21:39:55 +09:00
Siddh Raman Pant 943211c874 watch_queue: prevent dangling pipe pointer
NULL the dangling pipe reference while clearing watch_queue.

If not done, a reference to a freed pipe remains in the watch_queue,
as this function is called before freeing a pipe in free_pipe_info()
(see line 834 of fs/pipe.c).

The sole use of wqueue->defunct is for checking if the watch queue has
been cleared, but wqueue->pipe is also NULLed while clearing.

Thus, wqueue->defunct is superfluous, as wqueue->pipe can be checked
for NULL. Hence, the former can be removed.

Tested with keyutils testsuite.

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Acked-by: David Howells <dhowells@redhat.com>
Message-Id: <20230605143616.640517-1-code@siddh.me>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-06 10:47:04 +02:00
Peter Zijlstra 228020b490 perf: Re-instate the linear PMU search
Full revert of commit 9551fbb64d ("perf/core: Remove pmu linear
searching code").

Some architectures (notably arm/arm64) still relied on the linear
search in order to find the PMU that consumes
PERF_TYPE_{HARDWARE,HW_CACHE,RAW}.

This will need a more thorought audit and cleanup.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230605101401.GL38236@hirez.programming.kicks-ass.net
2023-06-06 10:09:21 +02:00
Florian Westphal 132328e8e8 bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
Andrii Nakryiko writes:

 And we currently don't have an attach type for NETLINK BPF link.
 Thankfully it's not too late to add it. I see that link_create() in
 kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
 have done that. Instead we need to add BPF_NETLINK attach type to enum
 bpf_attach_type. And wire all that properly throughout the kernel and
 libbpf itself.

This adds BPF_NETFILTER and uses it.  This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.

v2: check link_attack prog type in link_create too

Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
2023-06-05 15:01:43 -07:00
David Vernet 51302c951c bpf: Teach verifier that trusted PTR_TO_BTF_ID pointers are non-NULL
In reg_type_not_null(), we currently assume that a pointer may be NULL
if it has the PTR_MAYBE_NULL modifier, or if it doesn't belong to one of
several base type of pointers that are never NULL-able. For example,
PTR_TO_CTX, PTR_TO_MAP_VALUE, etc.

It turns out that in some cases, PTR_TO_BTF_ID can never be NULL as
well, though we currently don't specify it. For example, if you had the
following program:

SEC("tc")
long example_refcnt_fail(void *ctx)
{
	struct bpf_cpumask *mask1, *mask2;

	mask1 = bpf_cpumask_create();
	mask2 = bpf_cpumask_create();

        if (!mask1 || !mask2)
		goto error_release;

	bpf_cpumask_test_cpu(0, (const struct cpumask *)mask1);
	bpf_cpumask_test_cpu(0, (const struct cpumask *)mask2);

error_release:
	if (mask1)
		bpf_cpumask_release(mask1);
	if (mask2)
		bpf_cpumask_release(mask2);
	return ret;
}

The verifier will incorrectly fail to load the program, thinking
(unintuitively) that we have a possibly-unreleased reference if the mask
is NULL, because we (correctly) don't issue a bpf_cpumask_release() on
the NULL path.

The reason the verifier gets confused is due to the fact that we don't
explicitly tell the verifier that trusted PTR_TO_BTF_ID pointers can
never be NULL. Basically, if we successfully get past the if check
(meaning both pointers go from ptr_or_null_bpf_cpumask to
ptr_bpf_cpumask), the verifier will correctly assume that the references
need to be dropped on any possible branch that leads to program exit.
However, it will _incorrectly_ think that the ptr == NULL branch is
possible, and will erroneously detect it as a branch on which we failed
to drop the reference.

The solution is of course to teach the verifier that trusted
PTR_TO_BTF_ID pointers can never be NULL, so that it doesn't incorrectly
think it's possible for the reference to be present on the ptr == NULL
branch.

A follow-on patch will add a selftest that verifies this behavior.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230602150112.1494194-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 14:36:57 -07:00
Daniel T. Lee 503e4def54 bpf: Replace open code with for allocated object check
>From commit 282de143ea ("bpf: Introduce allocated objects support"),
With this allocated object with BPF program, (PTR_TO_BTF_ID | MEM_ALLOC)
has been a way of indicating to check the type is the allocated object.

commit d8939cb0a0 ("bpf: Loosen alloc obj test in verifier's
reg_btf_record")
>From the commit, there has been helper function for checking this, named
type_is_ptr_alloc_obj(). But still, some of the code use open code to
retrieve this info. This commit replaces the open code with the
type_is_alloc(), and the type_is_ptr_alloc_obj() function.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230527122706.59315-1-danieltimlee@gmail.com
2023-06-05 14:33:17 -07:00
Miaohe Lin 0dad9b072b cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static
They're only called inside cgroup.c. Make them static.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05 10:31:41 -10:00
Dave Marchevsky 7793fc3bab bpf: Make bpf_refcount_acquire fallible for non-owning refs
This patch fixes an incorrect assumption made in the original
bpf_refcount series [0], specifically that the BPF program calling
bpf_refcount_acquire on some node can always guarantee that the node is
alive. In that series, the patch adding failure behavior to rbtree_add
and list_push_{front, back} breaks this assumption for non-owning
references.

Consider the following program:

  n = bpf_kptr_xchg(&mapval, NULL);
  /* skip error checking */

  bpf_spin_lock(&l);
  if(bpf_rbtree_add(&t, &n->rb, less)) {
    bpf_refcount_acquire(n);
    /* Failed to add, do something else with the node */
  }
  bpf_spin_unlock(&l);

It's incorrect to assume that bpf_refcount_acquire will always succeed in this
scenario. bpf_refcount_acquire is being called in a critical section
here, but the lock being held is associated with rbtree t, which isn't
necessarily the lock associated with the tree that the node is already
in. So after bpf_rbtree_add fails to add the node and calls bpf_obj_drop
in it, the program has no ownership of the node's lifetime. Therefore
the node's refcount can be decr'd to 0 at any time after the failing
rbtree_add. If this happens before the refcount_acquire above, the node
might be free'd, and regardless refcount_acquire will be incrementing a
0 refcount.

Later patches in the series exercise this scenario, resulting in the
expected complaint from the kernel (without this patch's changes):

  refcount_t: addition on 0; use-after-free.
  WARNING: CPU: 1 PID: 207 at lib/refcount.c:25 refcount_warn_saturate+0xbc/0x110
  Modules linked in: bpf_testmod(O)
  CPU: 1 PID: 207 Comm: test_progs Tainted: G           O       6.3.0-rc7-02231-g723de1a718a2-dirty #371
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  RIP: 0010:refcount_warn_saturate+0xbc/0x110
  Code: 6f 64 f6 02 01 e8 84 a3 5c ff 0f 0b eb 9d 80 3d 5e 64 f6 02 00 75 94 48 c7 c7 e0 13 d2 82 c6 05 4e 64 f6 02 01 e8 64 a3 5c ff <0f> 0b e9 7a ff ff ff 80 3d 38 64 f6 02 00 0f 85 6d ff ff ff 48 c7
  RSP: 0018:ffff88810b9179b0 EFLAGS: 00010082
  RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
  RDX: 0000000000000202 RSI: 0000000000000008 RDI: ffffffff857c3680
  RBP: ffff88810027d3c0 R08: ffffffff8125f2a4 R09: ffff88810b9176e7
  R10: ffffed1021722edc R11: 746e756f63666572 R12: ffff88810027d388
  R13: ffff88810027d3c0 R14: ffffc900005fe030 R15: ffffc900005fe048
  FS:  00007fee0584a700(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00005634a96f6c58 CR3: 0000000108ce9002 CR4: 0000000000770ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <TASK>
   bpf_refcount_acquire_impl+0xb5/0xc0

  (rest of output snipped)

The patch addresses this by changing bpf_refcount_acquire_impl to use
refcount_inc_not_zero instead of refcount_inc and marking
bpf_refcount_acquire KF_RET_NULL.

For owning references, though, we know the above scenario is not possible
and thus that bpf_refcount_acquire will always succeed. Some verifier
bookkeeping is added to track "is input owning ref?" for bpf_refcount_acquire
calls and return false from is_kfunc_ret_null for bpf_refcount_acquire on
owning refs despite it being marked KF_RET_NULL.

Existing selftests using bpf_refcount_acquire are modified where
necessary to NULL-check its return value.

  [0]: https://lore.kernel.org/bpf/20230415201811.343116-1-davemarchevsky@fb.com/

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:20 -07:00
Dave Marchevsky cc0d76cafe bpf: Fix __bpf_{list,rbtree}_add's beginning-of-node calculation
Given the pointer to struct bpf_{rb,list}_node within a local kptr and
the byte offset of that field within the kptr struct, the calculation changed
by this patch is meant to find the beginning of the kptr so that it can
be passed to bpf_obj_drop.

Unfortunately instead of doing

  ptr_to_kptr = ptr_to_node_field - offset_bytes

the calculation is erroneously doing

  ptr_to_ktpr = ptr_to_node_field - (offset_bytes * sizeof(struct bpf_rb_node))

or the bpf_list_node equivalent.

This patch fixes the calculation.

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:19 -07:00
Dave Marchevsky 2140a6e342 bpf: Set kptr_struct_meta for node param to list and rbtree insert funcs
In verifier.c, fixup_kfunc_call uses struct bpf_insn_aux_data's
kptr_struct_meta field to pass information about local kptr types to
various helpers and kfuncs at runtime. The recent bpf_refcount series
added a few functions to the set that need this information:

  * bpf_refcount_acquire
    * Needs to know where the refcount field is in order to increment
  * Graph collection insert kfuncs: bpf_rbtree_add, bpf_list_push_{front,back}
    * Were migrated to possibly fail by the bpf_refcount series. If
      insert fails, the input node is bpf_obj_drop'd. bpf_obj_drop needs
      the kptr_struct_meta in order to decr refcount and properly free
      special fields.

Unfortunately the verifier handling of collection insert kfuncs was not
modified to actually populate kptr_struct_meta. Accordingly, when the
node input to those kfuncs is passed to bpf_obj_drop, it is done so
without the information necessary to decr refcount.

This patch fixes the issue by populating kptr_struct_meta for those
kfuncs.

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:19 -07:00
Gaosheng Cui c20d4d889f rdmacg: fix kernel-doc warnings in rdmacg
Fix all kernel-doc warnings in rdmacg:

kernel/cgroup/rdma.c:209: warning: Function parameter or member 'cg'
not described in 'rdmacg_uncharge_hierarchy'
kernel/cgroup/rdma.c:230: warning: Function parameter or member 'cg'
not described in 'rdmacg_uncharge'

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05 09:45:14 -10:00
Gaosheng Cui 7bf11e90a3 cgroup: Replace the css_set call with cgroup_get
We will release the refcnt of cgroup via cgroup_put, for example
in the cgroup_lock_and_drain_offline function, so replace css_get
with the cgroup_get function for better readability.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05 09:42:42 -10:00
Miaohe Lin a49a11dc64 cgroup: remove unused macro for_each_e_css()
for_each_e_css() is unused now. Remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05 09:20:20 -10:00
Dietmar Eggemann 7d0583cf9e sched/fair, cpufreq: Introduce 'runnable boosting'
The responsiveness of the Per Entity Load Tracking (PELT) util_avg in
mobile devices is still considered too low for utilization changes
during task ramp-up.

In Android this manifests in the fact that the first frames of a UI
activity are very prone to be jankframes (a frame which doesn't meet
the required frame rendering time, e.g. 16ms@60Hz) since the CPU
frequency is normally low at this point and has to ramp up quickly.

The beginning of an UI activity is also characterized by the occurrence
of CPU contention, especially on little CPUs. Current little CPUs can
have an original CPU capacity of only ~ 150 which means that the actual
CPU capacity at lower frequency can even be much smaller.

Schedutil maps CPU util_avg into CPU frequency request via:

  util = effective_cpu_util(..., cpu_util_cfs(cpu), ...) ->
  util = map_util_perf(util) -> freq = map_util_freq(util, ...)

CPU contention for CFS tasks can be detected by 'CPU runnable > CPU
utililization' in cpu_util_cfs_boost() -> cpu_util(..., boost = 1).
Schedutil uses 'runnable boosting' by calling cpu_util_cfs_boost().

To be in sync with schedutil's CPU frequency selection, Energy Aware
Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util
detection.

Moreover, 'runnable boosting' is also used in load-balance for busiest
CPU selection when the migration type is 'migrate_util', i.e. only at
sched domains which don't have the SD_SHARE_PKG_RESOURCES flag set.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-3-dietmar.eggemann@arm.com
2023-06-05 21:13:44 +02:00
Dietmar Eggemann 3eb6d6ecec sched/fair: Refactor CPU utilization functions
There is a lot of code duplication in cpu_util_next() & cpu_util_cfs().

Remove this by allowing cpu_util_next() to be called with p = NULL.
Rename cpu_util_next() to cpu_util() since the '_next' suffix is no
longer necessary to distinct cpu utilization related functions.
Implement cpu_util_cfs(cpu) as cpu_util(cpu, p = NULL, -1).

This will allow to code future related cpu util changes only in one
place, namely in cpu_util().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-2-dietmar.eggemann@arm.com
2023-06-05 21:13:43 +02:00
Peter Zijlstra fb7d4948c4 sched/clock: Provide local_clock_noinstr()
Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch,
s390, x86) provide sched_clock_noinstr(), use this to provide
local_clock_noinstr().

This local_clock_noinstr() will be safe to use from noinstr code with
the assumption that any such noinstr code is non-preemptible (it had
better be, entry code will have IRQs disabled while __cpuidle must
have preemption disabled).

Specifically, preempt_enable_notrace(), a common part of many a
sched_clock() implementation calls out to schedule() -- even though,
per the above, it will never trigger -- which frustrates noinstr
validation.

  vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org
2023-06-05 21:11:09 +02:00
Peter Zijlstra 5949a68c73 time/sched_clock: Provide sched_clock_noinstr()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_noinstr() function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

As such, pull out the preempt_{dis,en}able_notrace() requirements from
the sched_clock_read() implementations by explicitly providing it in
the sched_clock() function.

This further requires said sched_clock_read() functions to be noinstr
themselves, for ARCH_WANTS_NO_INSTR users. See the next few patches.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.302350330@infradead.org
2023-06-05 21:11:04 +02:00
Peter Zijlstra d16317de9b seqlock/latch: Provide raw_read_seqcount_latch_retry()
The read side of seqcount_latch consists of:

  do {
    seq = raw_read_seqcount_latch(&latch->seq);
    ...
  } while (read_seqcount_latch_retry(&latch->seq, seq));

which is asymmetric in the raw_ department, and sure enough,
read_seqcount_latch_retry() includes (explicit) instrumentation where
raw_read_seqcount_latch() does not.

This inconsistency becomes a problem when trying to use it from
noinstr code. As such, fix it by renaming and re-implementing
raw_read_seqcount_latch_retry() without the instrumentation.

Specifically the instrumentation in question is kcsan_atomic_next(0)
in do___read_seqcount_retry(). Loosing this annotation is not a
problem because raw_read_seqcount_latch() does not pass through
kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.233598176@infradead.org
2023-06-05 21:11:03 +02:00
Peter Zijlstra 1c06918788 sched: Consider task_struct::saved_state in wait_task_inactive()
With the introduction of task_struct::saved_state in commit
5f220be214 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
matching the task state has gotten more complicated. That same commit
changed try_to_wake_up() to consider both states, but
wait_task_inactive() has been neglected.

Sebastian noted that the wait_task_inactive() usage in
ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
the tasklist_lock after it sets TASK_TRACED.

Therefore extract a common helper from ttwu_state_match() and use that
to teach wait_task_inactive() about the PREEMPT_RT locks.

Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net
2023-06-05 21:11:03 +02:00
Peter Zijlstra d5e1586617 sched: Unconditionally use full-fat wait_task_inactive()
While modifying wait_task_inactive() for PREEMPT_RT; the build robot
noted that UP got broken. This led to audit and consideration of the
UP implementation of wait_task_inactive().

It looks like the UP implementation is also broken for PREEMPT;
consider task_current_syscall() getting preempted between the two
calls to wait_task_inactive().

Therefore move the wait_task_inactive() implementation out of
CONFIG_SMP and unconditionally use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net
2023-06-05 21:11:02 +02:00
Yicong Yang 0dd37d6dd3 sched/fair: Don't balance task to its current running CPU
We've run into the case that the balancer tries to balance a migration
disabled task and trigger the warning in set_task_cpu() like below:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
 Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
 CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G           O       6.1.0-rc4+ #1
 Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
 pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : set_task_cpu+0x188/0x240
 lr : load_balance+0x5d0/0xc60
 sp : ffff80000803bc70
 x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
 x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
 x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
 x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
 x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
 x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
 x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
 x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
 x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
 x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
 Call trace:
  set_task_cpu+0x188/0x240
  load_balance+0x5d0/0xc60
  rebalance_domains+0x26c/0x380
  _nohz_idle_balance.isra.0+0x1e0/0x370
  run_rebalance_domains+0x6c/0x80
  __do_softirq+0x128/0x3d8
  ____do_softirq+0x18/0x24
  call_on_irq_stack+0x2c/0x38
  do_softirq_own_stack+0x24/0x3c
  __irq_exit_rcu+0xcc/0xf4
  irq_exit_rcu+0x18/0x24
  el1_interrupt+0x4c/0xe4
  el1h_64_irq_handler+0x18/0x2c
  el1h_64_irq+0x74/0x78
  arch_cpu_idle+0x18/0x4c
  default_idle_call+0x58/0x194
  do_idle+0x244/0x2b0
  cpu_startup_entry+0x30/0x3c
  secondary_start_kernel+0x14c/0x190
  __secondary_switched+0xb0/0xb4
 ---[ end trace 0000000000000000 ]---

Further investigation shows that the warning is superfluous, the migration
disabled task is just going to be migrated to its current running CPU.
This is because that on load balance if the dst_cpu is not allowed by the
task, we'll re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu we'll try to balance the task to the new_dst_cpu
instead. In this case when the migration disabled task is not on CPU it
only allows to run on its current CPU, load balance will select its
current CPU as new_dst_cpu and later triggers the warning above.

The new_dst_cpu is chosen from the env->dst_grpmask. Currently it
contains CPUs in sched_group_span() and if we have overlapped groups it's
possible to run into this case. This patch makes env->dst_grpmask of
group_balance_mask() which exclude any CPUs from the busiest group and
solve the issue. For balancing in a domain with no overlapped groups
the behaviour keeps same as before.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
2023-06-05 21:08:24 +02:00
Christoph Hellwig 1e8c813b08 PM: hibernate: don't use early_lookup_bdev in resume_store
resume_store is a sysfs attribute written during normal kernel runtime,
and it should not use the early_lookup_bdev API that bypasses all normal
path based permission checking, and might cause problems with certain
container environments renaming devices.

Switch to lookup_bdev, which does a normal path lookup instead, and fall
back to trying to parse a numeric dev_t just like early_lookup_bdev did.

Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems.  They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again.  But if
avoidable in any way I'd rather avoid that.

Fixes: 421a5fa1a6 ("PM / hibernate: use name_to_dev_t to parse resume")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:57:40 -06:00
Christoph Hellwig cf056a4312 init: improve the name_to_dev_t interface
name_to_dev_t has a very misleading name, that doesn't make clear
it should only be used by the early init code, and also has a bad
calling convention that doesn't allow returning different kinds of
errors.  Rename it to early_lookup_bdev to make the use case clear,
and return an errno, where -EINVAL means the string could not be
parsed, and -ENODEV means it the string was valid, but there was
no device found for it.

Also stub out the whole call for !CONFIG_BLOCK as all the non-block
root cases are always covered in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:56:46 -06:00
Christoph Hellwig cc89c63e2f PM: hibernate: move finding the resume device out of software_resume
software_resume can be called either from an init call in the boot code,
or from sysfs once the system has finished booting, and the two
invocation methods this can't race with each other.

For the latter case we did just parse the suspend device manually, while
the former might not have one.  Split software_resume so that the search
only happens for the boot case, which also means the special lockdep
nesting annotation can go away as the system transition mutex can be
taken a little later and doesn't have the sysfs locking nest inside it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:55:20 -06:00
Christoph Hellwig d6545e6872 PM: hibernate: remove the global snapshot_test variable
Passing call dependent variable in global variables is a huge
antipattern.  Fix it up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:55:20 -06:00
Christoph Hellwig 02b42d58f3 PM: hibernate: factor out a helper to find the resume device
Split the logic to find the resume device out software_resume and into
a separate helper to start unwindig the convoluted goto logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:55:20 -06:00
Christoph Hellwig 0718afd47f block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Josh Poimboeuf 42cffe980c livepatch: Make 'klp_stack_entries' static
The 'klp_stack_entries' percpu array is only used in transition.c.  Make
it static.

Fixes: e92606fa17 ("livepatch: Convert stack entries array to percpu")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202305171329.i0UQ4TJa-lkp@intel.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/5115752fca6537720700f4bf5b178959dfbca41a.1685488550.git.jpoimboe@kernel.org
2023-06-05 13:56:52 +02:00
Mark Rutland 0f613bfa82 locking/atomic: treewide: use raw_atomic*_<op>()
Now that we have raw_atomic*_<op>() definitions, there's no need to use
arch_atomic*_<op>() definitions outside of the low-level atomic
definitions.

Move treewide users of arch_atomic*_<op>() over to the equivalent
raw_atomic*_<op>().

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com
2023-06-05 09:57:20 +02:00
Linus Torvalds 51f269a6ec Probes fixes for 6.4-rc4:
- Return NULL if the trace_probe list on trace_probe_event is empty.
 
 - selftests/ftrace: Choose testing symbol name for filtering feature
   from sample data instead of fixed symbol.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmR640AACgkQ2/sHvwUr
 PxugGgf/YwwocmUqiEtTukTB7fzoAjYyQXr0YaJM+DjeZXMqAJ4dl9tV1/AmAL4j
 iWtZd53aolTym/3P2VADfSc4xiyWjFdkYv7zRPjpqfMg3XsELJgshwz+12dmmMdx
 0uco1l2/Ge3JNPK6BuWaO3V44QjoPSgiRsmxxKLh5K7M9V5swL7fadoLtins1B0r
 TVVqdyEHQkZLTByexg7wHYd/ro+4lexv1yhvyP4rEmYRPDoR56eOF2zwcQMHPvaY
 qstdP2ce6m5rG0gp4TsY7oRkezb64y903hNQuumoU6VR9nI3IK4PZjuX5/xns2By
 G9mRaOqb02+UmP+HhX4QGmr92G9Vyw==
 =o07w
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - Return NULL if the trace_probe list on trace_probe_event is empty

 - selftests/ftrace: Choose testing symbol name for filtering feature
   from sample data instead of fixed symbol

* tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  selftests/ftrace: Choose target function for filter test from samples
  tracing/probe: trace_probe_primary_from_call(): checked list_first_entry
2023-06-03 08:23:16 -04:00
Rhys Rustad-Elliott cba41bb78d bpf: Fix elem_size not being set for inner maps
Commit d937bc3449 ("bpf: make uniform use of array->elem_size
everywhere in arraymap.c") changed array_map_gen_lookup to use
array->elem_size instead of round_up(map->value_size, 8) as the element
size when generating code to access a value in an array map.

array->elem_size, however, is not set by bpf_map_meta_alloc when
initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS.
This results in array_map_gen_lookup incorrectly outputting code that
always accesses index 0 in the array (as the index will be calculated
via a multiplication with the element size, which is incorrectly set to
0).

Set elem_size on the bpf_array object when allocating an array or hash
of maps to fix this.

Fixes: d937bc3449 ("bpf: make uniform use of array->elem_size everywhere in arraymap.c")
Signed-off-by: Rhys Rustad-Elliott <me@rhysre.net>
Link: https://lore.kernel.org/r/20230602190110.47068-2-me@rhysre.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-06-02 16:22:12 -07:00
KP Singh b0fd1852bc bpf: Fix UAF in task local storage
When task local storage was generalized for tracing programs, the
bpf_task_local_storage callback was moved from a BPF LSM hook
callback for security_task_free LSM hook to it's own callback. But a
failure case in bad_fork_cleanup_security was missed which, when
triggered, led to a dangling task owner pointer and a subsequent
use-after-free. Move the bpf_task_storage_free to the very end of
free_task to handle all failure cases.

This issue was noticed when a BPF LSM program was attached to the
task_alloc hook on a kernel with KASAN enabled. The program used
bpf_task_storage_get to copy the task local storage from the current
task to the new task being created.

Fixes: a10787e6d5 ("bpf: Enable task local storage for tracing programs")
Reported-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602002612.1117381-1-kpsingh@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-06-02 10:18:07 -07:00
Linus Torvalds c43a6ff9f9 modules-6.4-rc5-second-pull
A zstd fix by lucas as he tested zstd decompression support
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmR5D8USHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoingK8P/3XmCG83G1JULOgFso657i1qdsVbqQek
 Qzo7t9IniifEYZPu8+BTplrtwP3G7x9xlci94JWSftaDvRCWh7kULJ2EMYe61aQk
 eEiYu79AYQYDAYx2mUacvf8QeSWoJmxY7DyUHt9eZQTJ90VN6/svKThxLk2HbQvR
 LtDryHVJqVnH2iSdkP3GA/kVulvLfrQ4+entVhr609kFsvZz21WICtnPU5/xu3PJ
 /Cb8T5FvTjQO6+nCjKMWlBM5sfXSQDxMqcnppZO/sed0yPP47hGzljJM2tlWNkK7
 qb7IrXYW4SowjcIGzQIFHbPSBRM+02hhIF0iSK66kGVrDHocJ1u2pKhhzgsnZdfk
 Wjr1R1CHAthjr1e1S4sFAnl3VTXBCPtC2L2SdCR3aGb1EB2bEjPmZTex6HWWNeCV
 iBVLdJxyt0K6NQTlFe4b5ZE5j0JB2h/uDSpY9OrwMwkVA4BptKRlkUt+4EliPeaf
 lBNABLFDSK82x7bL9MSurzJhLOumj/9CMpl/WkwJYsT5RK5zkfcLpmh6YuRQlA/1
 xR4NKlJn3pVGjXrmAl2t6VJrSa/wCQjkUzd1xyilLI/oxk4uxelGFJiofCyl4zSF
 +A0MHbvrL6N8x6conbfWkQ4cFgSGLSneB5UgVe9myl2b/P3n6PCgyghulo7F9JYj
 G65Ty9CqKJRm
 =FVHd
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules fix from Luis Chamberlain:
 "A zstd fix by lucas as he tested zstd decompression support"

* tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module/decompress: Fix error checking on zstd decompression
2023-06-01 20:48:16 -04:00
Jakub Kicinski a03a91bd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab65634 ("sfc: fix error unwinds in TC offload")
  b6583d5e9e ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d0 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01 15:38:26 -07:00
Lucas De Marchi fadb74f9f2 module/decompress: Fix error checking on zstd decompression
While implementing support for in-kernel decompression in kmod,
finit_module() was returning a very suspicious value:

	finit_module(3, "", MODULE_INIT_COMPRESSED_FILE) = 18446744072717407296

It turns out the check for module_get_next_page() failing is wrong,
and hence the decompression was not really taking place. Invert
the condition to fix it.

Fixes: 169a58ad82 ("module/decompress: Support zstd in-kernel decompression")
Cc: stable@kernel.org
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-01 14:36:46 -07:00
Mike Christie f9010dbdce fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process.  2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
<michael.christie@oracle.com> which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d50 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Co-developed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-01 17:15:33 -04:00
Azeem Shaikh 76edc27eda clocksource: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Acked-by: John Stultz <jstultz@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230530163546.986188-1-azeemshaikh38@gmail.com
2023-06-01 11:24:50 -07:00
Azeem Shaikh ffadc37252 bpf: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1]. This is not
the case here, however, in an effort to remove strlcpy() completely [2],
lets replace strlcpy() here with strscpy(). No return values were used,
so a direct replacement is safe.

  [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
  [2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/bpf/20230530155659.309657-1-azeemshaikh38@gmail.com
2023-05-31 13:04:20 +02:00
Pietro Borrello 81d0fa4cb4 tracing/probe: trace_probe_primary_from_call(): checked list_first_entry
All callers of trace_probe_primary_from_call() check the return
value to be non NULL. However, the function returns
list_first_entry(&tpe->probes, ...) which can never be NULL.
Additionally, it does not check for the list being possibly empty,
possibly causing a type confusion on empty lists.
Use list_first_entry_or_null() which solves both problems.

Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/

Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-05-31 18:47:10 +09:00
Azeem Shaikh d0c2d66fcc ftrace: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230517145323.1522010-1-azeemshaikh38@gmail.com
2023-05-30 16:42:00 -07:00
Luis Chamberlain 01e6aac78b signal: move show_unhandled_signals sysctl to its own file
The show_unhandled_signals sysctl is the only sysctl for debug
left on kernel/sysctl.c. We've been moving the syctls out from
kernel/sysctl.c so to help avoid merge conflicts as the shared
array gets out of hand.

This change incurs simplifies sysctl registration by localizing
it where it should go for a penalty in size of increasing the
kernel by 23 bytes, we accept this given recent cleanups have
actually already saved us 1465 bytes in the prior commits.

./scripts/bloat-o-meter vmlinux.3-remove-dev-table vmlinux.4-remove-debug-table
add/remove: 3/1 grow/shrink: 0/1 up/down: 177/-154 (23)
Function                                     old     new   delta
signal_debug_table                             -     128    +128
init_signal_sysctls                            -      33     +33
__pfx_init_signal_sysctls                      -      16     +16
sysctl_init_bases                             85      59     -26
debug_table                                  128       -    -128
Total: Before=21256967, After=21256990, chg +0.00%

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-30 15:46:31 -07:00
Luis Chamberlain 996ef312f2 sysctl: remove empty dev table
Now that all the dev sysctls have been moved out we can remove the
dev sysctl base directory. We don't need to create base directories,
they are created for you as if using 'mkdir -p' with register_syctl()
and register_sysctl_init(). For details refer to sysctl_mkdir_p()
usage.

We save 90 bytes with this changes:

./scripts/bloat-o-meter vmlinux.2.remove-sysctl-table vmlinux.3-remove-dev-table
add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-90 (-90)
Function                                     old     new   delta
sysctl_init_bases                            111      85     -26
dev_table                                     64       -     -64
Total: Before=21257057, After=21256967, chg -0.00%

The empty dev table has been in place since the v2.5.0 days because
back then ordering was essentialy. But later commit 7ec66d0636
("sysctl: Stop requiring explicit management of sysctl directories"),
merged as of v3.4-rc1, the entire ordering of directories was replaced
by allowing sysctl directory autogeneration. This new mechanism
introduced on v3.4 allows for sysctl directories to automatically be
created for sysctl tables when they are needed and automatically removes
them when no sysctl tables use them. That commit also added a dedicated
struct ctl_dir as a new type for these autogenerated directories.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-30 15:46:11 -07:00
Yonghong Song e6c2f594ed bpf: Silence a warning in btf_type_id_size()
syzbot reported a warning in [1] with the following stacktrace:
  WARNING: CPU: 0 PID: 5005 at kernel/bpf/btf.c:1988 btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988
  ...
  RIP: 0010:btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988
  ...
  Call Trace:
   <TASK>
   map_check_btf kernel/bpf/syscall.c:1024 [inline]
   map_create+0x1157/0x1860 kernel/bpf/syscall.c:1198
   __sys_bpf+0x127f/0x5420 kernel/bpf/syscall.c:5040
   __do_sys_bpf kernel/bpf/syscall.c:5162 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5160 [inline]
   __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5160
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

With the following btf
  [1] DECL_TAG 'a' type_id=4 component_idx=-1
  [2] PTR '(anon)' type_id=0
  [3] TYPE_TAG 'a' type_id=2
  [4] VAR 'a' type_id=3, linkage=static
and when the bpf_attr.btf_key_type_id = 1 (DECL_TAG),
the following WARN_ON_ONCE in btf_type_id_size() is triggered:
  if (WARN_ON_ONCE(!btf_type_is_modifier(size_type) &&
                   !btf_type_is_var(size_type)))
          return NULL;

Note that 'return NULL' is the correct behavior as we don't want
a DECL_TAG type to be used as a btf_{key,value}_type_id even
for the case like 'DECL_TAG -> STRUCT'. So there
is no correctness issue here, we just want to silence warning.

To silence the warning, I added DECL_TAG as one of kinds in
btf_type_nosize() which will cause btf_type_id_size() returning
NULL earlier without the warning.

  [1] https://lore.kernel.org/bpf/000000000000e0df8d05fc75ba86@google.com/

Reported-by: syzbot+958967f249155967d42a@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230530205029.264910-1-yhs@fb.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-30 14:34:46 -07:00
Miaohe Lin 3f4bf7aa31 sched/deadline: remove unused dl_bandwidth
The default deadline bandwidth control structure has been removed since
commit eb77cf1c15 ("sched/deadline: Remove unused def_dl_bandwidth")
leading to unused init_dl_bandwidth() and struct dl_bandwidth. Remove
them to clean up the code.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230524102514.407486-1-linmiaohe@huawei.com
2023-05-30 22:46:26 +02:00
Arnd Bergmann 7aa55f2a59 sched/fair: Move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled. If both are turned off, the functions
are still defined but the misisng prototype causes a W=1 warning:

kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

Move the alternatives into the header as static inline functions with
the correct combination of #ifdef checks to avoid the warning without
adding even more complexity.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-6-arnd@kernel.org
2023-05-30 22:46:26 +02:00
Arnd Bergmann f7df852ad6 sched: Make task_vruntime_update() prototype visible
Having the prototype next to the caller but not visible to the callee causes
a W=1 warning:

kernel/sched/fair.c:11985:6: error: no previous prototype for 'task_vruntime_update' [-Werror=missing-prototypes]

Move this to a header, as we do for all other function declarations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-5-arnd@kernel.org
2023-05-30 22:46:26 +02:00
Arnd Bergmann c0bdfd72fb sched/fair: Hide unused init_cfs_bandwidth() stub
init_cfs_bandwidth() is only used when CONFIG_FAIR_GROUP_SCHED is
enabled, and without this causes a W=1 warning for the missing prototype:

kernel/sched/fair.c:6131:6: error: no previous prototype for 'init_cfs_bandwidth'

The normal implementation is only defined for CONFIG_CFS_BANDWIDTH,
so the stub exists when CFS_BANDWIDTH is disabled but FAIR_GROUP_SCHED
is enabled.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-4-arnd@kernel.org
2023-05-30 22:46:25 +02:00
Arnd Bergmann 378be384e0 sched: Add schedule_user() declaration
The schedule_user() function is used on powerpc and sparc architectures, but
only ever called from assembler, so it has no prototype, causing a harmless W=1
warning:

kernel/sched/core.c:6730:35: error: no previous prototype for 'schedule_user' [-Werror=missing-prototypes]

Add a prototype in sched/sched.h to shut up the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-3-arnd@kernel.org
2023-05-30 22:46:25 +02:00
Arnd Bergmann d55ebae3f3 sched: Hide unused sched_update_scaling()
This function is only used when CONFIG_SMP is enabled, without that there
is no caller and no prototype:

kernel/sched/fair.c:688:5: error: no previous prototype for 'sched_update_scaling' [-Werror=missing-prototypes

Hide the definition in the same #ifdef check as the declaration.

Fixes: 8a99b6833c ("sched: Move SCHED_DEBUG sysctl to debugfs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-2-arnd@kernel.org
2023-05-30 22:46:24 +02:00
Linus Torvalds 6d86b56f54 modules-6.4-rc5
A fix is provided for ia64. Even though ia64 is on life support it helps
 to fix issues if we can. Thanks to Linus for doing tons of the ia64
 debugging.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmR2JPsSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinwGIQAIrm0qpwVgDvh98B/anYDVbPpdn1U+/I
 s9LBUMZCMIVEHGoOUU7YV3QGMh27OFvsNW72ggrgPuCbOgPzAfyxLJZoNY1nLURO
 ZWSi2Jg4Om82BTP/Vw79yDjikLjJWAUKT/nHuyJxOnLKVsjKrlKuX+UAPXCqMv1O
 yukZnCoQx6c57iRFLrpGq/+OM4Y/vZ2w8zeb5/HOSvVglkIITlXvcMUXmk0JxtS5
 Rr5R14F59BZnpQD5F3hYYvIWycM2DYNdHA5FFPLt1US6TAXjYlk/hf6jgEmHxvHI
 jN6U3qG0Wm8VO9ZlPXzwKYTPmHhc5llXqSlkXILuYke79w1dS1BINJb7dg3LcGna
 eq2IX0ZwxC4SicvDDFWZGmriwIIYbR7jWrSmtTr7W1HIb5mICIrXSuG0150xaCYv
 fc79brpBRm5me+py+WYkFKqT3DKkGHdB85+c7iXPToZeHY8v6581t3R4XQSHzPF6
 JY0Uhlsb17Korbsl8Iw2FFQ8K8DRjLpeeQnDU7iwEuGpZrPSx872o8pDNH1rnmIp
 wIOjfnjWc5arlCPjWWc0Ga/np7ewBP4Zw4Ff4AFyZG1ISrTmzFBWQI7TqyemjC9v
 RTIBN5QmJIW6Zai/7oL5xNapRJCO250+dg1G8uOka2dklfnLdKxIBafWBa+/Znmn
 ZpmhFljq1w3y
 =h/VF
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules fix from Luis Chamberlain:
 "A fix is provided for ia64. Even though ia64 is on life support it
  helps to fix issues if we can. Thanks to Linus for doing tons of the
  ia64 debugging"

* tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module: fix module load for ia64
2023-05-30 15:49:40 -04:00
Jonathan Cameron 143f83e200 perf: Allow a PMU to have a parent
Some PMUs have well defined parents such as PCI devices.
As the device_initialize() and device_add() are all within
pmu_dev_alloc() which is called from perf_pmu_register()
there is no opportunity to set the parent from within a driver.

Add a struct device *parent field to struct pmu and use that
to set the parent.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20230526095824.16336-2-Jonathan.Cameron@huawei.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-05-30 11:20:35 -07:00
Song Liu db3e33dd8b module: fix module load for ia64
Frank reported boot regression in ia64 as:

ELILO v3.16 for EFI/IA-64
..
Uncompressing Linux... done
Loading file AC100221.initrd.img...done
[    0.000000] Linux version 6.4.0-rc3 (root@x4270) (ia64-linux-gcc
(GCC) 12.2.0, GNU ld (GNU Binutils) 2.39) #1 SMP Thu May 25 15:52:20
CEST 2023
[    0.000000] efi: EFI v1.1 by HP
[    0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fe2a000
ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fe28000
[    0.000000] PCDP: v3 at 0x3fe28000
[    0.000000] earlycon: uart8250 at MMIO 0x00000000f4050000 (options
'9600n8')
[    0.000000] printk: bootconsole [uart8250] enabled
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x000000003FE2A000 000028 (v02 HP    )
[    0.000000] ACPI: XSDT 0x000000003FE2A02C 0000CC (v01 HP     rx2620
00000000 HP   00000000)
[...]
[    3.793350] Run /init as init process
Loading, please wait...
Starting systemd-udevd version 252.6-1
[    3.951100] ------------[ cut here ]------------
[    3.951100] WARNING: CPU: 6 PID: 140 at kernel/module/main.c:1547
__layout_sections+0x370/0x3c0
[    3.949512] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.951100] Modules linked in:
[    3.951100] CPU: 6 PID: 140 Comm: (udev-worker) Not tainted 6.4.0-rc3 #1
[    3.956161] (udev-worker)[142]: Oops 11003706212352 [1]
[    3.951774] Hardware name: hp server rx2620                   , BIOS
04.29
11/30/2007
[    3.951774]
[    3.951774] Call Trace:
[    3.958339] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.956161] Modules linked in:
[    3.951774]  [<a0000001000156d0>] show_stack.part.0+0x30/0x60
[    3.951774]                                 sp=e000000183a67b20
bsp=e000000183a61628
[    3.956161]
[    3.956161]

which bisect to module_memory change [1].

Debug showed that ia64 uses some special sections:

__layout_sections: section .got (sh_flags 10000002) matched to MOD_INVALID
__layout_sections: section .sdata (sh_flags 10000003) matched to MOD_INVALID
__layout_sections: section .sbss (sh_flags 10000003) matched to MOD_INVALID

All these sections are loaded to module core memory before [1].

Fix ia64 boot by loading these sections to MOD_DATA (core rw data).

[1] commit ac3b432839 ("module: replace module_layout with module_memory")

Fixes: ac3b432839 ("module: replace module_layout with module_memory")
Reported-by: Frank Scheiner <frank.scheiner@web.de>
Closes: https://lists.debian.org/debian-ia64/2023/05/msg00010.html
Closes: https://marc.info/?l=linux-ia64&m=168509859125505
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-30 09:29:40 -07:00
Linus Torvalds 8b817fded4 Tracing fixes:
User events:
 
  - Use long instead of int for storing the enable set/clear bit, as it was
    found that big endian machines could end up using the wrong bits.
 
  - Split allocating mm and attaching it. This keeps the allocation separate
    from the registration and avoids various races.
 
  - Remove RCU locking around pin_user_pages_remote() as that can schedule. The
    RCU protection is no longer needed with the above split of mm allocation and
    attaching.
 
  - Rename the "link" fields of the various structs to something more
    meaningful.
 
  - Add comments around user_event_mm struct usage and locking requirements.
 
 Timerlat tracer:
 
  - Fix missed wakeup of timerlat thread caused by the timerlat interrupt
    triggering when tracing is off. The timer interrupt handler needs to always
    wake up the timerlat thread regardless if tracing is enabled or not,
    otherwise, it will never wake up.
 
 Histograms:
 
  - Fix regression of breaking the "stacktrace" modifier for variables. That
    modifier cannot be used for values, but can be used for variables that are
    passed from one histogram to the next. This was broken when adding the
    restriction to values as the variable logic used the same code.
 
  - Rename the special field "stacktrace" to "common_stacktrace". Special fields
    (that are not actually part of the event, but can act just like event
    fields, like 'comm' and 'timestamp') should be prefixed with 'common_' for
    consistency. To keep backward compatibility, 'stacktrace' can still be used
    (as with the special field 'cpu'), but can be overridden if the event has a
    field called 'stacktrace'.
 
  - Update the synthetic event selftests to use the new name (synthetic events
    are created by histograms)
 
 Tracing bootup selftests:
 
  - Reorganize the code to keep artifacts of the selftests not compiled in when
    selftests are not configured.
 
  - Add various cond_resched() around the selftest code, as the softlock
    watchdog was triggering much more often. It appears that the kernel runs
    slower now with full debugging enabled.
 
  - While debugging ftrace with ftrace (using an instance ring buffer instead of
    the top level one), I found that the selftests were disabling prints to the
    debug instance. This should not happen, as the selftests only disable
    printing to the main buffer as the selftests examine the main buffer to see
    if it has what it expects, and prints can make the tests fail. Make the
    selftests only disable printing to the toplevel buffer, and leave the
    instance buffers alone.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZHQGJBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qu6hAQCJ1WebZUTJ/s7pFo36mXirLnrW4afB
 Ua6sALseqKNesgEAyhLmd2+sMeqmAbCCIUWtcWJb/Pod0jGOt0U8+cBxfw8=
 =PhaX
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "User events:

   - Use long instead of int for storing the enable set/clear bit, as it
     was found that big endian machines could end up using the wrong
     bits.

   - Split allocating mm and attaching it. This keeps the allocation
     separate from the registration and avoids various races.

   - Remove RCU locking around pin_user_pages_remote() as that can
     schedule. The RCU protection is no longer needed with the above
     split of mm allocation and attaching.

   - Rename the "link" fields of the various structs to something more
     meaningful.

   - Add comments around user_event_mm struct usage and locking
     requirements.

  Timerlat tracer:

   - Fix missed wakeup of timerlat thread caused by the timerlat
     interrupt triggering when tracing is off. The timer interrupt
     handler needs to always wake up the timerlat thread regardless if
     tracing is enabled or not, otherwise, it will never wake up.

  Histograms:

   - Fix regression of breaking the "stacktrace" modifier for variables.
     That modifier cannot be used for values, but can be used for
     variables that are passed from one histogram to the next. This was
     broken when adding the restriction to values as the variable logic
     used the same code.

   - Rename the special field "stacktrace" to "common_stacktrace".

     Special fields (that are not actually part of the event, but can
     act just like event fields, like 'comm' and 'timestamp') should be
     prefixed with 'common_' for consistency. To keep backward
     compatibility, 'stacktrace' can still be used (as with the special
     field 'cpu'), but can be overridden if the event has a field called
     'stacktrace'.

   - Update the synthetic event selftests to use the new name (synthetic
     events are created by histograms)

  Tracing bootup selftests:

   - Reorganize the code to keep artifacts of the selftests not compiled
     in when selftests are not configured.

   - Add various cond_resched() around the selftest code, as the
     softlock watchdog was triggering much more often. It appears that
     the kernel runs slower now with full debugging enabled.

   - While debugging ftrace with ftrace (using an instance ring buffer
     instead of the top level one), I found that the selftests were
     disabling prints to the debug instance.

     This should not happen, as the selftests only disable printing to
     the main buffer as the selftests examine the main buffer to see if
     it has what it expects, and prints can make the tests fail.

     Make the selftests only disable printing to the toplevel buffer,
     and leave the instance buffers alone"

* tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Have function_graph selftest call cond_resched()
  tracing: Only make selftest conditionals affect the global_trace
  tracing: Make tracing_selftest_running/delete nops when not used
  tracing: Have tracer selftests call cond_resched() before running
  tracing: Move setting of tracing_selftest_running out of register_tracer()
  tracing/selftests: Update synthetic event selftest to use common_stacktrace
  tracing: Rename stacktrace field to common_stacktrace
  tracing/histograms: Allow variables to have some modifiers
  tracing/user_events: Document user_event_mm one-shot list usage
  tracing/user_events: Rename link fields for clarity
  tracing/user_events: Remove RCU lock while pinning pages
  tracing/user_events: Split up mm alloc and attach
  tracing/timerlat: Always wakeup the timerlat thread
  tracing/user_events: Use long vs int for atomic bit ops
2023-05-29 07:20:13 -04:00
Linus Torvalds ac2263b588 Revert "module: error out early on concurrent load of the same module file"
This reverts commit 9828ed3f69.

Sadly, it does seem to cause failures to load modules. Johan Hovold reports:

 "This change breaks module loading during boot on the Lenovo Thinkpad
  X13s (aarch64).

  Specifically it results in indefinite probe deferral of the display
  and USB (ethernet) which makes it a pain to debug. Typing in the dark
  to acquire some logs reveals that other modules are missing as well"

Since this was applied late as a "let's try this", I'm reverting it
asap, and we can try to figure out what goes wrong later.  The excessive
parallel module loading problem is annoying, but not noticeable in
normal situations, and this was only meant as an optimistic workaround
for a user-space bug.

One possible solution may be to do the optimistic exclusive open first,
and then use a lock to serialize loading if that fails.

Reported-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/lkml/ZHRpH-JXAxA6DnzR@hovoldconsulting.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-29 06:40:33 -04:00
Steven Rostedt (Google) a2d910f022 tracing: Have function_graph selftest call cond_resched()
When all kernel debugging is enabled (lockdep, KSAN, etc), the function
graph enabling and disabling can take several seconds to complete. The
function_graph selftest enables and disables function graph tracing
several times. With full debugging enabled, the soft lockup watchdog was
triggering because the selftest was running without ever scheduling.

Add cond_resched() throughout the test to make sure it does not trigger
the soft lockup detector.

Link: https://lkml.kernel.org/r/20230528051742.1325503-6-rostedt@goodmis.org

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-28 21:15:46 -04:00
Steven Rostedt (Google) ac9d2cb1d5 tracing: Only make selftest conditionals affect the global_trace
The tracing_selftest_running and tracing_selftest_disabled variables were
to keep trace_printk() and other writes from affecting the tracing
selftests, as the tracing selftests would examine the ring buffer to see
if it contained what it expected or not. trace_printk() and friends could
add to the ring buffer and cause the selftests to fail (and then disable
the tracer that was being tested). To keep that from happening, these
variables were added and would keep trace_printk() and friends from
writing to the ring buffer while the tests were going on.

But this was only the top level ring buffer (owned by the global_trace
instance). There is no reason to prevent writing into ring buffers of
other instances via the trace_array_printk() and friends. For the
functions that could be used by other instances, check if the global_trace
is the tracer instance that is being written to before deciding to not
allow the write.

Link: https://lkml.kernel.org/r/20230528051742.1325503-5-rostedt@goodmis.org

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-28 21:15:33 -04:00
Steven Rostedt (Google) a3ae76d7ff tracing: Make tracing_selftest_running/delete nops when not used
There's no reason to test the condition variables tracing_selftest_running
or tracing_selftest_delete when tracing selftests are not enabled. Make
them define 0s when not the selftests are not configured in.

Link: https://lkml.kernel.org/r/20230528051742.1325503-4-rostedt@goodmis.org

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-28 21:15:26 -04:00
Steven Rostedt (Google) 9da705d432 tracing: Have tracer selftests call cond_resched() before running
As there are more and more internal selftests being added to the Linux
kernel (KSAN, lockdep, etc) the selftests are taking longer to run when
these are enabled. Add a cond_resched() to the calling of
do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set,
otherwise the soft lockup watchdog may trigger on boot up.

Link: https://lkml.kernel.org/r/20230528051742.1325503-3-rostedt@goodmis.org

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-28 21:15:17 -04:00
Steven Rostedt (Google) e8352cf577 tracing: Move setting of tracing_selftest_running out of register_tracer()
The variables tracing_selftest_running and tracing_selftest_disabled are
only used for when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only
visible within the selftest code. The setting of those variables are in
the register_tracer() call, and set in a location where they do not need
to be. Create a wrapper around run_tracer_selftest() called
do_run_tracer_selftest() which sets those variables, and have
register_tracer() call that instead.

Having those variables only set within the CONFIG_FTRACE_STARTUP_TEST
scope gets rid of them (and also the ability to remove testing against
them) when the startup tests are not enabled (most cases).

Link: https://lkml.kernel.org/r/20230528051742.1325503-2-rostedt@goodmis.org

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-28 21:15:04 -04:00
Linus Torvalds d8f14b84fe Two fixes for debugobjects:
- Prevent that the allocation path wakes up kswapd. That's a long
     standing issue due to the GFP_ATOMIC allocation flag. As debug objects
     can be invoked from pretty much any context waking kswapd can end up
     in arbitrary lock chains versus the waitqueue lock.
 
   - Correct the explicit lockdep wait-type violation in
     debug_object_fill_pool().
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzCBQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoa8FD/sFaHGSVtNTYgkV75umETMWbx+nR0Sp
 Y/i62MswIWU/DWmD9IKaBxlHpBByHgopBAozDnUix6RfQvf8V/GSU6PWa9HAR2QH
 rYwQCN/2/e8yQNAFv+9AiYGzPU3fRI/z7rYgfhhiWoLjivMFUCXypjBG0BAiCBxC
 pYKZDMhBeySIUjtEL6xjcflA8XXKuLUPGy1WeKBxRgJeNvM0GlbifNXoy0JaXBso
 NK+1FOG7zm05r2RqZjN0rAVRrrdgA4JYygpYC8YmzePoFQVXLeUnlbjjW9uYX+hz
 MoLuVeF+rKk9NHNu3NoD4kFgrNp3NXAAAzH1MJwIADy9THtsyWAeEgyUkkie9aiX
 Oa8eSjpJQjUv5h+VRKpMhh2RAAAhCYDuX/QC2FLImLy+GRF3dMhsAmuYgKXN2kHa
 CFkM84vStMiMVxKhwtLpxVE7VOrxzXxbqMO65kMrCXYxK1SfKtEZr8FrORvUjU7G
 MmH+D9sB034nkCBU+oGMsMYAAzB4rLp5Cw9qqvwWLfJvWLcUoPxjgUV6hLR6mNXx
 6+2133Tf68Fz4TgyEDN9XhQ7QEsKKGTTDMJ5JYolnrRe54sUJSsX+44khrbocSde
 WcEfcwhR+mjDDx0eVB2oT9bedxMf639mqPNn//EqJkzS4s+sECC8OiHbdvL3ArUq
 S92nrMxvyMB42Q==
 =7B4m
 -----END PGP SIGNATURE-----

Merge tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects fixes from Thomas Gleixner:
 "Two fixes for debugobjects:

   - Prevent the allocation path from waking up kswapd.

     That's a long standing issue due to the GFP_ATOMIC allocation flag.
     As debug objects can be invoked from pretty much any context waking
     kswapd can end up in arbitrary lock chains versus the waitqueue
     lock

   - Correct the explicit lockdep wait-type violation in
     debug_object_fill_pool()"

* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Don't wake up kswapd from fill_pool()
  debugobjects,locking: Annotate debug_object_fill_pool() wait type violation
2023-05-28 07:15:33 -04:00
Masahiro Yamada 49c386ebbb Revert "kheaders: substituting --sort in archive creation"
This reverts commit 700dea5a0b.

The reason for that commit was --sort=ORDER introduced in
tar 1.28 (2014). More than 3 years have passed since then.

Requiring GNU tar 1.28 should be fine now because we require
GCC 5.1 (2015).

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2023-05-28 16:20:21 +09:00
Linus Torvalds 4e893b5aa4 xen: branch for v6.4-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZHGVRAAKCRCAXGG7T9hj
 vqtJAQDizKasLE7tSnfs/FrZ/4xPaDLe3bQifMx2C1dtYCjRcAD/ciZSa1L0WzZP
 dpEZnlYRzsR3bwLktQEMQFOvlbh1SwE=
 =K860
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - a double free fix in the Xen pvcalls backend driver

 - a fix for a regression causing the MSI related sysfs entries to not
   being created in Xen PV guests

 - a fix in the Xen blkfront driver for handling insane input data
   better

* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/pci/xen: populate MSI sysfs entries
  xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
  xen/blkfront: Only check REQ_FUA for writes
2023-05-27 09:42:56 -07:00
Jakub Kicinski 75455b906d bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZHEm+wAKCRDbK58LschI
 gyIKAQCqO7B4sIu8hYVxBTwfHV2tIuXSMSCV4P9e78NUOPcO2QEAvLP/WVSjB0Bm
 vpyTKKM22SpZvPe/jSp52j6t20N+qAc=
 =HFxD
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-26

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).

The main changes are:

1) Add the capability to destroy sockets in BPF through a new kfunc,
   from Aditi Ghag.

2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands,
   from Andrii Nakryiko.

3) Add capability for libbpf to resize datasec maps when backed via mmap,
   from JP Kobryn.

4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod,
   from Jiri Olsa.

5) Big batch of xsk selftest improvements to prep for multi-buffer testing,
   from Magnus Karlsson.

6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it
   via bpftool, from Yafang Shao.

7) Various misc BPF selftest improvements to work with upcoming LLVM 17,
   from Yonghong Song.

8) Extend bpftool to specify netdevice for resolving XDP hints,
   from Larysa Zaremba.

9) Document masking in shift operations for the insn set document,
   from Dave Thaler.

10) Extend BPF selftests to check xdp_feature support for bond driver,
    from Lorenzo Bianconi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  bpf: Fix bad unlock balance on freeze_mutex
  libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
  libbpf: Ensure libbpf always opens files with O_CLOEXEC
  selftests/bpf: Check whether to run selftest
  libbpf: Change var type in datasec resize func
  bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
  libbpf: Selftests for resizing datasec maps
  libbpf: Add capability for resizing datasec maps
  selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
  bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: Start v1.3 development cycle
  bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
  bpftool: Specify XDP Hints ifname when loading program
  selftests/bpf: Add xdp_feature selftest for bond device
  selftests/bpf: Test bpf_sock_destroy
  selftests/bpf: Add helper to get port using getsockname
  bpf: Add bpf_sock_destroy kfunc
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: udp: Implement batching for sockets iterator
  ...
====================

Link: https://lore.kernel.org/r/20230526222747.17775-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-26 17:26:01 -07:00
Maninder Singh 4f521bab5b kallsyms: remove unsed API lookup_symbol_attrs
with commit '7878c231dae0 ("slab: remove /proc/slab_allocators")'
lookup_symbol_attrs usage is removed.

Thus removing redundant API.

Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-26 15:10:18 -07:00
Azeem Shaikh c7dce4c5d9 tracing: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().

No return values were used, so direct replacement with strlcpy is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230516143956.1367827-1-azeemshaikh38@gmail.com
2023-05-26 13:52:19 -07:00
Arnd Bergmann 15d5daa0a7 kallsyms: remove unused arch_get_kallsym() helper
The arch_get_kallsym() function was introduced so that x86 could override
it, but that override was removed in bf904d2762 ("x86/pti/64: Remove
the SYSCALL64 entry trampoline"), so now this does nothing except causing
a warning about a missing prototype:

kernel/kallsyms.c:662:12: error: no previous prototype for 'arch_get_kallsym' [-Werror=missing-prototypes]
  662 | int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,

Restore the old behavior before d83212d5dd ("kallsyms, x86: Export
addresses of PTI entry trampolines") to simplify the code and avoid
the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
[mcgrof: fold in bpf selftest fix]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-26 10:38:23 -07:00
Vlastimil Babka eb07c4f39c mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED
As discussed at LSF/MM [1] [2] and with no objections raised there,
deprecate the SLAB allocator. Rename the user-visible option so that
users with CONFIG_SLAB=y get a new prompt with explanation during make
oldconfig, while make olddefconfig will just switch to SLUB.

In all defconfigs with CONFIG_SLAB=y remove the line so those also
switch to SLUB. Regressions due to the switch should be reported to
linux-mm and slab maintainers.

[1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737e5f@suse.cz/
[2] https://lwn.net/Articles/932201/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Helge Deller <deller@gmx.de> # parisc
2023-05-26 19:01:47 +02:00
Daniel Borkmann 4266f41fea bpf: Fix bad unlock balance on freeze_mutex
Commit c4c84f6fb2 ("bpf: drop unnecessary bpf_capable() check in
BPF_MAP_FREEZE command") moved the permissions check outside of the
freeze_mutex in the map_freeze() handler. The error paths still jumps
to the err_put which tries to unlock the freeze_mutex even though it
was not locked in the first place. Fix it.

Fixes: c4c84f6fb2 ("bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command")
Reported-by: syzbot+8982e75c2878b9ffeac5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2023-05-26 12:16:12 +02:00
Jakub Kicinski d6f1e0bfe5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 20:56:43 -07:00
Jakub Kicinski d4031ec844 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/raw.c
  3632679d9e ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
  c85be08fc4 ("raw: Stop using RTO_ONLINK.")
https://lore.kernel.org/all/20230525110037.2b532b83@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/freescale/fec_main.c
  9025944fdd ("net: fec: add dma_wmb to ensure correct descriptor values")
  144470c88c ("net: fec: using the standard return codes when xdp xmit errors")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 19:57:39 -07:00
Linus Torvalds 9828ed3f69 module: error out early on concurrent load of the same module file
It turns out that udev under certain circumstances will concurrently try
to load the same modules over-and-over excessively.  This isn't a kernel
bug, but it ends up affecting the kernel, to the point that under
certain circumstances we can fail to boot, because the kernel uses a lot
of memory to read all the module data all at once.

Note that it isn't a memory leak, it's just basically a thundering herd
problem happening at bootup with a lot of CPUs, with the worst cases
then being pretty bad.

Admittedly the worst situations are somewhat contrived: lots and lots of
CPUs, not a lot of memory, and KASAN enabled to make it all slower and
as such (unintentionally) exacerbate the problem.

Luis explains: [1]

 "My best assessment of the situation is that each CPU in udev ends up
  triggering a load of duplicate set of modules, not just one, but *a
  lot*. Not sure what heuristics udev uses to load a set of modules per
  CPU."

Petr Pavlu chimes in: [2]

 "My understanding is that udev workers are forked. An initial kmod
  context is created by the main udevd process but no sharing happens
  after the fork. It means that the mentioned memory pool logic doesn't
  really kick in.

  Multiple parallel load requests come from multiple udev workers, for
  instance, each handling an udev event for one CPU device and making
  the exactly same requests as all others are doing at the same time.

  The optimization idea would be to recognize these duplicate requests
  at the udevd/kmod level and converge them"

Note that module loading has tried to mitigate this issue before, see
for example commit 064f4536d1 ("module: avoid allocation if module is
already present and ready"), which has a few ASCII graphs on memory use
due to this same issue.

However, while that noticed that the module was already loaded, and
exited with an error early before spending any more time on setting up
the module, it didn't handle the case of multiple concurrent module
loads all being active - but not complete - at the same time.

Yes, one of them will eventually win the race and finalize its copy, and
the others will then notice that the module already exists and error
out, but while this all happens, we have tons of unnecessary concurrent
work being done.

Again, the real fix is for udev to not do that (maybe it should use
threads instead of fork, and have actual shared data structures and not
cause duplicate work). That real fix is apparently not trivial.

But it turns out that the kernel already has a pretty good model for
dealing with concurrent access to the same file: the i_writecount of the
inode.

In fact, the module loading already indirectly uses 'i_writecount' ,
because 'kernel_file_read()' will in fact do

	ret = deny_write_access(file);
	if (ret)
		return ret;
	...
	allow_write_access(file);

around the read of the file data.  We do not allow concurrent writes to
the file, and return -ETXTBUSY if the file was open for writing at the
same time as the module data is loaded from it.

And the solution to the reader concurrency problem is to simply extend
this "no concurrent writers" logic to simply be "exclusive access".

Note that "exclusive" in this context isn't really some absolute thing:
it's only exclusion from writers and from other "special readers" that
do this writer denial.  So we simply introduce a variation of that
"deny_write_access()" logic that not only denies write access, but also
requires that this is the _only_ such access that denies write access.

Which means that you can't start loading a module that is already being
loaded as a module by somebody else, or you will get the same -ETXTBSY
error that you would get if there were writers around.

[ It also means that you can't try to load a currently executing
  executable as a module, for the same reason: executables do that same
  "deny_write_access()" thing, and that's obviously where the whole
  ETXTBSY logic traditionally came from.

  This is not a problem for kernel modules, since the set of normal
  executable files and kernel module files is entirely disjoint. ]

This new function is called "exclusive_deny_write_access()", and the
implementation is trivial, in that it's just an atomic decrement of
i_writecount if it was 0 before.

To use that new exclusivity check, all we then do is wrap the module
loading with that exclusive_deny_write_access()() / allow_write_access()
pair.  The actual patch is a bit bigger than that, because we want to
surround not just the "load file data" part, but the whole module setup,
to get maximum exclusion.

So this ends up splitting up "finit_module()" into a few helper
functions to make it all very clear and legible.

In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]),
the "wasted vmalloc" space (ie module data read into a vmalloc'ed area
in order to be loaded as a module, but then discarded because somebody
else loaded the same module instead) dropped from 1.8GiB to 474kB.  Yes,
that's gigabytes to kilobytes.

It doesn't drop completely to zero, because even with this change, you
can still end up having completely serial pointless module loads, where
one udev process has loaded a module fully (and thus the kernel has
released that exclusive lock on the module file), and then another udev
process tries to load the same module again.

So while we cannot fully get rid of the fundamental bug in user space,
we _can_ get rid of the excessive concurrent thundering herd effect.

A couple of final side notes on this all:

 - This tweak only affects the "finit_module()" system call, which gives
   the kernel a file descriptor with the module data.

   You can also just feed the module data as raw data from user space
   with "init_module()" (note the lack of 'f' at the beginning), and
   obviously for that case we do _not_ have any "exclusive read" logic.

   So if you absolutely want to do things wrong in user space, and try
   to load the same module multiple times, and error out only later when
   the kernel ends up saying "you can't load the same module name
   twice", you can still do that.

   And in fact, some distros will do exactly that, because they will
   uncompress the kernel module data in user space before feeding it to
   the kernel (mainly because they haven't started using the new kernel
   side decompression yet).

   So this is not some absolute "you can't do concurrent loads of the
   same module". It's literally just a very simple heuristic that will
   catch it early in case you try to load the exact same module file at
   the same time, and in that case avoid a potentially nasty situation.

 - There is another user of "deny_write_access()": the verity code that
   enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).

   If you use fs-verity and you care about verifying the kernel modules
   (which does make sense), you should do it *before* loading said
   kernel module. That may sound obvious, but now the implementation
   basically requires it. Because if you try to do it concurrently, the
   kernel may refuse to load the module file that is being set up by the
   fs-verity code.

 - This all will obviously mean that if you insist on loading the same
   module in parallel, only one module load will succeed, and the others
   will return with an error.

   That was true before too, but what is different is that the -ETXTBSY
   error can be returned *before* the success case of another process
   fully loading and instantiating the module.

   Again, that might sound obvious, and it is indeed the whole point of
   the whole change: we are much quicker to notice the whole "you're
   already in the process of loading this module".

   So it's very much intentional, but it does mean that if you just
   spray the kernel with "finit_module()", and expect that the module is
   immediately loaded afterwards without checking the return value, you
   are doing something horribly horribly wrong.

   I'd like to say that that would never happen, but the whole _reason_
   for this commit is that udev is currently doing something horribly
   horribly wrong, so ...

Link: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ [1]
Link: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ [2]
Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%2FAAUi5z@bombadil.infradead.org/ [3]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-25 17:07:57 -07:00
Zqiang 18c8ae8131 workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
for CPU-hogging per-cpu work item will keep triggering spuriously:

  workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND

This commit therefore disables the CPU-hog detection mechanism when
workqueue.cpu_intensive_thresh_us is set to 0.

tj: Patch description updated and the condition check on
    cpu_intensive_thresh_us separated into a separate if statement for
    readability.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-25 10:46:53 -10:00
Linus Torvalds 50fb587e6a Networking fixes for 6.4-rc4, including fixes from bluetooth and bpf
Current release - regressions:
 
   - net: fix skb leak in __skb_tstamp_tx()
 
   - eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
 
 Current release - new code bugs:
 
   - handshake:
     - fix sock->file allocation
     - fix handshake_dup() ref counting
 
   - bluetooth:
     - fix potential double free caused by hci_conn_unlink
     - fix UAF in hci_conn_hash_flush
 
 Previous releases - regressions:
 
   - core: fix stack overflow when LRO is disabled for virtual interfaces
 
   - tls: fix strparser rx issues
 
   - bpf:
     - fix many sockmap/TCP related issues
     - fix a memory leak in the LRU and LRU_PERCPU hash maps
     - init the offload table earlier
 
   - eth: mlx5e:
     - do as little as possible in napi poll when budget is 0
     - fix using eswitch mapping in nic mode
     - fix deadlock in tc route query code
 
 Previous releases - always broken:
 
   - udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()
 
   - raw: fix output xfrm lookup wrt protocol
 
   - smc: reset connection when trying to use SMCRv2 fails
 
   - phy: mscc: enable VSC8501/2 RGMII RX clock
 
   - eth: octeontx2-pf: fix TSOv6 offload
 
   - eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRvOisSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkMW8P/3rZy4Yy2bIWFCkxKD/aPvqG60ZZfvV/
 sB7Qu3X0OLiDNAmdDsXjCFeMYnV4cxDvwxjFUVQX0ZZEilEbGQ2XlOaFTpXS3jeW
 UQup55DW7VG6BkuNJipwtLkLSQ498Z+qinRPsmNPVADkItHHbyrSnKNjh34ruhly
 P5edWJ/3PuzoK2hN/izgBpk0i1UC1+tSKKANV5dlIWb6CXY9C8pvr0CScuGb5rKv
 xAs40Rp1eaFmkYkhbAn3H2fvSOoCr2aSDeS2SvRAxca9OUcrUAjnnsLTVq5WI22/
 PxSESy6wfE2e5+q1AwskwBdFO3LLKheVYJF2KzSlRk4FuWk50GbwbpueRSOYEU7b
 2w0MveYggr4m3B06/2esrsr6bEPsb4QFKE+hubX5FmIPECOz+dOA0RW4mOysvzqM
 q+xEuR9uWFsrMO7WVU7/4oF02HqAfAtaEn/87aniGz5o7bzPbmyyyBKfmb4s2c13
 TU828rEBNGkmqxSwsZHUOt21IJoOa646W99zsmGpRo/m47pFx093HVR22Hr1dH0B
 BllhsmtvJZ2XsWkR2Q9aAyyluc3/b3yI24OM125y7bIBWte2MF908xaStx/al+AF
 jPL/ioEQKNsOJKHan9EzhbyH98RCfEotLb+ha/qNQ9GGjKROHsTn9EgP7h7367oo
 yS8QLmvng01f
 =hz3D
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth and bpf.

  Current release - regressions:

   - net: fix skb leak in __skb_tstamp_tx()

   - eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs

  Current release - new code bugs:

   - handshake:
      - fix sock->file allocation
      - fix handshake_dup() ref counting

   - bluetooth:
      - fix potential double free caused by hci_conn_unlink
      - fix UAF in hci_conn_hash_flush

  Previous releases - regressions:

   - core: fix stack overflow when LRO is disabled for virtual
     interfaces

   - tls: fix strparser rx issues

   - bpf:
      - fix many sockmap/TCP related issues
      - fix a memory leak in the LRU and LRU_PERCPU hash maps
      - init the offload table earlier

   - eth: mlx5e:
      - do as little as possible in napi poll when budget is 0
      - fix using eswitch mapping in nic mode
      - fix deadlock in tc route query code

  Previous releases - always broken:

   - udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()

   - raw: fix output xfrm lookup wrt protocol

   - smc: reset connection when trying to use SMCRv2 fails

   - phy: mscc: enable VSC8501/2 RGMII RX clock

   - eth: octeontx2-pf: fix TSOv6 offload

   - eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"

* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
  net: phy: mscc: enable VSC8501/2 RGMII RX clock
  net: phy: mscc: remove unnecessary phydev locking
  net: phy: mscc: add support for VSC8501
  net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
  net/handshake: Enable the SNI extension to work properly
  net/handshake: Unpin sock->file if a handshake is cancelled
  net/handshake: handshake_genl_notify() shouldn't ignore @flags
  net/handshake: Fix uninitialized local variable
  net/handshake: Fix handshake_dup() ref counting
  net/handshake: Remove unneeded check from handshake_dup()
  ipv6: Fix out-of-bounds access in ipv6_find_tlv()
  net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
  docs: netdev: document the existence of the mail bot
  net: fix skb leak in __skb_tstamp_tx()
  r8169: Use a raw_spinlock_t for the register locks.
  page_pool: fix inconsistency for page_pool_ring_[un]lock()
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  ...
2023-05-25 10:55:26 -07:00
Andrii Nakryiko c4c84f6fb2 bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
Seems like that extra bpf_capable() check in BPF_MAP_FREEZE handler was
unintentionally left when we switched to a model that all BPF map
operations should be allowed regardless of CAP_BPF (or any other
capabilities), as long as process got BPF map FD somehow.

This patch replaces bpf_capable() check in BPF_MAP_FREEZE handler with
writeable access check, given conceptually freezing the map is modifying
it: map becomes unmodifiable for subsequent updates.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230524225421.1587859-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-25 10:08:20 -07:00
Jakub Kicinski 0c615f1cc3 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZG4AiAAKCRDbK58LschI
 g+xlAQCmefGbDuwPckZLnomvt6gl4bkIjs7kc1ySbG9QBnaInwD/WyrJaQIPijuD
 qziHPAyx+MEgPseFU1b7Le35SZ66IwM=
 =s4R1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-05-24

We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).

The main changes are:

1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
   from John Fastabend.

2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
   from Anton Protopopov.

3) Init the BPF offload table earlier than just late_initcall,
   from Jakub Kicinski.

4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
   from Will Deacon.

5) Remove a now unsupported __fallthrough in BPF samples,
   from Andrii Nakryiko.

6) Fix a typo in pkg-config call for building sign-file,
   from Jeremy Sowden.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
  bpf, sockmap: Build helper to create connected socket pair
  bpf, sockmap: Pull socket helpers out of listen test for general use
  bpf, sockmap: Incorrectly handling copied_seq
  bpf, sockmap: Wake up polling after data copy
  bpf, sockmap: TCP data stall on recv before accept
  bpf, sockmap: Handle fin correctly
  bpf, sockmap: Improved check for empty queue
  bpf, sockmap: Reschedule is now done through backlog
  bpf, sockmap: Convert schedule_work into delayed_work
  bpf, sockmap: Pass skb ownership through read_skb
  bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
  bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
  samples/bpf: Drop unnecessary fallthrough
  bpf: netdev: init the offload table earlier
  selftests/bpf: Fix pkg-config call building sign-file
====================

Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 21:57:57 -07:00
Xiu Jianfeng 659db0789c cgroup: Update out-of-date comment in cgroup_migrate()
Commit 674b745e22 ("cgroup: remove rcu_read_lock()/rcu_read_unlock()
in critical section of spin_lock_irq()") has removed the rcu_read_lock,
which makes the comment out-of-date, so update it.

tj: Updated the comment a bit.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24 12:22:07 -10:00
Zqiang c8f6219be2 workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
Currently, pool->nr_running can be modified from timer tick, that means the
timer tick can run nested inside a not-irq-protected section that's in the
process of modifying nr_running. Consider the following scenario:

CPU0
kworker/0:2 (events)
   worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
   ->pool->nr_running++;  (1)

   process_one_work()
   ->worker->current_func(work);
     ->schedule()
       ->wq_worker_sleeping()
         ->worker->sleeping = 1;
         ->pool->nr_running--;  (0)
           ....
       ->wq_worker_running()
               ....
               CPU0 by interrupt:
               wq_worker_tick()
               ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
                 ->pool->nr_running--;  (-1)
	         ->worker->flags |= WORKER_CPU_INTENSIVE;
               ....
         ->if (!(worker->flags & WORKER_NOT_RUNNING))
           ->pool->nr_running++;    (will not execute)
         ->worker->sleeping = 0;
         ....
    ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
      ->pool->nr_running++;  (0)
    ....
    worker_set_flags(worker, WORKER_PREP);
    ->pool->nr_running--;   (-1)
    ....
    worker_enter_idle()
    ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

if the nr_workers is equal to nr_idle, due to the nr_running is not zero,
will trigger WARN_ON_ONCE().

[    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[    2.462163] Modules linked in:
[    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[    2.465127] Workqueue:  0x0 (events)
[    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[    2.472614] Call Trace:
[    2.473152]  <TASK>
[    2.474182]  worker_thread+0x71/0x430
[    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
[    2.475263]  kthread+0x103/0x120
[    2.475493]  ? __pfx_worker_thread+0x10/0x10
[    2.476355]  ? __pfx_kthread+0x10/0x10
[    2.476635]  ret_from_fork+0x2c/0x50
[    2.477051]  </TASK>

This commit therefore add the check of worker->sleeping in wq_worker_tick(),
if the worker->sleeping is not zero, directly return.

tj: Updated comment and description.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24 11:57:26 -10:00
Wang Honghui 847aea98e0 PM: hibernate: Correct spelling mistake in a comment
Fix a typo in a comment in kernel/power/snapshot.c

Signed-off-by: Wang Honghui <honghui.wang@ucas.com.cn>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-05-24 19:00:50 +02:00
Maximilian Heyne 335b422346 x86/pci/xen: populate MSI sysfs entries
Commit bf5e758f02 ("genirq/msi: Simplify sysfs handling") reworked the
creation of sysfs entries for MSI IRQs. The creation used to be in
msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs.
Then it moved into __msi_domain_alloc_irqs which is an implementation of
domain_alloc_irqs. However, Xen comes with the only other implementation
of domain_alloc_irqs and hence doesn't run the sysfs population code
anymore.

Commit 6c796996ee ("x86/pci/xen: Fixup fallout from the PCI/MSI
overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info
but that doesn't actually have an effect because Xen uses it's own
domain_alloc_irqs implementation.

Fix this by making use of the fallback functions for sysfs population.

Fixes: bf5e758f02 ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-05-24 18:08:49 +02:00
David Howells 5bd4990f19 trace: Convert trace/seq to use copy_splice_read()
For the splice from the trace seq buffer, just use copy_splice_read().

In the future, something better can probably be done by gifting pages from
seq->buf into the pipe, but that would require changing seq->buf into a
vmap over an array of pages.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Masami Hiramatsu <mhiramat@kernel.org>
cc: linux-kernel@vger.kernel.org
cc: linux-trace-kernel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-27-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
Shanker Donthineni 721255b982 genirq: Use a maple tree for interrupt descriptor management
The current implementation uses a static bitmap for interrupt descriptor
allocation and a radix tree to pointer store the pointer for lookup.

However, the size of the bitmap is constrained by the build time macro
MAX_SPARSE_IRQS, which may not be sufficient to support high-end servers,
particularly those with GICv4.1 hardware, which require a large interrupt
space to cover LPIs and vSGIs.

Replace the bitmap and the radix tree with a maple tree, which not only
stores pointers for lookup, but also provides a mechanism to find free
ranges. That removes the build time hardcoded upper limit.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230519134902.1495562-4-sdonthineni@nvidia.com
2023-05-24 11:39:44 +02:00
Shanker Donthineni 5e630aa8d9 genirq: Encapsulate sparse bitmap handling
Move the open coded sparse bitmap handling into helper functions as
a preparatory step for converting the sparse interrupt management
to a maple tree.

No functional change.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230519134902.1495562-3-sdonthineni@nvidia.com
2023-05-24 11:39:44 +02:00
Shanker Donthineni bc06a9e087 genirq: Use hlist for managing resend handlers
The current implementation utilizes a bitmap for managing interrupt resend
handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
However, this method may not efficiently utilize memory during runtime,
particularly when IRQ_BITMAP_BITS is large.

Address this issue by using an hlist to manage interrupt resend handlers
instead of relying on a static bitmap memory allocation. Additionally, a
new function, clear_irq_resend(), is introduced and called from
irq_shutdown to ensure a graceful teardown of the interrupt.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230519134902.1495562-2-sdonthineni@nvidia.com
2023-05-24 11:39:44 +02:00
Sebastian Andrzej Siewior cb0b50b813 module: Remove preempt_disable() from module reference counting.
The preempt_disable() section in module_put() was added in commit
   e1783a240f ("module: Use this_cpu_xx to dynamically allocate counters")

while the per-CPU counter were switched to another API. The API requires
that during the RMW operation the CPU remained the same.

This counting API was later replaced with atomic_t in commit
   2f35c41f58 ("module: Replace module_ref with atomic_t refcnt")

Since this atomic_t replacement there is no need to keep preemption
disabled while the reference counter is modified.

Remove preempt_disable() from module_put(), __module_get() and
try_module_get().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-23 22:23:18 -07:00
Joel Granados 2f5edd03ca sysctl: Refactor base paths registrations
This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".

We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.

In order to quickly acertain that the paths did not actually change I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function                                     old     new   delta
sysctl_init_bases                             55     111     +56
init_fs_sysctls                               12      33     +21
vm_base_table                                128       -    -128
kernel_base_table                            128       -    -128
fs_base_table                                128       -    -128
dev_base_table                               128       -    -128
debug_base_table                             128       -    -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl()
 and add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christian Brauner <brauner@kernel.org>
2023-05-23 21:43:26 -07:00
Steven Rostedt (Google) 4b512860bd tracing: Rename stacktrace field to common_stacktrace
The histogram and synthetic events can use a pseudo event called
"stacktrace" that will create a stacktrace at the time of the event and
use it just like it was a normal field. We have other pseudo events such
as "common_cpu" and "common_timestamp". To stay consistent with that,
convert "stacktrace" to "common_stacktrace". As this was used in older
kernels, to keep backward compatibility, this will act just like
"common_cpu" did with "cpu". That is, "cpu" will be the same as
"common_cpu" unless the event has a "cpu" field. In which case, the
event's field is used. The same is true with "stacktrace".

Also update the documentation to reflect this change.

Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 23:38:23 -04:00
Steven Rostedt (Google) e30fbc618e tracing/histograms: Allow variables to have some modifiers
Modifiers are used to change the behavior of keys. For instance, they
can grouped into buckets, converted to syscall names (from the syscall
identifier), show task->comm of the current pid, be an array of longs
that represent a stacktrace, and more.

It was found that nothing stopped a value from taking a modifier. As
values are simple counters. If this happened, it would call code that
was not expecting a modifier and crash the kernel. This was fixed by
having the ___create_val_field() function test if a modifier was present
and fail if one was. This fixed the crash.

Now there's a problem with variables. Variables are used to pass fields
from one event to another. Variables are allowed to have some modifiers,
as the processing may need to happen at the time of the event (like
stacktraces and comm names of the current pid). The issue is that it too
uses __create_val_field(). Now that fails on modifiers, variables can no
longer use them (this is a regression).

As not all modifiers are for variables, have them use a separate check.

Link: https://lore.kernel.org/linux-trace-kernel/20230523221108.064a5d82@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: e0213434fe ("tracing: Do not let histogram values have some modifiers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 22:56:55 -04:00
Beau Belgrave ff9e1632d6 tracing/user_events: Document user_event_mm one-shot list usage
During 6.4 development it became clear that the one-shot list used by
the user_event_mm's next field was confusing to others. It is not clear
how this list is protected or what the next field usage is for unless
you are familiar with the code.

Add comments into the user_event_mm struct indicating lock requirement
and usage. Also document how and why this approach was used via comments
in both user_event_enabler_update() and user_event_mm_get_all() and the
rules to properly use it.

Link: https://lkml.kernel.org/r/20230519230741.669-5-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 21:08:33 -04:00
Beau Belgrave dcbd1ac266 tracing/user_events: Rename link fields for clarity
Currently most list_head fields of various structs within user_events
are simply named link. This causes folks to keep additional context in
their head when working with the code, which can be confusing.

Instead of using link, describe what the actual link is, for example:
list_del_rcu(&mm->link);

Changes into:
list_del_rcu(&mm->mms_link);

The reader now is given a hint the link is to the mms global list
instead of having to remember or spot check within the code.

Link: https://lkml.kernel.org/r/20230519230741.669-4-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 21:08:33 -04:00
Linus Torvalds aaecdaf922 tracing/user_events: Remove RCU lock while pinning pages
pin_user_pages_remote() can reschedule which means we cannot hold any
RCU lock while using it. Now that enablers are not exposed out to the
tracing register callbacks during fork(), there is clearly no need to
require the RCU lock as event_mutex is enough to protect changes.

Remove unneeded RCU usages when pinning pages and walking enablers with
event_mutex held. Cleanup a misleading "safe" list walk that is not
needed. During fork() duplication, remove unneeded RCU list add, since
the list is not exposed yet.

Link: https://lkml.kernel.org/r/20230519230741.669-3-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wiiBfT4zNS29jA0XEsy8EmbqTH1hAPdRJCDAJMD8Gxt5A@mail.gmail.com/

Fixes: 7235759084 ("tracing/user_events: Use remote writes for event enablement")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 21:05:04 -04:00
Linus Torvalds 3e0fea09b1 tracing/user_events: Split up mm alloc and attach
When a new mm is being created in a fork() path it currently is
allocated and then attached in one go. This leaves the mm exposed out to
the tracing register callbacks while any parent enabler locations are
copied in. This should not happen.

Split up mm alloc and attach as unique operations. When duplicating
enablers, first alloc, then duplicate, and only upon success, attach.
This prevents any timing window outside of the event_reg mutex for
enablement walking. This allows for dropping RCU requirement for
enablement walking in later patches.

Link: https://lkml.kernel.org/r/20230519230741.669-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=whTBvXJuoi_kACo3qi5WZUmRrhyA-_=rRFsycTytmB6qw@mail.gmail.com/

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 20:58:17 -04:00
Andrii Nakryiko cb8edce280 bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23 23:31:42 +02:00
Thomas Gleixner 06c6796e03 cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up
to a caller provided upper limit.

The limit variable is decremented and checked for 0 before invoking
cpu_up(), which is obviously off by one and prevents the bringup of the
last CPU when the limit is equal to the number of present CPUs.

Move the decrement and check after the cpu_up() invocation.

Fixes: 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx
2023-05-23 18:06:40 +02:00
Daniel Bristot de Oliveira 632478a058 tracing/timerlat: Always wakeup the timerlat thread
While testing rtla timerlat auto analysis, I reach a condition where
the interface was not receiving tracing data. I was able to manually
reproduce the problem with these steps:

  # echo 0 > tracing_on                 # disable trace
  # echo 1 > osnoise/stop_tracing_us    # stop trace if timerlat irq > 1 us
  # echo timerlat > current_tracer      # enable timerlat tracer
  # sleep 1                             # wait... that is the time when rtla
                                        # apply configs like prio or cgroup
  # echo 1 > tracing_on                 # start tracing
  # cat trace
  # tracer: timerlat
  #
  #                                _-----=> irqs-off
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #                              |||||            ACTIVATION
  #           TASK-PID      CPU# |||||   TIMESTAMP   ID            CONTEXT                 LATENCY
  #              | |         |   |||||      |         |                  |                       |
        NOTHING!

Then, trying to enable tracing again with echo 1 > tracing_on resulted
in no change: the trace was still not tracing.

This problem happens because the timerlat IRQ hits the stop tracing
condition while tracing is off, and do not wake up the timerlat thread,
so the timerlat threads are kept sleeping forever, resulting in no
trace, even after re-enabling the tracer.

Avoid this condition by always waking up the threads, even after stopping
tracing, allowing the tracer to return to its normal operating after
a new tracing on.

Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a955d7eac1 ("trace: Add timerlat tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 11:54:31 -04:00
Andrii Nakryiko e7d85427ef bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
Do a sanity check whether provided file-to-be-pinned is actually a BPF
object (prog, map, btf) before calling security_path_mknod LSM hook. If
it's not, LSM hook doesn't have to be triggered, as the operation has no
chance of succeeding anyways.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230522232917.2454595-2-andrii@kernel.org
2023-05-23 16:56:37 +02:00
Beau Belgrave ee7751b564 tracing/user_events: Use long vs int for atomic bit ops
Each event stores a int to track which bit to set/clear when enablement
changes. On big endian 64-bit configurations, it's possible this could
cause memory corruption when it's used for atomic bit operations.

Use unsigned long for enablement values to ensure any possible
corruption cannot occur. Downcast to int after mask for the bit target.

Link: https://lore.kernel.org/all/6f758683-4e5e-41c3-9b05-9efc703e827c@kili.mountain/
Link: https://lore.kernel.org/linux-trace-kernel/20230505205855.6407-1-beaub@linux.microsoft.com

Fixes: dcb8177c13 ("tracing/user_events: Add ioctl for disabling addresses")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 10:40:34 -04:00
John Sperbeck 2bd1103392 cgroup: always put cset in cgroup_css_set_put_fork
A successful call to cgroup_css_set_fork() will always have taken
a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always
do a corresponding put in cgroup_css_set_put_fork().

Without this, a cset and its contained css structures will be
leaked for some fork failures.  The following script reproduces
the leak for a fork failure due to exceeding pids.max in the
pids controller.  A similar thing can happen if we jump to the
bad_fork_cancel_cgroup label in copy_process().

[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
PID_ROOT=$1
CGROUP=$PID_ROOT/foo

[ -e $CGROUP ] && rmdir -f $CGROUP
mkdir $CGROUP
echo 5 > $CGROUP/pids.max
echo $$ > $CGROUP/cgroup.procs

fork_bomb()
{
	set -e
	for i in $(seq 10); do
		/bin/sleep 3600 &
	done
}

(fork_bomb) &
wait
echo $$ > $PID_ROOT/cgroup.procs
kill $(cat $CGROUP/cgroup.procs)
rmdir $CGROUP

Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22 13:59:37 -10:00
Harshit Mogalapalli d36f6efbe0 module: Fix use-after-free bug in read_file_mod_stats()
Smatch warns:
	kernel/module/stats.c:394 read_file_mod_stats()
	warn: passing freed memory 'buf'

We are passing 'buf' to simple_read_from_buffer() after freeing it.

Fix this by changing the order of 'simple_read_from_buffer' and 'kfree'.

Fixes: df3e764d8e ("module: add debug stats to help identify memory pressure")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-22 14:13:13 -07:00
Azeem Shaikh c33080cdc0 cgroup: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22 09:05:57 -10:00
Qi Zheng ab1de7ead8 cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
The commit 4f7e723643 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock()
deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and
cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing
cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks()
was missed and not handled, which will cause th following warning:

 WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: events cpuset_hotplug_workfn
 RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
 <...>
 Call Trace:
  <TASK>
  cpuset_attach+0x40/0x240
  cgroup_migrate_execute+0x452/0x5e0
  ? _raw_spin_unlock_irq+0x28/0x40
  cgroup_transfer_tasks+0x1f3/0x360
  ? find_held_lock+0x32/0x90
  ? cpuset_hotplug_workfn+0xc81/0xed0
  cpuset_hotplug_workfn+0xcb1/0xed0
  ? process_one_work+0x248/0x5b0
  process_one_work+0x2b9/0x5b0
  worker_thread+0x56/0x3b0
  ? process_one_work+0x5b0/0x5b0
  kthread+0xf1/0x120
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

So just use the cgroup_attach_{lock,unlock}() helper to fix it.

Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Fixes: 05c7b7a92c ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22 08:42:14 -10:00
Gaosheng Cui a495108ea9 capability: fix kernel-doc warnings in capability.c
Fix all kernel-doc warnings in capability.c:

kernel/capability.c:477: warning: Function parameter or member 'idmap'
not described in 'privileged_wrt_inode_uidgid'
kernel/capability.c:493: warning: Function parameter or member 'idmap'
not described in 'capable_wrt_inode_uidgid'

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-05-22 14:30:52 -04:00