Commit Graph

40775 Commits

Author SHA1 Message Date
Kumar Kartikeya Dwivedi 76d16077be bpf: Use memmove for bpf_dynptr_{read,write}
It may happen that destination buffer memory overlaps with memory dynptr
points to. Hence, we must use memmove to correctly copy from dynptr to
destination buffer, or source buffer to dynptr.

This actually isn't a problem right now, as memcpy implementation falls
back to memmove on detecting overlap and warns about it, but we
shouldn't be relying on that.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:28 -08:00
Kumar Kartikeya Dwivedi f6ee298fa1 bpf: Move PTR_TO_STACK alignment check to process_dynptr_func
After previous commit, we are minimizing helper specific assumptions
from check_func_arg_reg_off, making it generic, and offloading checks
for a specific argument type to their respective functions called after
check_func_arg_reg_off has been called.

This allows relying on a consistent set of guarantees after that call
and then relying on them in code that deals with registers for each
argument type later. This is in line with how process_spin_lock,
process_timer_func, process_kptr_func check reg->var_off to be constant.
The same reasoning is used here to move the alignment check into
process_dynptr_func. Note that it also needs to check for constant
var_off, and accumulate the constant var_off when computing the spi in
get_spi, but that fix will come in later changes.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:28 -08:00
Kumar Kartikeya Dwivedi 184c9bdb8f bpf: Rework check_func_arg_reg_off
While check_func_arg_reg_off is the place which performs generic checks
needed by various candidates of reg->type, there is some handling for
special cases, like ARG_PTR_TO_DYNPTR, OBJ_RELEASE, and
ARG_PTR_TO_RINGBUF_MEM.

This commit aims to streamline these special cases and instead leave
other things up to argument type specific code to handle. The function
will be restrictive by default, and cover all possible cases when
OBJ_RELEASE is set, without having to update the function again (and
missing to do that being a bug).

This is done primarily for two reasons: associating back reg->type to
its argument leaves room for the list getting out of sync when a new
reg->type is supported by an arg_type.

The other case is ARG_PTR_TO_RINGBUF_MEM. The problem there is something
we already handle, whenever a release argument is expected, it should
be passed as the pointer that was received from the acquire function.
Hence zero fixed and variable offset.

There is nothing special about ARG_PTR_TO_RINGBUF_MEM, where technically
its target register type PTR_TO_MEM | MEM_RINGBUF can already be passed
with non-zero offset to other helper functions, which makes sense.

Hence, lift the arg_type_is_release check for reg->off and cover all
possible register types, instead of duplicating the same kind of check
twice for current OBJ_RELEASE arg_types (alloc_mem and ptr_to_btf_id).

For the release argument, arg_type_is_dynptr is the special case, where
we go to actual object being freed through the dynptr, so the offset of
the pointer still needs to allow fixed and variable offset and
process_dynptr_func will verify them later for the release argument case
as well.

This is not specific to ARG_PTR_TO_DYNPTR though, we will need to make
this exception for any future object on the stack that needs to be
released. In this sense, PTR_TO_STACK as a candidate for object on stack
argument is a special case for release offset checks, and they need to
be done by the helper releasing the object on stack.

Since the check has been lifted above all register type checks, remove
the duplicated check that is being done for PTR_TO_BTF_ID.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:06 -08:00
Kumar Kartikeya Dwivedi 2706053173 bpf: Rework process_dynptr_func
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type
for use in callback state, because in case of user ringbuf helpers,
there is no dynptr on the stack that is passed into the callback. To
reflect such a state, a special register type was created.

However, some checks have been bypassed incorrectly during the addition
of this feature. First, for arg_type with MEM_UNINIT flag which
initialize a dynptr, they must be rejected for such register type.
Secondly, in the future, there are plans to add dynptr helpers that
operate on the dynptr itself and may change its offset and other
properties.

In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed
to such helpers, however the current code simply returns 0.

The rejection for helpers that release the dynptr is already handled.

For fixing this, we take a step back and rework existing code in a way
that will allow fitting in all classes of helpers and have a coherent
model for dealing with the variety of use cases in which dynptr is used.

First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together
with a DYNPTR_TYPE_* constant that denotes the only type it accepts.

Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this
fact. To make the distinction clear, use MEM_RDONLY flag to indicate
that the helper only operates on the memory pointed to by the dynptr,
not the dynptr itself. In C parlance, it would be equivalent to taking
the dynptr as a point to const argument.

When either of these flags are not present, the helper is allowed to
mutate both the dynptr itself and also the memory it points to.
Currently, the read only status of the memory is not tracked in the
dynptr, but it would be trivial to add this support inside dynptr state
of the register.

With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to
better reflect its usage, it can no longer be passed to helpers that
initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr.

A note to reviewers is that in code that does mark_stack_slots_dynptr,
and unmark_stack_slots_dynptr, we implicitly rely on the fact that
PTR_TO_STACK reg is the only case that can reach that code path, as one
cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In
both cases such helpers won't be setting that flag.

The next patch will add a couple of selftest cases to make sure this
doesn't break.

Fixes: 2057156738 ("bpf: Add bpf_user_ringbuf_drain() helper")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Kumar Kartikeya Dwivedi ac50fe51ce bpf: Propagate errors from process_* checks in check_func_arg
Currently, we simply ignore the errors in process_spin_lock,
process_timer_func, process_kptr_func, process_dynptr_func. Instead,
bubble up the error by storing and checking err variable.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Kumar Kartikeya Dwivedi 6b75bd3d03 bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func
ARG_PTR_TO_DYNPTR is akin to ARG_PTR_TO_TIMER, ARG_PTR_TO_KPTR, where
the underlying register type is subjected to more special checks to
determine the type of object represented by the pointer and its state
consistency.

Move dynptr checks to their own 'process_dynptr_func' function so that
is consistent and in-line with existing code. This also makes it easier
to reuse this code for kfunc handling.

Then, reuse this consolidated function in kfunc dynptr handling too.
Note that for kfuncs, the arg_type constraint of DYNPTR_TYPE_LOCAL has
been lifted.

Acked-by: David Vernet <void@manifault.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Jakub Kicinski 837e8ac871 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08 18:19:59 -08:00
Hou Tao 822ed78fab bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
If there are pending rcu callback, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callback.

If rcu_trace_implies_rcu_gp() is true, there will be no pending
__free_rcu(), so it will be OK to skip rcu_barrier() as well.

Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 17:50:17 -08:00
Hou Tao 0893d6007d bpf: Reuse freed element in free_by_rcu during allocation
When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be
indirectly moved into waiting_for_gp list through free_by_rcu list.
After call_rcu_in_progress becomes false again, the remaining elements
in free_by_rcu list will be moved to waiting_for_gp list by the next
invocation of free_bulk(). However if the expiration of RCU tasks trace
grace period is relatively slow, none element in free_by_rcu list will
be moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether or not there is
freed element in free_by_rcu list and reuse it if available.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 17:50:17 -08:00
Tejun Heo fbf8321238 memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call.  As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file.  Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a8747 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses.  The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently
dropped the file type check with it allowing any file to slip through.
With the invarients broken, the d_name and parent accesses can now race
against renames and removals of arbitrary files and cause
use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft().  Now
that cgroupfs is implemented through kernfs, checking the file
operations needs to go through a layer of indirection.  Instead, let's
check the superblock and dentry type.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 347c4a8747 ("memcg: remove cgroup_event->cft")
Cc: stable@kernel.org # v3.14+
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-08 10:40:58 -08:00
Yang Jihong c2cc0ce72a bpf: Fix comment error in fixup_kfunc_call function
insn->imm for kfunc is the relative address of __bpf_call_base,
instead of __bpf_base_call, Fix the comment error.

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221208013724.257848-1-yangjihong1@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 10:05:05 -08:00
Björn Töpel d35af0a7fe bpf: Do not zero-extend kfunc return values
In BPF all global functions, and BPF helpers return a 64-bit
value. For kfunc calls, this is not the case, and they can return
e.g. 32-bit values.

The return register R0 for kfuncs calls can therefore be marked as
subreg_def != DEF_NOT_SUBREG. In general, if a register is marked with
subreg_def != DEF_NOT_SUBREG, some archs (where bpf_jit_needs_zext()
returns true) require the verifier to insert explicit zero-extension
instructions.

For kfuncs calls, however, the caller should do sign/zero extension
for return values. In other words, the compiler is responsible to
insert proper instructions, not the verifier.

An example, provided by Yonghong Song:

$ cat t.c
extern unsigned foo(void);
unsigned bar1(void) {
     return foo();
}
unsigned bar2(void) {
     if (foo()) return 10; else return 20;
}

$ clang -target bpf -mcpu=v3 -O2 -c t.c && llvm-objdump -d t.o
t.o:    file format elf64-bpf

Disassembly of section .text:

0000000000000000 <bar1>:
	0:       85 10 00 00 ff ff ff ff call -0x1
	1:       95 00 00 00 00 00 00 00 exit

0000000000000010 <bar2>:
	2:       85 10 00 00 ff ff ff ff call -0x1
	3:       bc 01 00 00 00 00 00 00 w1 = w0
	4:       b4 00 00 00 14 00 00 00 w0 = 0x14
	5:       16 01 01 00 00 00 00 00 if w1 == 0x0 goto +0x1 <LBB1_2>
	6:       b4 00 00 00 0a 00 00 00 w0 = 0xa

0000000000000038 <LBB1_2>:
	7:       95 00 00 00 00 00 00 00 exit

If the return value of 'foo()' is used in the BPF program, the proper
zero-extension will be done.

Currently, the verifier correctly marks, say, a 32-bit return value as
subreg_def != DEF_NOT_SUBREG, but will fail performing the actual
zero-extension, due to a verifier bug in
opt_subreg_zext_lo32_rnd_hi32(). load_reg is not properly set to R0,
and the following path will be taken:

		if (WARN_ON(load_reg == -1)) {
			verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n");
			return -EFAULT;
		}

A longer discussion from v1 can be found in the link below.

Correct the verifier by avoiding doing explicit zero-extension of R0
for kfunc calls. Note that R0 will still be marked as a sub-register
for return values smaller than 64-bit.

Fixes: 83a2881903 ("bpf: Account for BPF_FETCH in insn_has_def32()")
Link: https://lore.kernel.org/bpf/20221202103620.1915679-1-bjorn@kernel.org/
Suggested-by: Yonghong Song <yhs@meta.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221207103540.396496-1-bjorn@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 09:59:27 -08:00
Yang Jihong f596da3efa blktrace: Fix output non-blktrace event when blk_classic option enabled
When the blk_classic option is enabled, non-blktrace events must be
filtered out. Otherwise, events of other types are output in the blktrace
classic format, which is unexpected.

The problem can be triggered in the following ways:

  # echo 1 > /sys/kernel/debug/tracing/options/blk_classic
  # echo 1 > /sys/kernel/debug/tracing/events/enable
  # echo blk > /sys/kernel/debug/tracing/current_tracer
  # cat /sys/kernel/debug/tracing/trace_pipe

Fixes: c71a896154 ("blktrace: add ftrace plugin")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-08 09:26:11 -07:00
Michael Ellerman f24f21c412 Merge branch 'topic/objtool' into next
Merge the powerpc objtool support, which we were keeping in a topic
branch in case of any merge conflicts.
2022-12-08 23:57:47 +11:00
Petr Mladek 6b2b0d839a Merge branch 'rework/console-list-lock' into for-linus 2022-12-08 11:46:56 +01:00
David Vernet 36aa10ffd6 bpf/docs: Document struct cgroup * kfuncs
bpf_cgroup_acquire(), bpf_cgroup_release(), bpf_cgroup_kptr_get(), and
bpf_cgroup_ancestor(), are kfuncs that were recently added to
kernel/bpf/helpers.c. These are "core" kfuncs in that they're available
for use in any tracepoint or struct_ops BPF program. Though they have no
ABI stability guarantees, we should still document them. This patch adds
a struct cgroup * subsection to the Core kfuncs section which describes
each of these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 17:11:24 -08:00
David Vernet 25c5e92d19 bpf/docs: Document struct task_struct * kfuncs
bpf_task_acquire(), bpf_task_release(), and bpf_task_from_pid() are
kfuncs that were recently added to kernel/bpf/helpers.c. These are
"core" kfuncs in that they're available for use for any tracepoint or
struct_ops BPF program. Though they have no ABI stability guarantees, we
should still document them. This patch adds a new Core kfuncs section to
the BPF kfuncs doc, and adds entries for all of these task kfuncs.

Note that bpf_task_kptr_get() is not documented, as it still returns
NULL while we're working to resolve how it can use RCU to ensure struct
task_struct * lifetime.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 17:09:13 -08:00
Andrii Nakryiko dcb2288b1f bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()
Number of total instructions in BPF program (including subprogs) can and
is accessed from env->prog->len. visit_func_call_insn() doesn't do any
checks against insn_cnt anymore, relying on push_insn() to do this check
internally. So remove unnecessary insn_cnt input argument from
visit_func_call_insn() and visit_insn() functions.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221207195534.2866030-1-andrii@kernel.org
2022-12-07 22:58:41 +01:00
Alexei Starovoitov 0a6ea1ce82 for-alexei-2022120701
-----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCAA8FiEEoEVH9lhNrxiMPSyI7MXwXhnZSjYFAmOQpWweHGJlbmphbWlu
 LnRpc3NvaXJlc0ByZWRoYXQuY29tAAoJEOzF8F4Z2Uo23ooQAJR4JBv+WKxyDplY
 m2Kk1t156kenJNhyRojwNWlYk7S0ziClwfjnJEsiki4S0RAwHcVNuuMLjKSjcDIP
 TFrs3kFIlgLITpkPFdMIqMniq0Fynb3N5QDsaohQPQvtLeDx5ASH9D6J+20bcdky
 PE+xOo1Nkn1DpnBiGX7P6irMsqrm5cXfBES2u9c7He9VLThviP2v+TvB80gmRi7w
 zUU4Uikcr8wlt+9MZoLVoVwAOg5aZmVa/9ogNqaT+cKnW6hQ+3CymxiyiyOdRrAQ
 e521+GhQOVTiM0w5C6BwhMx+Wu8r0Qz4Vp49UWf04U/KU+M1TzqAk1z7Vvt72TCr
 965qb19TSRNTGQzebAIRd09mFb/nech54dhpyceONBGnUs9r2dDWjfDd/PA7e2WO
 FbDE0HGnz/XK7GUrk/BXWU+n9VA7itnhJzB+zr3i6IKFgwwDJ1V4e81CWdBEsp9I
 WNDC8LF2bcgHvzFVC23AkKujmbirS6K4Wq+R0f2PISQIs2FdUBl1mgjh2E47lK8E
 zCozMRf9bMya5aGkd4S4dtn0NFGByFSXod2TMgfHPvBz06t6YG00DajALzcE5l8U
 GAoP5Nz9hRSbmHJCNMqy0SN0WN9Cz+JIFx5Vlb9az3lduRRBOVptgnjx9LOjErVr
 +aWWxuQgoHZmB5Ja5WNVN1lIf39/
 =FX5W
 -----END PGP SIGNATURE-----

Merge "do not rely on ALLOW_ERROR_INJECTION for fmod_ret" into bpf-next

Merge commit 5b481acab4 ("bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret")
from hid tree into bpf-next.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 13:49:21 -08:00
Stephen Boyd 169a58ad82 module/decompress: Support zstd in-kernel decompression
Add support for zstd compressed modules to the in-kernel decompression
code. This allows zstd compressed modules to be decompressed by the
kernel, similar to the existing support for gzip and xz compressed
modules.

Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Piotr Gorski <lucjan.lucjanov@gmail.com>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-12-07 12:05:05 -08:00
Benjamin Tissoires 5b481acab4 bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret
The current way of expressing that a non-bpf kernel component is willing
to accept that bpf programs can be attached to it and that they can change
the return value is to abuse ALLOW_ERROR_INJECTION.
This is debated in the link below, and the result is that it is not a
reasonable thing to do.

Reuse the kfunc declaration structure to also tag the kernel functions
we want to be fmodret. This way we can control from any subsystem which
functions are being modified by bpf without touching the verifier.

Link: https://lore.kernel.org/all/20221121104403.1545f9b5@gandalf.local.home/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20221206145936.922196-2-benjamin.tissoires@redhat.com
2022-12-07 15:31:08 +01:00
Andrii Nakryiko 618945fbed bpf: remove unnecessary prune and jump points
Don't mark some instructions as jump points when there are actually no
jumps and instructions are just processed sequentially. Such case is
handled naturally by precision backtracking logic without the need to
update jump history. See get_prev_insn_idx(). It goes back linearly by
one instruction, unless current top of jmp_history is pointing to
current instruction. In such case we use `st->jmp_history[cnt - 1].prev_idx`
to find instruction from which we jumped to the current instruction
non-linearly.

Also remove both jump and prune point marking for instruction right
after unconditional jumps, as program flow can get to the instruction
right after unconditional jump instruction only if there is a jump to
that instruction from somewhere else in the program. In such case we'll
mark such instruction as prune/jump point because it's a destination of
a jump.

This change has no changes in terms of number of instructions or states
processes across Cilium and selftests programs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20221206233345.438540-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Andrii Nakryiko a095f42105 bpf: mostly decouple jump history management from is_state_visited()
Jump history updating and state equivalence checks are conceptually
independent, so move push_jmp_history() out of is_state_visited(). Also
make a decision whether to perform state equivalence checks or not one
layer higher in do_check(), keeping is_state_visited() unconditionally
performing state checks.

push_jmp_history() should be performed after state checks. There is just
one small non-uniformity. When is_state_visited() finds already
validated equivalent state, it propagates precision marks to current
state's parent chain. For this to work correctly, jump history has to be
updated, so is_state_visited() is doing that internally.

But if no equivalent verified state is found, jump history has to be
updated in a newly cloned child state, so is_jmp_point()
+ push_jmp_history() is performed after is_state_visited() exited with
zero result, which means "proceed with validation".

This change has no functional changes. It's not strictly necessary, but
feels right to decouple these two processes.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Andrii Nakryiko bffdeaa8a5 bpf: decouple prune and jump points
BPF verifier marks some instructions as prune points. Currently these
prune points serve two purposes.

It's a point where verifier tries to find previously verified state and
check current state's equivalence to short circuit verification for
current code path.

But also currently it's a point where jump history, used for precision
backtracking, is updated. This is done so that non-linear flow of
execution could be properly backtracked.

Such coupling is coincidental and unnecessary. Some prune points are not
part of some non-linear jump path, so don't need update of jump history.
On the other hand, not all instructions which have to be recorded in
jump history necessarily are good prune points.

This patch splits prune and jump points into independent flags.
Currently all prune points are marked as jump points to minimize amount
of changes in this patch, but next patch will perform some optimization
of prune vs jmp point placement.

No functional changes are intended.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Dave Marchevsky d8939cb0a0 bpf: Loosen alloc obj test in verifier's reg_btf_record
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d5 ("bpf: Allow locking bpf_spin_lock in allocated objects")
Link: https://lore.kernel.org/r/20221206231000.3180914-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 18:40:25 -08:00
David Vernet 156ed20d22 bpf: Don't use rcu_users to refcount in task kfuncs
A series of prior patches added some kfuncs that allow struct
task_struct * objects to be used as kptrs. These kfuncs leveraged the
'refcount_t rcu_users' field of the task for performing refcounting.
This field was used instead of 'refcount_t usage', as we wanted to
leverage the safety provided by RCU for ensuring a task's lifetime.

A struct task_struct is refcounted by two different refcount_t fields:

1. p->usage:     The "true" refcount field which task lifetime. The
		 task is freed as soon as this refcount drops to 0.

2. p->rcu_users: An "RCU users" refcount field which is statically
		 initialized to 2, and is co-located in a union with
		 a struct rcu_head field (p->rcu). p->rcu_users
		 essentially encapsulates a single p->usage
		 refcount, and when p->rcu_users goes to 0, an RCU
		 callback is scheduled on the struct rcu_head which
		 decrements the p->usage refcount.

Our logic was that by using p->rcu_users, we would be able to use RCU to
safely issue refcount_inc_not_zero() a task's rcu_users field to
determine if a task could still be acquired, or was exiting.
Unfortunately, this does not work due to p->rcu_users and p->rcu sharing
a union. When p->rcu_users goes to 0, an RCU callback is scheduled to
drop a single p->usage refcount, and because the fields share a union,
the refcount immediately becomes nonzero again after the callback is
scheduled.

If we were to split the fields out of the union, this wouldn't be a
problem. Doing so should also be rather non-controversial, as there are
a number of places in struct task_struct that have padding which we
could use to avoid growing the structure by splitting up the fields.

For now, so as to fix the kfuncs to be correct, this patch instead
updates bpf_task_acquire() and bpf_task_release() to use the p->usage
field for refcounting via the get_task_struct() and put_task_struct()
functions. Because we can no longer rely on RCU, the change also guts
the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions
pending a resolution on the above problem.

In addition, the task fixes the kfunc and rcu_read_lock selftests to
expect this new behavior.

Fixes: 90660309b0 ("bpf: Add kfuncs for storing struct task_struct * as a kptr")
Fixes: fca1aa7551 ("bpf: Handle MEM_RCU type properly")
Reported-by: Matus Jokay <matus.jokay@stuba.sk>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 16:40:16 -08:00
Will Deacon a4aebff7ef Merge branch 'for-next/ftrace' into for-next/core
* for-next/ftrace:
  ftrace: arm64: remove static ftrace
  ftrace: arm64: move from REGS to ARGS
  ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses
  ftrace: rename ftrace_instruction_pointer_set() -> ftrace_regs_set_instruction_pointer()
  ftrace: pass fregs to arch_ftrace_set_direct_caller()
2022-12-06 11:07:39 +00:00
Rafael J. Wysocki 96d4b8e1ad PM: sleep: Refine error message in try_to_freeze_tasks()
A previous change amended try_to_freeze_tasks() with the "what"
variable pointing to a string describing the group of tasks subject to
the freezing which may be used in the error message in there too, so
make that happen.

Accordingly, update sleepgraph.py to catch the modified error message
as appropriate.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
2022-12-06 12:04:34 +01:00
Rafael J. Wysocki a449dfbfc0 PM: sleep: Avoid using pr_cont() in the tasks freezing code
Using pr_cont() in the tasks freezing code related to system-wide
suspend and hibernation is problematic, because the continuation
messages printed there are susceptible to interspersing with other
unrelated messages which results in output that is hard to
understand.

Address this issue by modifying try_to_freeze_tasks() to print
messages that don't require continuations and adjusting its
callers accordingly.

Reported-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
2022-12-06 12:04:10 +01:00
Thomas Gleixner 3d393b2174 genirq/msi: Provide msi_domain_alloc_irq_at()
For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
support a separate interface is required which allows not only the
allocation of a specific index, but also the allocation of any, i.e. the
next free index. The latter is especially required for IMS because IMS
completely does away with index to functionality mappings which are
often found in MSI/MSI-X implementation.

But even with MSI-X there are devices where only the first few indices have
a fixed functionality and the rest is freely assignable by software,
e.g. to queues.

msi_domain_alloc_irq_at() is also different from the range based interfaces
as it always enforces that the MSI descriptor is allocated by the core code
and not preallocated by the caller like the PCI/MSI[-X] enable code path
does.

msi_domain_alloc_irq_at() can be invoked with the index argument set to
MSI_ANY_INDEX which makes the core code pick the next free index. The irq
domain can provide a prepare_desc() operation callback in it's
msi_domain_ops to do domain specific post allocation initialization before
the actual Linux interrupt and the associated interrupt descriptor and
hierarchy alloccations are conducted.

The function also takes an optional @icookie argument which is of type
union msi_instance_cookie. This cookie is not used by the core code and is
stored in the allocated msi_desc::data::icookie. The meaning of the cookie
is completely implementation defined. In case of IMS this might be a PASID
or a pointer to a device queue, but for the MSI core it's opaque and not
used in any way.

The function returns a struct msi_map which on success contains the
allocated index number and the Linux interrupt number so the caller can
spare the index to Linux interrupt number lookup.

On failure map::index contains the error code and map::virq is 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.501359457@linutronix.de
2022-12-05 22:22:34 +01:00
Thomas Gleixner 8f986fd775 genirq/msi: Provide msi_domain_ops:: Prepare_desc()
The existing MSI domain ops msi_prepare() and set_desc() turned out to be
unsuitable for implementing IMS support.

msi_prepare() does not operate on the MSI descriptors. set_desc() lacks
an irq_domain pointer and has a completely different purpose.

Introduce a prepare_desc() op which allows IMS implementations to amend an
MSI descriptor which was allocated by the core code, e.g. by adjusting the
iomem base or adding some data based on the allocated index. This is way
better than requiring that all IMS domain implementations preallocate the
MSI descriptor and then allocate the interrupt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.444560717@linutronix.de
2022-12-05 22:22:33 +01:00
Thomas Gleixner bd141a3db4 genirq/msi: Provide BUS_DEVICE_PCI_MSI[X]
Provide new bus tokens for the upcoming per device PCI/MSI and PCI/MSIX
interrupt domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.917219885@linutronix.de
2022-12-05 22:22:32 +01:00
Thomas Gleixner 36db3d9003 genirq/msi: Add range checking to msi_insert_desc()
Per device domains provide the real domain size to the core code. This
allows range checking on insertion of MSI descriptors and also paves the
way for dynamic index allocations which are required e.g. for IMS. This
avoids external mechanisms like bitmaps on the device side and just
utilizes the core internal MSI descriptor storxe for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.798556374@linutronix.de
2022-12-05 22:22:32 +01:00
Linus Torvalds bce9332220 proc: proc_skip_spaces() shouldn't think it is working on C strings
proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.

Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace).  So use that as
inspiration, odd coding and all.

Now the calling convention actually makes sense and works for the
intended purpose.

Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05 12:09:06 -08:00
Linus Torvalds e6cfaf34be proc: avoid integer type confusion in get_proc_long
proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length.  Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).

So do the proper test in the rigth type.

Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05 11:33:40 -08:00
Thomas Gleixner 26e91b75bf genirq/msi: Provide msi_match_device_domain()
Provide an interface to match a per device domain bus token. This allows to
query which type of domain is installed for a particular domain id. Will be
used for PCI to avoid frequent create/remove cycles for the MSI resp. MSI-X
domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.738047902@linutronix.de
2022-12-05 19:21:01 +01:00
Thomas Gleixner 27a6dea3eb genirq/msi: Provide msi_create/free_device_irq_domain()
Now that all prerequsites are in place, provide the actual interfaces for
creating and removing per device interrupt domains.

MSI device interrupt domains are created from the provided
msi_domain_template which is duplicated so that it can be modified for the
particular device.

The name of the domain and the name of the interrupt chip are composed by
"$(PREFIX)$(CHIPNAME)-$(DEVNAME)"

  $PREFIX:   The optional prefix provided by the underlying MSI parent domain
             via msi_parent_ops::prefix.
  $CHIPNAME: The name of the irq_chip in the template
  $DEVNAME:  The name of the device

The domain is further initialized through a MSI parent domain callback which
fills in the required functionality for the parent domain or domains further
down the hierarchy. This initialization can fail, e.g. when the requested
feature or MSI domain type cannot be supported.

The domain pointer is stored in the pointer array inside of msi_device_data
which is attached to the domain.

The domain can be removed via the API or left for disposal via devres when
the device is torn down. The API removal is useful e.g. for PCI to have
seperate domains for MSI and MSI-X, which are mutually exclusive and always
occupy the default domain id slot.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.678838546@linutronix.de
2022-12-05 19:21:01 +01:00
Thomas Gleixner a80c0aceea genirq/msi: Split msi_create_irq_domain()
Split the functionality of msi_create_irq_domain() so it can
be reused for creating per device irq domains.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.559086358@linutronix.de
2022-12-05 19:21:01 +01:00
Thomas Gleixner 61bf992fc6 genirq/msi: Add size info to struct msi_domain_info
To allow proper range checking especially for dynamic allocations add a
size field to struct msi_domain_info. If the field is 0 then the size is
unknown or unlimited (up to MSI_MAX_INDEX) to provide backwards
compability.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.501144862@linutronix.de
2022-12-05 19:21:01 +01:00
Thomas Gleixner b78780d93b genirq/msi: Provide struct msi_parent_ops
MSI parent domains must have some control over the MSI domains which are
built on top. On domain creation they need to fill in e.g. architecture
specific chip callbacks or msi domain ops to make the outermost domain
parent agnostic which is obviously required for architecture independence
etc.

The structure contains:

    1) A bitfield which exposes the supported functional features. This
       allows to check for features and is also used in the initialization
       callback to mask out unsupported features when the actual domain
       implementation requests a broader range, e.g. on x86 PCI multi-MSI
       is only supported by remapping domains but not by the underlying
       vector domain. The PCI/MSI code can then always request multi-MSI
       support, but the resulting feature set after creation might not
       have it set.

    2) An optional string prefix which is put in front of domain and chip
       names during creation of the MSI domain. That allows to keep the
       naming schemes e.g. on x86 where PCI-MSI domains have a IR- prefix
       when interrupt remapping is enabled.

    3) An initialization callback to sanity check the domain info of
       the to be created MSI domain, to restrict features and to
       apply changes in MSI ops and interrupt chip callbacks to
       accomodate to the particular MSI parent implementation and/or
       the underlying hierarchy.

Add a conveniance function to delegate the initialization from the
MSI parent domain to an underlying domain in the hierarchy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232325.382485843@linutronix.de
2022-12-05 19:21:01 +01:00
Thomas Gleixner c459f11f32 genirq/msi: Remove unused alloc/free interfaces
Now that all users are converted remove the old interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.694291814@linutronix.de
2022-12-05 19:21:00 +01:00
Thomas Gleixner f2480e7dac genirq/msi: Provide new domain id allocation functions
Provide two sorts of interfaces to handle the different use cases:

  - msi_domain_alloc_irqs_range():

	Handles a caller defined precise range

  - msi_domain_alloc_irqs_all():

	Allocates all interrupts associated to a domain by scanning the
    	allocated MSI descriptors

The latter is useful for the existing PCI/MSI support which does not have
range information available.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.396497163@linutronix.de
2022-12-05 19:21:00 +01:00
Thomas Gleixner 4cd5f4403f genirq/msi: Provide new domain id based interfaces for freeing interrupts
Provide two sorts of interfaces to handle the different use cases:

  - msi_domain_free_irqs_range():

	Handles a caller defined precise range

  - msi_domain_free_irqs_all():

	Frees all interrupts associated to a domain

The latter is useful for device teardown and to handle the legacy MSI support
which does not have any range information available.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.337844751@linutronix.de
2022-12-05 19:21:00 +01:00
Thomas Gleixner 40742716f2 genirq/msi: Make msi_add_simple_msi_descs() device domain aware
Allocating simple interrupt descriptors in the core code has to be multi
device irqdomain aware for the upcoming PCI/IMS support.

Change the interfaces to take a domain id into account. Use the internal
control struct for transport of arguments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.279112474@linutronix.de
2022-12-05 19:21:00 +01:00
Thomas Gleixner 377712c5a4 genirq/msi: Make descriptor freeing domain aware
Change the descriptor free functions to take a domain id to prepare for the
upcoming multi MSI domain per device support.

To avoid changing and extending the interfaces over and over use an core
internal control struct and hand the pointer through the various functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.220788011@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner fc8ab38832 genirq/msi: Make descriptor allocation device domain aware
Change the descriptor allocation and insertion functions to take a domain
id to prepare for the upcoming multi MSI domain per device support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.163043028@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner 1c89396300 genirq/msi: Rename msi_add_msi_desc() to msi_insert_msi_desc()
This reflects the functionality better. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.103554618@linutronix.de
2022-12-05 19:20:59 +01:00
Ahmed S. Darwish 98043704f3 genirq/msi: Make msi_get_virq() device domain aware
In preparation of the upcoming per device multi MSI domain support, change
the interface to support lookups based on domain id and zero based index
within the domain.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.044613697@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner 94ff94cfea genirq/msi: Make MSI descriptor iterators device domain aware
To support multiple MSI interrupt domains per device it is necessary to
segment the xarray MSI descriptor storage. Each domain gets up to
MSI_MAX_INDEX entries.

Change the iterators so they operate with domain ids and take the domain
offsets into account.

The publicly available iterators which are mostly used in legacy
implementations and the PCI/MSI core default to MSI_DEFAULT_DOMAIN (0)
which is the id for the existing "global" domains.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.985498981@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner 64258eaa44 genirq/msi: Add pointers for per device irq domains
With the upcoming per device MSI interrupt domain support it is necessary
to store the domain pointers per device.

Instead of delegating that storage to device drivers or subsystems add a
domain pointer to the msi_dev_domain array in struct msi_device_data.

This pointer is also used to take care of tearing down the irq domains when
msi_device_data is cleaned up via devres.

The interfaces into the MSI core will be changed from irqdomain pointer
based interfaces to domain id based interfaces to support multiple MSI
domains on a single device (e.g. PCI/MSI[-X] and PCI/IMS.

Once the per device domain support is complete the irq domain pointer in
struct device::msi.domain will not longer contain a pointer to the "global"
MSI domain. It will contain a pointer to the MSI parent domain instead.

It would be a horrible maze of conditionals to evaluate all over the place
which domain pointer should be used, i.e. the "global" one in
device::msi::domain or one from the internal pointer array.

To avoid this evaluate in msi_setup_device_data() whether the irq domain
which is associated to a device is a "global" or a parent MSI domain. If it
is global then copy the pointer into the first entry of the msi_dev_domain
array.

This allows to convert interfaces and implementation to domain ids while
keeping everything existing working.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.923860399@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner f1139f905b genirq/msi: Move xarray into a separate struct and create an array
The upcoming support for multiple MSI domains per device requires storage
for the MSI descriptors and in a second step storage for the irqdomain
pointers.

Move the xarray into a separate data structure msi_dev_domain and create an
array with size 1 in msi_device_data, which can be expanded later when the
support for per device domains is implemented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.864887773@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner 3e86a3a3ed genirq/msi: Check for invalid MSI parent domain usage
In the upcoming per device MSI domain concept the MSI parent domains are
not allowed to be used as regular MSI domains where the MSI allocation/free
operations are applicable.

Add appropriate checks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.806128070@linutronix.de
2022-12-05 19:20:59 +01:00
Thomas Gleixner 6a9fc4190c genirq/irqdomain: Rename irq_domain::dev to irq_domain:: Pm_dev
irq_domain::dev is a misnomer as it's usually the rule that a device
pointer points to something which is directly related to the instance.

irq_domain::dev can point to some other device for power management to
ensure that this underlying device is not powered down when an interrupt is
allocated.

The upcoming per device MSI domains really require a pointer to the device
which instantiated the irq domain and not to some random other device which
is required for power management down the chain.

Rename irq_domain::dev to irq_domain::pm_dev and fixup the few sites which
use that pointer.

Conversion was done with the help of coccinelle.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.574541683@linutronix.de
2022-12-05 19:20:58 +01:00
Thomas Gleixner 3dad5f9ad9 genirq/msi: Move IRQ_DOMAIN_MSI_NOMASK_QUIRK to MSI flags
It's truly a MSI only flag and for the upcoming per device MSI domains this
must be in the MSI flags so it can be set during domain setup without
exposing this quirk outside of x86.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.454246167@linutronix.de
2022-12-05 19:20:58 +01:00
Yonghong Song 2c40d97da1 bpf: Enable sleeptable support for cgrp local storage
Similar to sk/inode/task local storage, enable sleepable support for
cgrp local storage.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 16:51:18 -08:00
Yonghong Song c0c852dd18 bpf: Do not mark certain LSM hook arguments as trusted
Martin mentioned that the verifier cannot assume arguments from
LSM hook sk_alloc_security being trusted since after the hook
is called, the sk ref_count is set to 1. This will overwrite
the ref_count changed by the bpf program and may cause ref_count
underflow later on.

I then further checked some other hooks. For example,
for bpf_lsm_file_alloc() hook in fs/file_table.c,

        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
                file_free_rcu(&f->f_rcuhead);
                return ERR_PTR(error);
        }

        atomic_long_set(&f->f_count, 1);

The input parameter 'f' to security_file_alloc() cannot be trusted
as well.

Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free
lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other
hooks should not be considered as trusted. This may not be a complete
list, but it covers common usage for sk and task.

Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 12:59:58 -08:00
Yonghong Song fca1aa7551 bpf: Handle MEM_RCU type properly
Commit 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
so that it can be passed into kfuncs or helpers as an argument.

Martin raised a good question in [1] such that the rcu pointer,
although being able to accessing the object, might have reference
count of 0. This might cause a problem if the rcu pointer is passed
to a kfunc which expects trusted arguments where ref count should
be greater than 0.

This patch makes the following changes related to MEM_RCU pointer:
  - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
  - Introduce KF_RCU so MEM_RCU ptr can be acquired with
    a KF_RCU tagged kfunc which assumes ref count of rcu ptr
    could be zero.
  - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
    'a' is tagged with __rcu as well. Let us mark 'b' as
    MEM_RCU | PTR_MAYBE_NULL.

 [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/

Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 12:52:40 -08:00
Linus Torvalds 0c3b5bcb48 - Fix a use-after-free case where the perf pending task callback would
see an already freed event
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOMqHUACgkQEsHwGGHe
 VUpRzw/9Gow+0wbm2XhMuweUA6t3LgNweOmzDl9w8k1f55OD6niCvuDiF9jSaiKZ
 UwGyErasp2dlEVjuNGnp42qSHos3vRiR7sdZZQG+7opWV2FFyxyFpx5x8UEgVnFy
 gOuEij5vLXBApUdNRAcVqCbvivs4Lv6SggDyQ075zGzuOmUv57vw2jDt8YfKaFcp
 jZTiL+j5GKwihndDB6ayx+7Gwo9a9ASKrTgz8JK2tPOIHZR4X9y9ot1IanZnxzwF
 d0kFpLgF/ZqjPRpJoaFn/jgk1AfahQyYHXh7lQ1aP7rLSLRRGcfTBX4n9nC3BYT+
 EHaA94l151L1mzbR69ij9tryAERU4NlguD/FIuCeW+6IEPiuwBNGklXF+rRegNj4
 IYC0ZSld/NyWKtOrwNSrFRMsxFm583Pg6TaBkvU1rGd5YVQ7GImrj7UjecXO/W71
 iXpfarF7ur2zmd+5+F5FB34VYw8GumRo+D+XIb34+8UMBURTX36hgXvSC3sVyyCw
 b0c758F3+1zTwm8z52T1RhOOp47t5iWAznwTq6k1cT7788PDXJ9sGYXIpdLpwKcI
 Fuj61alwamGeUciCr0iKGtCLRHayZII7OeQh1VjXuqgCwI3hI2j3EaI9C74WSApn
 ttVInS0Ka2xcu//A1VFltkMOWNMQK9JeTlqdqctwypTL3WVb2XA=
 =jo4r
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Fix a use-after-free case where the perf pending task callback would
   see an already freed event

* tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix perf_pending_task() UaF
2022-12-04 12:36:23 -08:00
haifeng.xu 3a017d6355 signal: Initialize the info in ksignal
When handing the SIGNAL_GROUP_EXIT flag, the info in ksignal isn't cleared.
However, the info acquired by dequeue_synchronous_signal/dequeue_signal is
initialized and can be safely used. Fortunately, the fatal signal process
just uses the si_signo and doesn't use any other member. Even so, the
initialization before use is more safer.

Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221128065606.19570-1-haifeng.xu@shopee.com
2022-12-02 13:04:44 -08:00
Kees Cook 8b05aa2633 panic: Expose "warn_count" to sysfs
Since Warn count is now tracked and is a fairly interesting signal, add
the entry /sys/kernel/warn_count to expose it to userspace.

Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-6-keescook@chromium.org
2022-12-02 13:04:44 -08:00
Kees Cook 9fc9e278a5 panic: Introduce warn_limit
Like oops_limit, add warn_limit for limiting the number of warnings when
panic_on_warn is not set.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
2022-12-02 13:04:44 -08:00
Kees Cook 79cc1ba7ba panic: Consolidate open-coded panic_on_warn checks
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll
their own warnings, and each check "panic_on_warn". Consolidate this
into a single function so that future instrumentation can be added in
a single location.

Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
2022-12-02 13:04:44 -08:00
Kees Cook de92f65719 exit: Allow oops_limit to be disabled
In preparation for keeping oops_limit logic in sync with warn_limit,
have oops_limit == 0 disable checking the Oops counter.

Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2022-12-02 13:04:39 -08:00
Kuniyuki Iwashima a1140cb215 seccomp: Move copy_seccomp() to no failure path.
Our syzbot instance reported memory leaks in do_seccomp() [0], similar
to the report [1].  It shows that we miss freeing struct seccomp_filter
and some objects included in it.

We can reproduce the issue with the program below [2] which calls one
seccomp() and two clone() syscalls.

The first clone()d child exits earlier than its parent and sends a
signal to kill it during the second clone(), more precisely before the
fatal_signal_pending() test in copy_process().  When the parent receives
the signal, it has to destroy the embryonic process and return -EINTR to
user space.  In the failure path, we have to call seccomp_filter_release()
to decrement the filter's refcount.

Initially, we called it in free_task() called from the failure path, but
the commit 3a15fb6ed9 ("seccomp: release filter after task is fully
dead") moved it to release_task() to notify user space as early as possible
that the filter is no longer used.

To keep the change and current seccomp refcount semantics, let's move
copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in
free_task() for future debugging.

[0]:
unreferenced object 0xffff8880063add00 (size 256):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffffc90000035000 (size 4096):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    __vmalloc_node_range (mm/vmalloc.c:3226)
    __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4))
    bpf_prog_alloc_no_stats (kernel/bpf/core.c:91)
    bpf_prog_alloc (kernel/bpf/core.c:129)
    bpf_prog_create_from_user (net/core/filter.c:1414)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff888003fa1000 (size 1024):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95)
    bpf_prog_alloc (kernel/bpf/core.c:129)
    bpf_prog_create_from_user (net/core/filter.c:1414)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff888006360240 (size 16):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 16 bytes):
    01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff  ..7.verl........
  backtrace:
    bpf_prog_store_orig_filter (net/core/filter.c:1137)
    bpf_prog_create_from_user (net/core/filter.c:1428)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff8880060183e0 (size 8):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 8 bytes):
    06 00 00 00 00 00 ff 7f                          ........
  backtrace:
    kmemdup (mm/util.c:129)
    bpf_prog_store_orig_filter (net/core/filter.c:1144)
    bpf_prog_create_from_user (net/core/filter.c:1428)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a987

[2]:
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

void main(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog fprog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};
	long i, pid;

	syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);

	for (i = 0; i < 2; i++) {
		pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0);
		if (pid == 0)
			return;
	}
}

Fixes: 3a15fb6ed9 ("seccomp: release filter after task is fully dead")
Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com
Reported-by: Ayushman Dutta <ayudutta@amazon.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com
2022-12-02 11:32:53 -08:00
Jason Gunthorpe 90337f526c Merge tag 'v6.1-rc7' into iommufd.git for-next
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version.
The rc fix was done a different way when iommufd patches reworked this
code.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02 12:04:39 -04:00
Vincent Donnefort 6f855b39e4 cpu/hotplug: Do not bail-out in DYING/STARTING sections
The DYING/STARTING callbacks are not expected to fail. However, as reported
by Derek, buggy drivers such as tboot are still free to return errors
within those sections, which halts the hot(un)plug and leaves the CPU in an
unrecoverable state.

As there is no rollback possible, only log the failures and proceed with
the following steps.

This restores the hotplug behaviour prior to commit 453e410851
("cpu/hotplug: Add cpuhp_invoke_callback_range()")

Fixes: 453e410851 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
Reported-by: Derek Dolney <z23@posteo.net>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Derek Dolney <z23@posteo.net>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867
Link: https://lore.kernel.org/r/20220927101259.1149636-1-vdonnefort@google.com
2022-12-02 12:43:02 +01:00
Phil Auld d385febc9a cpu/hotplug: Set cpuhp target for boot cpu
Since the boot cpu does not go through the hotplug process it ends
up with state == CPUHP_ONLINE but target == CPUHP_OFFLINE.
So set the target to match in boot_cpu_hotplug_init().

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-3-pauld@redhat.com
2022-12-02 12:43:02 +01:00
Phil Auld 64ea6e44f8 cpu/hotplug: Make target_store() a nop when target == state
Writing the current state back in hotplug/target calls cpu_down()
which will set cpu dying even when it isn't and then nothing will
ever clear it. A stress test that reads values and writes them back
for all cpu device files in sysfs will trigger the BUG() in
select_fallback_rq once all cpus are marked as dying.

kernel/cpu.c::target_store()
	...
        if (st->state < target)
                ret = cpu_up(dev->id, target);
        else
                ret = cpu_down(dev->id, target);

cpu_down() -> cpu_set_state()
	 bool bringup = st->state < target;
	 ...
	 if (cpu_dying(cpu) != !bringup)
		set_cpu_dying(cpu, !bringup);

Fix this by letting state==target fall through in the target_store()
conditional. Also make sure st->target == target in that case.

Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
2022-12-02 12:43:02 +01:00
Alexey Izbyshev 90d7588967 futex: Resend potentially swallowed owner death notification
Commit ca16d5bee5 ("futex: Prevent robust futex exit race") addressed
two cases when tasks waiting on a robust non-PI futex remained blocked
despite the futex not being owned anymore:

* if the owner died after writing zero to the futex word, but before
  waking up a waiter

* if a task waiting on the futex was woken up, but died before updating
  the futex word (effectively swallowing the notification without acting
  on it)

In the second case, the task could be woken up either by the previous
owner (after the futex word was reset to zero) or by the kernel (after
the OWNER_DIED bit was set and the TID part of the futex word was reset
to zero) if the previous owner died without the resetting the futex.

Because the referenced commit wakes up a potential waiter only if the
whole futex word is zero, the latter subcase remains unaddressed.

Fix this by looking only at the TID part of the futex when deciding
whether a wake up is needed.

Fixes: ca16d5bee5 ("futex: Prevent robust futex exit race")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru
2022-12-02 12:20:24 +01:00
John Ogness 5074ffbec6 printk: htmldocs: add missing description
Variable and return descriptions were missing from the SRCU read
lock functions. Add them.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87zgcjpdvo.fsf@jogness.linutronix.de
2022-12-02 11:25:02 +01:00
John Ogness 848a9c1066 printk: relieve console_lock of list synchronization duties
The console_list_lock provides synchronization for console list and
console->flags updates. All call sites that were using the console_lock
for this synchronization have either switched to use the
console_list_lock or the SRCU list iterator.

Remove console_lock usage for console list updates and console->flags
updates.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-40-john.ogness@linutronix.de
2022-12-02 11:25:02 +01:00
John Ogness 6f8836756f printk, xen: fbfront: create/use safe function for forcing preferred
With commit 9e124fe16ff2("xen: Enable console tty by default in domU
if it's not a dummy") a hack was implemented to make sure that the
tty console remains the console behind the /dev/console device. The
main problem with the hack is that, after getting the console pointer
to the tty console, it is assumed the pointer is still valid after
releasing the console_sem. This assumption is incorrect and unsafe.

Make the hack safe by introducing a new function
console_force_preferred_locked() and perform the full operation
under the console_list_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-33-john.ogness@linutronix.de
2022-12-02 11:25:02 +01:00
John Ogness 1fd4224a6b console: introduce console_is_registered()
Currently it is not possible for drivers to detect if they have
already successfully registered their console. Several drivers
have multiple paths that lead to console registration. To avoid
attempting a 2nd registration (which leads to a WARN), drivers
are implementing their own solution.

Introduce console_is_registered() so drivers can easily identify
if their console is currently registered. A _locked() variant
is also provided if the caller is already holding the
console_list_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-22-john.ogness@linutronix.de
2022-12-02 11:25:01 +01:00
John Ogness 8cb15f7f49 printk: console_device: use srcu console list iterator
Use srcu console list iteration for console list traversal. It is
acceptable because the consoles might come and go at any time.
Strict synchronizing with console registration code would not bring
any advantage over srcu.

Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-21-john.ogness@linutronix.de
2022-12-02 11:25:01 +01:00
John Ogness 87f2e4b7d9 printk: console_flush_on_panic: use srcu console list iterator
With SRCU it is now safe to traverse the console list, even if
the console_trylock() failed. However, overwriting console->seq
when console_trylock() failed is still an issue.

Switch to SRCU iteration and document remaining issue with
console->seq.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-20-john.ogness@linutronix.de
2022-12-02 11:25:01 +01:00
John Ogness d792db6f6b printk: console_unblank: use srcu console list iterator
Use srcu console list iteration for console list traversal.

Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-19-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness 12f1da5fc4 printk: console_is_usable: use console_srcu_read_flags
All users of console_is_usable() are SRCU iterators. Use the
appropriate wrapper function to locklessly read the flags.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-18-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness eb7f1ed250 printk: __pr_flush: use srcu console list iterator
Use srcu console list iteration for console list traversal.

Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-17-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness fc956ae0de printk: console_flush_all: use srcu console list iterator
Guarantee safe iteration of the console list by using SRCU.

Note that in the case of a handover, the SRCU read lock is also
released. This is documented in the function description and as
comments in the code. It is a bit tricky, but this preserves the
lockdep lock ordering for the context handing over the
console_lock:

  console_lock()
  | mutex_acquire(&console_lock_dep_map)       <-- console lock
  |
  console_unlock()
  | console_flush_all()
  | | srcu_read_lock(&console_srcu)            <-- srcu lock
  | | console_emit_next_record()
  | | | console_lock_spinning_disable_and_check()
  | | | | srcu_read_unlock(&console_srcu)      <-- srcu unlock
  | | | | mutex_release(&console_lock_dep_map) <-- console unlock

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-16-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness b8ef04be6e kdb: use srcu console list iterator
Guarantee safe iteration of the console list by using SRCU.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-15-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness 100bdef2c1 console: introduce wrappers to read/write console flags
After switching to SRCU for console list iteration, some readers
will begin readings console->flags as a data race. Locklessly
reading console->flags provides a consistent value because there
is at most one CPU modifying console->flags and that CPU is
using only read-modify-write operations.

Introduce a wrapper for SRCU iterators to read console flags.
Introduce a matching wrapper to write to flags of registered
consoles. Writing to flags of registered consoles is synchronized
by the console_list_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-13-john.ogness@linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness 4dc64682ad printk: introduce console_list_lock
Currently there exist races in register_console(), where the types
of registered consoles are checked (without holding the console_lock)
and then after acquiring the console_lock, it is assumed that the list
has not changed. Also, some code that performs console_unregister()
make similar assumptions.

It might be possible to fix these races using the console_lock. But
it would require a complex analysis of all console drivers to make
sure that the console_lock is not taken in match() and setup()
callbacks. And we really prefer to split up and reduce the
responsibilities of console_lock rather than expand its complexity.
Therefore, introduce a new console_list_lock to provide full
synchronization for any console list changes.

In addition, also use console_list_lock for synchronization of
console->flags updates. All flags are either static or modified only
during the console registration. There are only two exceptions.

The first exception is CON_ENABLED, which is also modified by
console_start()/console_stop(). Therefore, these functions must
also take the console_list_lock.

The second exception is when the flags are modified by the console
driver init code before the console is registered. These will be
ignored because they are not visible to the rest of the system
via the console_drivers list.

Note that one of the various responsibilities of the console_lock is
also intended to provide console list and console->flags
synchronization. Later changes will update call sites relying on the
console_lock for these purposes. Once all call sites have been
updated, the console_lock will be relieved of synchronizing
console_list and console->flags updates.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87sficwokr.fsf@jogness.linutronix.de
2022-12-02 11:25:00 +01:00
John Ogness a424276093 printk: fix setting first seq for consoles
It used to be that all consoles were synchronized with respect to
which message they were printing. After commit a699449bb1 ("printk:
refactor and rework printing logic"), all consoles have their own
@seq for tracking which message they are on. That commit also changed
how the initial sequence number was chosen. Instead of choosing the
next non-printed message, it chose the sequence number of the next
message that will be added to the ringbuffer.

That change created a possibility that a non-boot console taking over
for a boot console might skip messages if the boot console was behind
and did not have a chance to catch up before being unregistered.

Since it is not known which boot console is the same device, flush
all consoles and, if necessary, start with the message of the enabled
boot console that is the furthest behind. If no boot consoles are
enabled, begin with the next message that will be added to the
ringbuffer.

Also, since boot consoles are meant to be used at boot time, handle
them the same as CON_PRINTBUFFER to ensure that no initial messages
are skipped.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-7-john.ogness@linutronix.de
2022-12-02 11:24:59 +01:00
John Ogness b80ea0e81b printk: move @seq initialization to helper
The code to initialize @seq for a new console needs to consider
more factors when choosing an initial value. Move the code into
a helper function console_init_seq() "as is" so this code can
be expanded without causing register_console() to become too
long. A later commit will implement the additional code.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-6-john.ogness@linutronix.de
2022-12-02 11:24:59 +01:00
John Ogness 1145703612 printk: register_console: use "registered" for variable names
The @bootcon_enabled and @realcon_enabled local variables actually
represent if such console types are registered. In general there
has been a confusion about enabled vs. registered. Incorrectly
naming such variables promotes such confusion.

Rename the variables to _registered.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-5-john.ogness@linutronix.de
2022-12-02 11:24:59 +01:00
John Ogness 6c4afa7914 printk: Prepare for SRCU console list protection
Provide an NMI-safe SRCU protected variant to walk the console list.

Note that all console fields are now set before adding the console
to the list to avoid the console becoming visible by SCRU readers
before being fully initialized.

This is a preparatory change for a new console infrastructure which
operates independent of the console BKL.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-4-john.ogness@linutronix.de
2022-12-02 11:24:59 +01:00
Thomas Gleixner d9a4af5690 printk: Convert console_drivers list to hlist
Replace the open coded single linked list with a hlist so a conversion
to SRCU protected list walks can reuse the existing primitives.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-3-john.ogness@linutronix.de
2022-12-02 11:24:59 +01:00
Nicholas Piggin f9231a996e module: add module_elf_check_arch for module-specific checks
The elf_check_arch() function is also used to test compatibility of
usermode binaries. Kernel modules may have more specific requirements,
for example powerpc would like to test for ABI version compatibility.

Add a weak module_elf_check_arch() that defaults to true, and call it
from elf_validity_check().

Signed-off-by: Jessica Yu <jeyu@kernel.org>
[np: added changelog, adjust name, rebase]
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221128041539.1742489-2-npiggin@gmail.com
2022-12-02 17:54:07 +11:00
Dave Marchevsky 1f82dffc10 bpf: Fix release_on_unlock release logic for multiple refs
Consider a verifier state with three acquired references, all with
release_on_unlock = true:

            idx  0 1 2
  state->refs = [2 4 6]

(with 2, 4, and 6 being the ref ids).

When bpf_spin_unlock is called, process_spin_lock will loop through all
acquired_refs and, for each ref, if it's release_on_unlock, calls
release_reference on it. That function in turn calls
release_reference_state, which removes the reference from state->refs by
swapping the reference state with the last reference state in
refs array and decrements acquired_refs count.

process_spin_lock's loop logic, which is essentially:

  for (i = 0; i < state->acquired_refs; i++) {
    if (!state->refs[i].release_on_unlock)
      continue;
    release_reference(state->refs[i].id);
  }

will fail to release release_on_unlock references which are swapped from
the end. Running this logic on our example demonstrates:

  state->refs = [2 4 6] (start of idx=0 iter)
    release state->refs[0] by swapping w/ state->refs[2]

  state->refs = [6 4]   (start of idx=1)
    release state->refs[1], no need to swap as it's the last idx

  state->refs = [6]     (start of idx=2, loop terminates)

ref_id 6 should have been removed but was skipped.

Fix this by looping from back-to-front, which results in refs that are
candidates for removal being swapped with refs which have already been
examined and kept.

If we modify our initial example such that ref 6 is replaced with ref 7,
which is _not_ release_on_unlock, and loop from the back, we'd see:

  state->refs = [2 4 7] (start of idx=2)

  state->refs = [2 4 7] (start of idx=1)

  state->refs = [2 7]   (start of idx=0, refs 7 and 4 swapped)

  state->refs = [7]     (after idx=0, 7 and 2 swapped, loop terminates)

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 534e86bc6c ("bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}")
Link: https://lore.kernel.org/r/20221201183406.1203621-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-01 19:38:17 -08:00
Dmitry Safonov eb8c507296 jump_label: Prevent key->enabled int overflow
1. With CONFIG_JUMP_LABEL=n static_key_slow_inc() doesn't have any
   protection against key->enabled refcounter overflow.
2. With CONFIG_JUMP_LABEL=y static_key_slow_inc_cpuslocked()
   still may turn the refcounter negative as (v + 1) may overflow.

key->enabled is indeed a ref-counter as it's documented in multiple
places: top comment in jump_label.h, Documentation/staging/static-keys.rst,
etc.

As -1 is reserved for static key that's in process of being enabled,
functions would break with negative key->enabled refcount:
- for CONFIG_JUMP_LABEL=n negative return of static_key_count()
  breaks static_key_false(), static_key_true()
- the ref counter may become 0 from negative side by too many
  static_key_slow_inc() calls and lead to use-after-free issues.

These flaws result in that some users have to introduce an additional
mutex and prevent the reference counter from overflowing themselves,
see bpf_enable_runtime_stats() checking the counter against INT_MAX / 2.

Prevent the reference counter overflow by checking if (v + 1) > 0.
Change functions API to return whether the increment was successful.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-01 15:53:05 -08:00
Zqiang 51f5f78a4f srcu: Make Tiny synchronize_srcu() check for readers
This commit adds lockdep checks for illegal use of synchronize_srcu()
within same-type SRCU read-side critical sections and within normal
RCU read-side critical sections.  It also makes synchronize_srcu()
be a no-op during early boot.

These changes bring Tiny synchronize_srcu() into line with both Tree
synchronize_srcu() and Tiny synchronize_rcu().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: John Ogness <john.ogness@linutronix.de>
2022-12-01 15:49:12 -08:00
Jakub Kicinski 19833ae270 Merge branch 'locking/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull in locking/core from tip (just a single patch) to avoid a conflict
with a jump_label change needed by a TCP cleanup.

Link: https://lore.kernel.org/all/Y4B17nBArWS1Iywo@hirez.programming.kicks-ass.net/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-01 15:36:27 -08:00
Xueqin Luo 3363e0adb3 PM: hibernate: Complain about memory map mismatches during resume
The system memory map can change over a hibernation-restore cycle due
to a defect in the platform firmware, and some of the page frames used
by the kernel before hibernation may not be available any more during
the subsequent restore which leads to the error below.

[  T357] PM: Image loading progress:   0%
[  T357] PM: Read 2681596 kbytes in 0.03 seconds (89386.53 MB/s)
[  T357] PM: Error -14 resuming
[  T357] PM: Failed to load hibernation image, recovering.
[  T357] PM: Basic memory bitmaps freed
[  T357] OOM killer enabled.
[  T357] Restarting tasks ... done.
[  T357] PM: resume from hibernation failed (-14)
[  T357] PM: Hibernation image not present or could not be loaded.

Add an error message to the unpack() function to allow problematic
page frames to be identified and the source of the problem to be
diagnosed more easily. This can save developers quite a bit of
debugging time.

Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
[ rjw: New subject, edited changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-12-01 19:20:14 +01:00
Kees Cook 9db89b4111 exit: Expose "oops_count" to sysfs
Since Oops count is now tracked and is a fairly interesting signal, add
the entry /sys/kernel/oops_count to expose it to userspace.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org
2022-12-01 08:50:38 -08:00
Jann Horn d4ccd54d28 exit: Put an upper limit on how often we can oops
Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.

The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)

So let's panic the system if the kernel is constantly oopsing.

The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)

It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core X86 system running a graphical
environment; both me (in an X86 VM) and Seth (with a distro kernel on
normal hardware in a standard configuration) got numbers in that ballpark.

12 days aren't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
2022-12-01 08:50:38 -08:00
Kees Cook 9360d035a5 panic: Separate sysctl logic from CONFIG_SMP
In preparation for adding more sysctls directly in kernel/panic.c, split
CONFIG_SMP from the logic that adds sysctls.

Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org
2022-12-01 08:50:38 -08:00
Randy Dunlap 2e833c8c8c block: bdev & blktrace: use consistent function doc. notation
Use only one hyphen in kernel-doc notation between the function name
and its short description.

The is the documented kerenl-doc format. It also fixes the HTML
presentation to be consistent with other functions.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 09:16:46 -07:00
Lukas Bulwahn ebe1173283 clockevents: Repair kernel-doc for clockevent_delta2ns()
Since the introduction of clockevents, i.e., commit d316c57ff6
("clockevents: add core functionality"), there has been a mismatch between
the function and the kernel-doc comment for clockevent_delta2ns().

Hence, ./scripts/kernel-doc -none kernel/time/clockevents.c warns about it.

Adjust the kernel-doc comment for clockevent_delta2ns() for make W=1
happiness.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221102091048.15068-1-lukas.bulwahn@gmail.com
2022-12-01 13:35:41 +01:00
Xu Panda 7365df19e8 printk: use strscpy() to instead of strlcpy()
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.

Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/202211301601416229001@zte.com.cn
2022-12-01 11:57:51 +01:00
Jann Horn d6c494e8ee vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.

(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)

Refactor the five copies into a central copy in kernel/time/namespace.c.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
2022-12-01 11:35:40 +01:00
Yonghong Song 3144bfa507 bpf: Fix a compilation failure with clang lto build
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:

  $ make LLVM=1 LLVM_IAS=1 -j
  ...
  ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
  cgroup_storage_map_btf_ids:;
  ^
  make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

In local_storage.c, we have
  BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a9 ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.

Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.

Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
2022-11-30 17:13:25 -08:00
Jason Gunthorpe ce5a23c835 kernel/user: Allow user_struct::locked_vm to be usable for iommufd
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the good accounting to use for ulimit because it is
per-user, and the security sandboxing of locked pages is not supposed to
be per-process. Other places (vfio, vdpa and infiniband) have used
mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this
is only per-process and inconsistent with the new FOLL_LONGTERM users in
the kernel.

Concurrent work is underway to try to put this in a cgroup, so everything
can be consistent and the kernel can provide a FOLL_LONGTERM limit that
actually provides security.

Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30 20:16:49 -04:00
Zheng Yejian c5f31c655b acct: fix potential integer overflow in encode_comp_t()
The integer overflow is descripted with following codes:
  > 317 static comp_t encode_comp_t(u64 value)
  > 318 {
  > 319         int exp, rnd;
    ......
  > 341         exp <<= MANTSIZE;
  > 342         exp += value;
  > 343         return exp;
  > 344 }

Currently comp_t is defined as type of '__u16', but the variable 'exp' is
type of 'int', so overflow would happen when variable 'exp' in line 343 is
greater than 65535.

Link: https://lkml.kernel.org/r/20210515140631.369106-3-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Jinhao <zhangjinhao2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 16:13:18 -08:00
Zheng Yejian 457139f16a acct: fix accuracy loss for input value of encode_comp_t()
Patch series "Fix encode_comp_t()".

Type conversion in encode_comp_t() may look a bit problematic.


This patch (of 2):

See calculation of ac_{u,s}time in fill_ac():
  > ac->ac_utime = encode_comp_t(nsec_to_AHZ(pacct->ac_utime));
  > ac->ac_stime = encode_comp_t(nsec_to_AHZ(pacct->ac_stime));

Return value of nsec_to_AHZ() is always type of 'u64', but it is handled
as type of 'unsigned long' in encode_comp_t, and accuracy loss would
happen on 32-bit platform when 'unsigned long' value is 32-bit-width.

So 'u64' value of encode_comp_t() may look better.

Link: https://lkml.kernel.org/r/20210515140631.369106-1-zhengyejian1@huawei.com
Link: https://lkml.kernel.org/r/20210515140631.369106-2-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Randy Dunlap <rdunlap@infradead.org> # build-tested
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Jinhao <zhangjinhao2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 16:13:18 -08:00
Stephen Brennan 08fc35f31b vmcoreinfo: warn if we exceed vmcoreinfo data size
Though vmcoreinfo is intended to be small, at just one page, useful
information is still added to it, so we risk running out of space. 
Currently there is no runtime check to see whether the vmcoreinfo buffer
has been exhausted.  Add a warning for this case.

Currently, my static checking tool[1] indicates that a good upper bound
for vmcoreinfo size is currently 3415 bytes, but the best time to add
warnings is before the risk becomes too high.

[1] https://github.com/brenns10/kernel_stuff/blob/master/vmcoreinfosize/vmcoreinfosize.py

Link: https://lkml.kernel.org/r/20221027205008.312534-1-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 16:13:17 -08:00
Shakeel Butt f689054aac percpu_counter: add percpu_counter_sum_all interface
The percpu_counter is used for scenarios where performance is more
important than the accuracy.  For percpu_counter users, who want more
accurate information in their slowpath, percpu_counter_sum is provided
which traverses all the online CPUs to accumulate the data.  The reason it
only needs to traverse online CPUs is because percpu_counter does
implement CPU offline callback which syncs the local data of the offlined
CPU.

However there is a small race window between the online CPUs traversal of
percpu_counter_sum and the CPU offline callback.  The offline callback has
to traverse all the percpu_counters on the system to flush the CPU local
data which can be a lot.  During that time, the CPU which is going offline
has already been published as offline to all the readers.  So, as the
offline callback is running, percpu_counter_sum can be called for one
counter which has some state on the CPU going offline.  Since
percpu_counter_sum only traverses online CPUs, it will skip that specific
CPU and the offline callback might not have flushed the state for that
specific percpu_counter on that offlined CPU.

Normally this is not an issue because percpu_counter users can deal with
some inaccuracy for small time window.  However a new user i.e.  mm_struct
on the cleanup path wants to check the exact state of the percpu_counter
through check_mm().  For such users, this patch introduces
percpu_counter_sum_all() which traverses all possible CPUs and it is used
in fork.c:check_mm() to avoid the potential race.

This issue is exposed by the later patch "mm: convert mm's rss stats into
percpu_counter".

Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:58:40 -08:00
Shakeel Butt f1a7941243 mm: convert mm's rss stats into percpu_counter
Currently mm_struct maintains rss_stats which are updated on page fault
and the unmapping codepaths.  For page fault codepath the updates are
cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. 
The reason for caching is performance for multithreaded applications
otherwise the rss_stats updates may become hotspot for such applications.

However this optimization comes with the cost of error margin in the rss
stats.  The rss_stats for applications with large number of threads can be
very skewed.  At worst the error margin is (nr_threads * 64) and we have a
lot of applications with 100s of threads, so the error margin can be very
high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.

Recently we started seeing the unbounded errors for rss_stats for specific
applications which use TCP rx0cp.  It seems like vm_insert_pages()
codepath does not sync rss_stats at all.

This patch converts the rss_stats into percpu_counter to convert the error
margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).  However
this conversion enable us to get the accurate stats for situations where
accuracy is more important than the cpu cost.

This patch does not make such tradeoffs - we can just use
percpu_counter_add_local() for the updates and percpu_counter_sum() (or
percpu_counter_sync() + percpu_counter_read) for the readers.  At the
moment the readers are either procfs interface, oom_killer and memory
reclaim which I think are not performance critical and should be ok with
slow read.  However I think we can make that change in a separate patch.

Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:58:40 -08:00
Alexei Starovoitov c67cae551f bpf: Tighten ptr_to_btf_id checks.
The networking programs typically don't require CAP_PERFMON, but through kfuncs
like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In
such case enforce CAP_PERFMON.
Also make sure that only GPL programs can access kernel data structures.
All kfuncs require GPL already.

Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and
different name for the same check only causes confusion.

Fixes: fd264ca020 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx")
Fixes: 50c6b8a9ae ("selftests/bpf: Add a test for btf_type_tag "percpu"")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
2022-11-30 15:33:48 -08:00
Vlastimil Babka 149b6fa228 mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED
As explained in [1], we would like to remove SLOB if possible.

- There are no known users that need its somewhat lower memory footprint
  so much that they cannot handle SLUB (after some modifications by the
  previous patches) instead.

- It is an extra maintenance burden, and a number of features are
  incompatible with it.

- It blocks the API improvement of allowing kfree() on objects allocated
  via kmem_cache_alloc().

As the first step, rename the CONFIG_SLOB option in the slab allocator
configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB
depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code
churn. This will cause existing .config files and defconfigs with
CONFIG_SLOB=y to silently switch to the default (and recommended
replacement) SLUB, while still allowing SLOB to be configured by anyone
that notices and needs it. But those should contact the slab maintainers
and linux-mm@kvack.org as explained in the updated help. With no valid
objections, the plan is to update the existing defconfigs to SLUB and
remove SLOB in a few cycles.

To make SLUB more suitable replacement for SLOB, a CONFIG_SLUB_TINY
option was introduced to limit SLUB's memory overhead.
There is a number of defconfigs specifying CONFIG_SLOB=y. As part of
this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY.

[1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/

Cc: Russell King <linux@armlinux.org.uk>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Conor Dooley <conor@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210
Acked-by: Arnd Bergmann <arnd@arndb.de> # arm
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
2022-12-01 00:09:20 +01:00
Andrew Morton a38358c934 Merge branch 'mm-hotfixes-stable' into mm-stable 2022-11-30 14:58:42 -08:00
Paul E. McKenney 87492c06e6 Merge branches 'doc.2022.10.20a', 'fixes.2022.10.21a', 'lazy.2022.11.30a', 'srcunmisafe.2022.11.09a', 'torture.2022.10.18c' and 'torturescript.2022.10.20a' into HEAD
doc.2022.10.20a: Documentation updates.
fixes.2022.10.21a: Miscellaneous fixes.
lazy.2022.11.30a: Lazy call_rcu() and NOCB updates.
srcunmisafe.2022.11.09a: NMI-safe SRCU readers.
torture.2022.10.18c: Torture-test updates.
torturescript.2022.10.20a: Torture-test scripting updates.
2022-11-30 13:20:05 -08:00
Uladzislau Rezki a7e30c0e9a workqueue: Make queue_rcu_work() use call_rcu_hurry()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them.  This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing.  This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.

This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.

Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory.  If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order.  Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.

However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked.  It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu().  The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU.  After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.

Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.

And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.

Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-30 13:17:05 -08:00
Lukas Bulwahn f9b4dc920d notifier: repair slips in kernel-doc comments
Invoking ./scripts/kernel-doc -none kernel/notifier.c warns:

  kernel/notifier.c:71: warning: Excess function parameter 'returns' description in 'notifier_call_chain'
  kernel/notifier.c:119: warning: Function parameter or member 'v' not described in 'notifier_call_chain_robust'

These two warning are easy to fix, as they are just due to some minor slips
that makes the comment not follow kernel-doc's syntactic expectation.

Fix those minor slips in kernel-doc comments for make W=1 happiness.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-11-30 19:32:30 +01:00
Jens Axboe c62256dda3 Revert "blk-cgroup: Flush stats at blkgs destruction path"
This reverts commit dae590a6c9.

We've had a few reports on this causing a crash at boot time, because
of a reference issue. While this problem seemginly did exist before
the patch and needs solving separately, this patch makes it a lot
easier to trigger.

Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/69af7ccb-6901-c84c-0e95-5682ccfb750c@acm.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 08:25:46 -07:00
Yang Yingliang 9049e1ca41 genirq/irqdesc: Don't try to remove non-existing sysfs files
Fault injection tests trigger warnings like this:

  kernfs: can not remove 'chip_name', no directory
  WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0
  RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0
  Call Trace:
   <TASK>
   remove_files.isra.1+0x3f/0xb0
   sysfs_remove_group+0x68/0xe0
   sysfs_remove_groups+0x41/0x70
   __kobject_del+0x45/0xc0
   kobject_del+0x29/0x40
   free_desc+0x42/0x70
   irq_free_descs+0x5e/0x90

The reason is that the interrupt descriptor sysfs handling does not roll
back on a failing kobject_add() during allocation. If the descriptor is
freed later on, kobject_del() is invoked with a not added kobject resulting
in the above warnings.

A proper rollback in case of a kobject_add() failure would be the straight
forward solution. But this is not possible due to the way how interrupt
descriptor sysfs handling works.

Interrupt descriptors are allocated before sysfs becomes available. So the
sysfs files for the early allocated descriptors are added later in the boot
process. At this point there can be nothing useful done about a failing
kobject_add(). For consistency the interrupt descriptor allocation always
treats kobject_add() failures as non-critical and just emits a warning.

To solve this problem, keep track in the interrupt descriptor whether
kobject_add() was successful or not and make the invocation of
kobject_del() conditional on that.

[ tglx: Massage changelog, comments and use a state bit. ]

Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com
2022-11-30 14:52:11 +01:00
Paul E. McKenney 0cd7e350ab rcu: Make SRCU mandatory
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build
failures.  This causes trouble for deep embedded systems.  But given
that there are more than 25 instances of "select SRCU" in the kernel,
it is hard to believe that there are many kernels running in production
without SRCU.  This commit therefore makes SRCU mandatory.  The SRCU
Kconfig option remains for backwards compatibility, and will be removed
when it is no longer used.

[ paulmck: Update per kernel test robot feedback. ]

Reported-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: John Ogness <john.ogness@linutronix.de>
2022-11-29 15:00:06 -08:00
Joel Fernandes (Google) 405d8e91f0 rcu/rcutorture: Use call_rcu_hurry() where needed
call_rcu() changes to save power will change the behavior of rcutorture
tests. Use the call_rcu_hurry() API instead which reverts to the old
behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 723df859d8 rcu/rcuscale: Use call_rcu_hurry() for async reader test
rcuscale uses call_rcu() to queue async readers. With recent changes to
save power, the test will have fewer async readers in flight. Use the
call_rcu_hurry() API instead to revert to the old behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 7651d6b250 rcu/sync: Use call_rcu_hurry() instead of call_rcu
call_rcu() changes to save power will slow down rcu sync. Use the
call_rcu_hurry() API instead which reverts to the old behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 084e04fff1 rcuscale: Add laziness and kfree tests
This commit adds 2 tests to rcuscale.  The first one is a startup test
to check whether we are not too lazy or too hard working.  The second
one causes kfree_rcu() itself to use call_rcu() and checks memory
pressure. Testing indicates that the new call_rcu() keeps memory pressure
under control roughly as well as does kfree_rcu().

[ paulmck: Apply checkpatch feedback. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Vineeth Pillai c945b4da7a rcu: Shrinker for lazy rcu
The shrinker is used to speed up the free'ing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective. Test is introduced in a later patch.

Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Joel Fernandes (Google) 3d222a0c0c rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
This consolidates the code a bit and makes it cleaner. Functionally it
is the same.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Joel Fernandes (Google) 3cb278e73b rcu: Make call_rcu() lazy to save power
Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of power consumed due
to RCU requests that happen when system is lightly loaded or idle.

By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API call_rcu_hurry() is provided for the few users,
for example synchronize_rcu(), that need the old behavior.

The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Also memory pressure will
flush it in a future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

[ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
[ paulmck: Apply Zqiang feedback. ]
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Suggested-by: Paul McKenney <paulmck@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:23 -08:00
Jakub Kicinski f2bb566f5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c
  927cbb478a ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0a ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29 13:04:52 -08:00
Linus Torvalds 01f856ae6d Including fixes from bpf, can and wifi.
Current release - new code bugs:
 
  - eth: mlx5e:
    - use kvfree() in mlx5e_accel_fs_tcp_create()
    - MACsec, fix RX data path 16 RX security channel limit
    - MACsec, fix memory leak when MACsec device is deleted
    - MACsec, fix update Rx secure channel active field
    - MACsec, fix add Rx security association (SA) rule memory leak
 
 Previous releases - regressions:
 
  - wifi: cfg80211: don't allow multi-BSSID in S1G
 
  - stmmac: set MAC's flow control register to reflect current settings
 
  - eth: mlx5:
    - E-switch, fix duplicate lag creation
    - fix use-after-free when reverting termination table
 
 Previous releases - always broken:
 
  - ipv4: fix route deletion when nexthop info is not specified
 
  - bpf: fix a local storage BPF map bug where the value's spin lock
    field can get initialized incorrectly
 
  - tipc: re-fetch skb cb after tipc_msg_validate
 
  - wifi: wilc1000: fix Information Element parsing
 
  - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
 
  - sctp: fix memory leak in sctp_stream_outq_migrate()
 
  - can: can327: fix potential skb leak when netdev is down
 
  - can: add number of missing netdev freeing on error paths
 
  - aquantia: do not purge addresses when setting the number of rings
 
  - wwan: iosm:
    - fix incorrect skb length leading to truncated packet
    - fix crash in peek throughput test due to skb UAF
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOGOdYACgkQMUZtbf5S
 IrsknQ//SAoOyDOEu15YzOt8hAupLKoF6MM+D0dwwTEQZLf7IVXCjPpkKtVh7Si7
 YCBoyrqrDs7vwaUrVoKY19Amwov+EYrHCpdC+c7wdZ7uxTaYfUbJJUGmxYOR179o
 lV1+1Aiqg9F9C6CUsmZ5lDN2Yb7/uPDBICIV8LM+VzJAtXjurBVauyMwAxLxPOAr
 cgvM+h5xzE7DXMF2z8R/mUq5MSIWoJo9hy2UwbV+f2liMTQuw9rwTbyw3d7+H/6p
 xmJcBcVaABjoUEsEhld3NTlYbSEnlFgCQBfDWzf2e4y6jBxO0JepuIc7SZwJFRJY
 XBqdsKcGw5RkgKbksKUgxe126XFX0SUUQEp0UkOIqe15k7eC2yO9uj1gRm6OuV4s
 J94HKzHX9WNV5OQ790Ed2JyIJScztMZlNFVJ/cz2/+iKR42xJg6kaO6Rt2fobtmL
 VC2cH+RfHzLl+2+7xnfzXEDgFePSBlA02Aq1wihU3zB3r7WCFHchEf9T7sGt1QF0
 03R+8E3+N2tYqphPAXyDoy6kXQJTPxJHAe1FNHJlwgfieUDEWZi/Pm+uQrKIkDeo
 oq9MAV2QBNSD1w4wl7cXfvicO5kBr/OP6YBqwkpsGao2jCSIgkWEX2DRrUaLczXl
 5/Z+m/gCO5tAEcVRYfMivxUIon//9EIhbErVpHTlNWpRHk24eS4=
 =0Lnw
 -----END PGP SIGNATURE-----

Merge tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and wifi.

  Current release - new code bugs:

   - eth: mlx5e:
      - use kvfree() in mlx5e_accel_fs_tcp_create()
      - MACsec, fix RX data path 16 RX security channel limit
      - MACsec, fix memory leak when MACsec device is deleted
      - MACsec, fix update Rx secure channel active field
      - MACsec, fix add Rx security association (SA) rule memory leak

  Previous releases - regressions:

   - wifi: cfg80211: don't allow multi-BSSID in S1G

   - stmmac: set MAC's flow control register to reflect current settings

   - eth: mlx5:
      - E-switch, fix duplicate lag creation
      - fix use-after-free when reverting termination table

  Previous releases - always broken:

   - ipv4: fix route deletion when nexthop info is not specified

   - bpf: fix a local storage BPF map bug where the value's spin lock
     field can get initialized incorrectly

   - tipc: re-fetch skb cb after tipc_msg_validate

   - wifi: wilc1000: fix Information Element parsing

   - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE

   - sctp: fix memory leak in sctp_stream_outq_migrate()

   - can: can327: fix potential skb leak when netdev is down

   - can: add number of missing netdev freeing on error paths

   - aquantia: do not purge addresses when setting the number of rings

   - wwan: iosm:
      - fix incorrect skb length leading to truncated packet
      - fix crash in peek throughput test due to skb UAF"

* tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
  MAINTAINERS: Update maintainer list for chelsio drivers
  ionic: update MAINTAINERS entry
  sctp: fix memory leak in sctp_stream_outq_migrate()
  packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
  net/mlx5: Lag, Fix for loop when checking lag
  Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"
  net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions
  net: tun: Fix use-after-free in tun_detach()
  net: mdiobus: fix unbalanced node reference count
  net: hsr: Fix potential use-after-free
  tipc: re-fetch skb cb after tipc_msg_validate
  mptcp: fix sleep in atomic at close time
  mptcp: don't orphan ssk in mptcp_close()
  dsa: lan9303: Correct stat name
  ipv4: Fix route deletion when nexthop info is not specified
  net: wwan: iosm: fix incorrect skb length
  net: wwan: iosm: fix crash in peek throughput test
  net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
  net: wwan: iosm: fix kernel test robot reported error
  ...
2022-11-29 09:52:10 -08:00
Peter Zijlstra 517e6a301f perf: Fix perf_pending_task() UaF
Per syzbot it is possible for perf_pending_task() to run after the
event is free()'d. There are two related but distinct cases:

 - the task_work was already queued before destroying the event;
 - destroying the event itself queues the task_work.

The first cannot be solved using task_work_cancel() since
perf_release() itself might be called from a task_work (____fput),
which means the current->task_works list is already empty and
task_work_cancel() won't be able to find the perf_pending_task()
entry.

The simplest alternative is extending the perf_event lifetime to cover
the task_work.

The second is just silly, queueing a task_work while you know the
event is going away makes no sense and is easily avoided by
re-arranging how the event is marked STATE_DEAD and ensuring it goes
through STATE_OFF on the way down.

Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
2022-11-29 17:42:49 +01:00
Jakub Kicinski d6dc62fca6 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY4AC5QAKCRDbK58LschI
 g1e0AQCfAqduTy7mYd02jDNCV0wLphNp9FbPiP9OrQT37ABpKAEA1ulj1X59bX3d
 HnZdDKuatcPZT9MV5hDLM7MFJ9GjOA4=
 =fNmM
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
bpf-next 2022-11-25

We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).

The main changes are:

1) Support for user defined BPF objects: the use case is to allocate own
   objects, build own object hierarchies and use the building blocks to
   build own data structures flexibly, for example, linked lists in BPF,
   from Kumar Kartikeya Dwivedi.

2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
   from Yonghong Song.

3) Add support storing struct task_struct objects as kptrs in maps,
   from David Vernet.

4) Batch of BPF map documentation improvements, from Maryam Tahhan
   and Donald Hunter.

5) Improve BPF verifier to propagate nullness information for branches
   of register to register comparisons, from Eduard Zingerman.

6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
   from Hou Tao.

7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
   given it is not the case for them, from Alexei Starovoitov.

8) Improve BPF verifier's realloc handling to better play along with dynamic
   runtime analysis tools like KASAN and friends, from Kees Cook.

9) Remove legacy libbpf mode support from bpftool,
   from Sahid Orentino Ferdjaoui.

10) Rework zero-len skb redirection checks to avoid potentially breaking
    existing BPF test infra users, from Stanislav Fomichev.

11) Two small refactorings which are independent and have been split out
    of the XDP queueing RFC series, from Toke Høiland-Jørgensen.

12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.

13) Documentation on how to run BPF CI without patch submission,
    from Daniel Müller.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-28 19:42:17 -08:00
Linus Torvalds cb525a6513 tracing fixes for 6.1:
- Fix osnoise duration type to 64bit not 32bit.
 
 - Have histogram triggers be able to handle an unexpected NULL pointer
   for the record event, that can happen when the histogram first starts up.
 
 - Clear out ring buffers when dynamic events are removed, as the type
   that is saved in the ring buffer is used to read the event, and a
   stale type that is reused by another event could cause use after free
   issues.
 
 - Trivial comment fix.
 
 - Fix memory leak in user_event_create().
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY4UPShQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qo3qAQCz2R814gCo54WEDcvRUmAmF9BSE2b5
 wtaDfeyXQYbaMwEA4+cuNoMMlQvbVbmcss/Uh7RpRCB8+24iypeiEZNtIgg=
 =GPrG
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix osnoise duration type to 64bit not 32bit

 - Have histogram triggers be able to handle an unexpected NULL pointer
   for the record event, which can happen when the histogram first
   starts up

 - Clear out ring buffers when dynamic events are removed, as the type
   that is saved in the ring buffer is used to read the event, and a
   stale type that is reused by another event could cause use after free
   issues

 - Trivial comment fix

 - Fix memory leak in user_event_create()

* tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Free buffers when a used dynamic event is removed
  tracing: Add tracing_reset_all_online_cpus_unlocked() function
  tracing: Fix race where histograms can be called before the event
  tracing/osnoise: Fix duration type
  tracing/user_events: Fix memory leak in user_event_create()
  tracing/hist: add in missing * in comment blocks
2022-11-28 14:42:29 -08:00
Beau Belgrave 4bded7af8b tracing/user_events: Fix call print_fmt leak
If user_event_trace_register() fails within user_event_parse() the
call's print_fmt member is not freed. Add kfree call to fix this.

Link: https://lkml.kernel.org/r/20221123183248.554-1-beaub@linux.microsoft.com

Fixes: aa3b2b4c66 ("user_events: Add print_fmt generation support for basic types")
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-28 15:58:56 -05:00
Li Huafei 0c76ef3f26 kprobes: Fix check for probe enabled in kill_kprobe()
In kill_kprobe(), the check whether disarm_kprobe_ftrace() needs to be
called always fails. This is because before that we set the
KPROBE_FLAG_GONE flag for kprobe so that "!kprobe_disabled(p)" is always
false.

The disarm_kprobe_ftrace() call introduced by commit:

  0cb2f1372b ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")

to fix the NULL pointer reference problem. When the probe is enabled, if
we do not disarm it, this problem still exists.

Fix it by putting the probe enabled check before setting the
KPROBE_FLAG_GONE flag.

Link: https://lore.kernel.org/all/20221126114316.201857-1-lihuafei1@huawei.com/

Fixes: 3031313eb3 ("kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-28 21:20:47 +09:00
Linus Torvalds 5afcab2217 - Two more fixes to the perf sigtrap handling:
- output the address in the sample only when it has been requested
  - handle the case where user-only events can hit in kernel and thus
    upset the sigtrap sanity checking
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmODP5wACgkQEsHwGGHe
 VUqmBhAAnxQTCg4agqtKQyHAnFGZb09dE6pYghcpJzF2sOh9sffCYl9XYd53AGVS
 z7nTGagPmT0AxdtjbFvDnxXuvpPVfcS9rIKG6oBzPUcXSVKr9PsqzZJJiAibg+uP
 OSkC75tqNgBTnyDSGc+C5yd7LmWV98Y66t0fhwGUTHpAPY+vkbvYvi8yJYfVqZsA
 Fc/6j+MJMFO1/52GoP42HVs2F7oyuXceJAAVhyYkYitC23mvrK2BRTZAkKsfRKBk
 Un5gFMo+iFUTrOMo48f8BMNEtgfC5iAPPK2ZlFGwTEbDQ98+inf461s8jZq/Yeg7
 KcdHw0Tk1rynBrFlbbUQArDEPyqzmj10FMVZ5VgUNdMneGEy+EUdpcoilX0ZccaA
 NMnNDMPpU4S0cbuGn43Y2Jlv4QA9oECHqZ8NndAoJ5domR9g12WyVibD5W/bYBYQ
 4Z2lv6luOCfWd2xQgT4s0tfMjwQb7NKmTrT4No9je7/IBDS5YjzBjPriAd4kbhFj
 MUt6KKLK0fN4ASmxajXp6iYRALL2nvgJ2w1gzO0pcjMGCJ1WZBKDdfbNLbgulCwq
 GJHNc55F7lWmkMNFLbWb28YLHjXGskQj4J/RARFszQq9khQnPEFj4X69StQfrghT
 77rJF9CrS3KnxG4gCNWe5qvfQb9+Fi7IhwyWDI9VR0DITxGogvY=
 =9CQa
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:
 "Two more fixes to the perf sigtrap handling:

   - output the address in the sample only when it has been requested

   - handle the case where user-only events can hit in kernel and thus
     upset the sigtrap sanity checking"

* tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Consider OS filter fail
  perf: Fixup SIGTRAP and sample_flags interaction
2022-11-27 11:53:41 -08:00
Linus Torvalds 88817acb8b Power management fixes for 6.1-rc7
- Revert a recent schedutil cpufreq governor change that introduced
    a performace regression on Pixel 6 (Sam Wu).
 
  - Fix amd-pstate driver initialization after running the kernel via
    kexec (Wyes Karny).
 
  - Turn amd-pstate into a built-in driver which allows it to take
    precedence over acpi-cpufreq by default on supported systems and
    amend it with a mechanism to disable this behavior (Perry Yuan).
 
  - Update amd-pstate documentation in accordance with the other changes
    made to it (Perry Yuan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmOA9bgSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxfpsP/j1JTOy3tPCsvPgI+b/RCWq33sm5PpaU
 iFYpfxE9uexy2FyGgsZSqp4fMpMw9fCjiCpv/yfxtLmKccmOyW2kG7IYnNuV5QuN
 FnN6B9WfsLohbuk87H07GpuGZTy4OISjmEGE/gTHXhY6n0eQ9utiX9b8zQuRdhsO
 Gz7N0GaYrDh1lo9HeHNXiXbgIuVSj8iRyzWjJwdeha8NhZxwzQ3tZz5zNcBlCeJ+
 sFtopCWqDZ5DbmKQy4rYXRuW8zC6UUQ4opqcpG5+0e5e51aa0lexJtUqxcyuKqHd
 8plSZEEofCd7XecsvdXGzxhN29FFFIPkHZHic6/GQQou0gC8uGdp/LQMCoiZ3sJ1
 3YblYJkYXMCsi9COjCAesMwy6PENL98XLfQYVmfT/EEHPgyxG1hD0a/xf0UvXzsf
 V2h/gX+gfnDspCtStkrprapQb1WTCgA8d4tOB6x47EpCfY+H3HrNPFNIpJ93ErQJ
 rAe8jAARh1qAbOEWYipc0XmnJLV4aDBAkR+6uE7GHXu6ESOUfvOzpVrQ5d3x4Ohk
 a+vyupwiRGxRRoBjuGFyf/v15g6/UeLi5A/4Qi5SpCM7Uet7RnE/ltxjbgZJZgfG
 gtBJvNaFExWJI5hFZIpz4hubsWvFMaF6SwlXnVMdYGuKZl0w119wAHFLkyO4EmHk
 5RacLxDHEJ4q
 =+qZV
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These revert a recent change in the schedutil cpufreq governor that
  had not been expected to make any functional difference, but turned
  out to introduce a performance regression, fix an initialization issue
  in the amd-pstate driver and make it actually replace the venerable
  ACPI cpufreq driver on the supported systems by default.

  Specifics:

   - Revert a recent schedutil cpufreq governor change that introduced a
     performace regression on Pixel 6 (Sam Wu)

   - Fix amd-pstate driver initialization after running the kernel via
     kexec (Wyes Karny)

   - Turn amd-pstate into a built-in driver which allows it to take
     precedence over acpi-cpufreq by default on supported systems and
     amend it with a mechanism to disable this behavior (Perry Yuan)

   - Update amd-pstate documentation in accordance with the other
     changes made to it (Perry Yuan)"

* tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Documentation: add amd-pstate kernel command line options
  Documentation: amd-pstate: add driver working mode introduction
  cpufreq: amd-pstate: add amd-pstate driver parameter for mode selection
  cpufreq: amd-pstate: change amd-pstate driver to be built-in type
  cpufreq: amd-pstate: cpufreq: amd-pstate: reset MSR_AMD_PERF_CTL register at init
  Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
2022-11-25 12:43:33 -08:00
Linus Torvalds 0b1dcc2cf5 24 hotfixes. 8 marked cc:stable and 16 for post-6.0 issues.
There have been a lot of hotfixes this cycle, and this is quite a large
 batch given how far we are into the -rc cycle.  Presumably a reflection of
 the unusually large amount of MM material which went into 6.1-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
 jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
 WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
 =swD5
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
  issues.

  There have been a lot of hotfixes this cycle, and this is quite a
  large batch given how far we are into the -rc cycle. Presumably a
  reflection of the unusually large amount of MM material which went
  into 6.1-rc1"

* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  test_kprobes: fix implicit declaration error of test_kprobes
  nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
  mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
  mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
  swapfile: fix soft lockup in scan_swap_map_slots
  hugetlb: fix __prep_compound_gigantic_page page flag setting
  kfence: fix stack trace pruning
  proc/meminfo: fix spacing in SecPageTables
  mm: multi-gen LRU: retry folios written back while isolated
  mailmap: update email address for Satya Priya
  mm/migrate_device: return number of migrating pages in args->cpages
  kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
  MAINTAINERS: update Alex Hung's email address
  mailmap: update Alex Hung's email address
  mm: mmap: fix documentation for vma_mas_szero
  mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
  mm/memory: return vm_fault_t result from migrate_to_ram() callback
  mm: correctly charge compressed memory to its memcg
  ipc/shm: call underlying open/close vm_ops
  gcov: clang: fix the buffer overflow issue
  ...
2022-11-25 10:18:25 -08:00
Al Viro de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Daniel Jordan 57ddfecc72 padata: Fix list iterator in padata_do_serial()
list_for_each_entry_reverse() assumes that the iterated list is nonempty
and that every list_head is embedded in the same type, but its use in
padata_do_serial() breaks both rules.

This doesn't cause any issues now because padata_priv and padata_list
happen to have their list fields at the same offset, but we really
shouldn't be relying on that.

Fixes: bfde23ce20 ("padata: unbind parallel jobs from specific CPUs")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-11-25 17:39:18 +08:00
Daniel Jordan 34c3a47d20 padata: Always leave BHs disabled when running ->parallel()
A deadlock can happen when an overloaded system runs ->parallel() in the
context of the current task:

    padata_do_parallel
      ->parallel()
        pcrypt_aead_enc/dec
          padata_do_serial
            spin_lock(&reorder->lock) // BHs still enabled
              <interrupt>
                ...
                  __do_softirq
                    ...
                      padata_do_serial
                        spin_lock(&reorder->lock)

It's a bug for BHs to be on in _do_serial as Steffen points out, so
ensure they're off in the "current task" case like they are in
padata_parallel_worker to avoid this situation.

Reported-by: syzbot+bc05445bc14148d51915@syzkaller.appspotmail.com
Fixes: 4611ce2246 ("padata: allocate work structures for parallel jobs from a pool")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-11-25 17:39:18 +08:00
Alexei Starovoitov c6b0337f01 bpf: Don't mark arguments to fentry/fexit programs as trusted.
The PTR_TRUSTED flag should only be applied to pointers where the verifier can
guarantee that such pointers are valid.
The fentry/fexit/fmod_ret programs are not in this category.
Only arguments of SEC("tp_btf") and SEC("iter") programs are trusted
(which have BPF_TRACE_RAW_TP and BPF_TRACE_ITER attach_type correspondingly)

This bug was masked because convert_ctx_accesses() was converting trusted
loads into BPF_PROBE_MEM loads. Fix it as well.
The loads from trusted pointers don't need exception handling.

Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221124215314.55890-1-alexei.starovoitov@gmail.com
2022-11-24 23:47:09 +01:00
Yonghong Song 9bb00b2895 bpf: Add kfunc bpf_rcu_read_lock/unlock()
Add two kfunc's bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfunc's
can be used for all program types. The following is an example about how
rcu pointer are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock().

  struct task_struct {
    ...
    struct task_struct              *last_wakee;
    struct task_struct __rcu        *real_parent;
    ...
  };

Let us say prog does 'task = bpf_get_current_task_btf()' to get a
'task' pointer. The basic rules are:
  - 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock
    region. This is to simulate rcu_dereference() operation. The
    'real_parent' is marked as MEM_RCU only if (1). task->real_parent is
    inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So
    MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region.
  - 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock
    region since it tries to access rcu protected memory.
  - the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general
    it is not clear whether the object pointed by 'last_wakee' is valid or
    not even inside bpf_rcu_read_lock region.

The verifier will reset all rcu pointer register states to untrusted
at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer
won't be trusted any more outside the bpf_rcu_read_lock() region.

The current implementation does not support nested rcu read lock
region in the prog.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24 12:54:13 -08:00
Yonghong Song 01685c5bdd bpf: Introduce might_sleep field in bpf_func_proto
Introduce bpf_func_proto->might_sleep to indicate a particular helper
might sleep. This will make later check whether a helper might be
sleepable or not easier.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24 12:27:13 -08:00
Thomas Gleixner f571faf6e4 timers: Provide timer_shutdown[_sync]()
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.

In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.

Expose new interfaces for this: timer_shutdown_sync() and timer_shutdown().

timer_shutdown_sync() has the same functionality as timer_delete_sync()
plus the NULL-ification of the timer function.

timer_shutdown() has the same functionality as timer_delete() plus the
NULL-ification of the timer function.

In both cases the rearming of the timer is prevented by silently discarding
rearm attempts due to timer->function being NULL.

Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.314230270@linutronix.de
2022-11-24 15:09:12 +01:00
Thomas Gleixner 0cc04e8045 timers: Add shutdown mechanism to the internal functions
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.

In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.

Add a shutdown argument to the relevant internal functions which makes the
actual deactivation code set timer->function to NULL which in turn prevents
rearming of the timer.

Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.253883224@linutronix.de
2022-11-24 15:09:12 +01:00
Thomas Gleixner 8553b5f277 timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.

In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.

Split the inner workings of try_do_del_timer_sync(), del_timer_sync() and
del_timer() into helper functions to prepare for implementing the shutdown
functionality.

No functional change.

Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.195147423@linutronix.de
2022-11-24 15:09:12 +01:00
Thomas Gleixner d02e382cef timers: Silently ignore timers with a NULL function
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.

In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.

In preparation for that replace the warnings in the relevant code paths
with checks for timer->function == NULL. If the pointer is NULL, then
discard the rearm request silently.

Add debug_assert_init() instead of the WARN_ON_ONCE(!timer->function)
checks so that debug objects can warn about non-initialized timers.

The warning of debug objects does not warn if timer->function == NULL.  It
warns when timer was not initialized using timer_setup[_on_stack]() or via
DEFINE_TIMER(). If developers fail to enable debug objects and then waste
lots of time to figure out why their non-initialized timer is not firing,
they deserve it. Same for initializing a timer with a NULL function.

Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/87wn7kdann.ffs@tglx
2022-11-24 15:09:11 +01:00
Thomas Gleixner bb663f0f3c timers: Rename del_timer() to timer_delete()
The timer related functions do not have a strict timer_ prefixed namespace
which is really annoying.

Rename del_timer() to timer_delete() and provide del_timer()
as a wrapper. Document that del_timer() is not for new code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201625.015535022@linutronix.de
2022-11-24 15:09:11 +01:00
Thomas Gleixner 9b13df3fb6 timers: Rename del_timer_sync() to timer_delete_sync()
The timer related functions do not have a strict timer_ prefixed namespace
which is really annoying.

Rename del_timer_sync() to timer_delete_sync() and provide del_timer_sync()
as a wrapper. Document that del_timer_sync() is not for new code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.954785441@linutronix.de
2022-11-24 15:09:11 +01:00
Thomas Gleixner 168f6b6ffb timers: Use del_timer_sync() even on UP
del_timer_sync() is assumed to be pointless on uniprocessor systems and can
be mapped to del_timer() because in theory del_timer() can never be invoked
while the timer callback function is executed.

This is not entirely true because del_timer() can be invoked from interrupt
context and therefore hit in the middle of a running timer callback.

Contrary to that del_timer_sync() is not allowed to be invoked from
interrupt context unless the affected timer is marked with TIMER_IRQSAFE.
del_timer_sync() has proper checks in place to detect such a situation.

Give up on the UP optimization and make del_timer_sync() unconditionally
available.

Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201624.888306160@linutronix.de
2022-11-24 15:09:11 +01:00
Thomas Gleixner 14f043f134 timers: Update kernel-doc for various functions
The kernel-doc of timer related functions is partially uncomprehensible
word salad. Rewrite it to make it useful.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.828703870@linutronix.de
2022-11-24 15:09:11 +01:00
Thomas Gleixner 82ed6f7ef5 timers: Replace BUG_ON()s
The timer code still has a few BUG_ON()s left which are crashing the kernel
in situations where it still can recover or simply refuse to take an
action.

Remove the one in the hotplug callback which checks for the CPU being
offline. If that happens then the whole hotplug machinery will explode in
colourful ways.

Replace the rest with WARN_ON_ONCE() and conditional returns where
appropriate.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.769128888@linutronix.de
2022-11-24 15:09:11 +01:00
Thomas Gleixner 9a5a305686 timers: Get rid of del_singleshot_timer_sync()
del_singleshot_timer_sync() used to be an optimization for deleting timers
which are not rearmed from the timer callback function.

This optimization turned out to be broken and got mapped to
del_timer_sync() about 17 years ago.

Get rid of the undocumented indirection and use del_timer_sync() directly.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.706987932@linutronix.de
2022-11-24 15:09:10 +01:00
Ravi Bangoria bf480f9385 perf/core: Don't allow grouping events from different hw pmus
Event group from different hw pmus does not make sense and thus perf
has never allowed it. However, with recent rewrite that restriction
has been inadvertently removed. Fix it.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221122080326.228-1-ravi.bangoria@amd.com
2022-11-24 11:09:19 +01:00