[ Upstream commit bccdd808902f8c677317cec47c306e42b93b849e ]
In some cases running with the test-ww_mutex code, I was seeing
odd behavior where sometimes it seemed flush_workqueue was
returning before all the work threads were finished.
Often this would cause strange crashes as the mutexes would be
freed while they were being used.
Looking at the code, there is a lifetime problem as the
controlling thread that spawns the work allocates the
"struct stress" structures that are passed to the workqueue
threads. Then when the workqueue threads are finished,
they free the stress struct that was passed to them.
Unfortunately the workqueue work_struct node is in the stress
struct. Which means the work_struct is freed before the work
thread returns and while flush_workqueue is waiting.
It seems like a better idea to have the controlling thread
both allocate and free the stress structures, so that we can
be sure we don't corrupt the workqueue by freeing the structure
prematurely.
So this patch reworks the test to do so, and with this change
I no longer see the early flush_workqueue returns.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230922043616.19282-3-jstultz@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f032c53bea6d2057c14553832d846be2f151cfb2 ]
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.
int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
const char *name, const char *loc, ...)
Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 17fc8084aa8f9d5235f252fc3978db657dd77e92 ]
We consistently switched from kmalloc() to vmalloc() in module
decompression to prevent potential memory allocation failures with large
modules, however vmalloc() is not as memory-efficient and fast as
kmalloc().
Since we don't know in general the size of the workspace required by the
decompression algorithm, it is more reasonable to use kvmalloc()
consistently, also considering that we don't have special memory
requirements here.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fd381ce60a2d79cc967506208085336d3d268ae0 ]
When there are concurrent uref release and bpf timer init operations,
the following sequence diagram is possible. It will break the guarantee
provided by bpf_timer: bpf_timer will still be alive after userspace
application releases or unpins the map. It also will lead to kmemleak
for old kernel version which doesn't release bpf_timer when map is
released.
bpf program X:
bpf_timer_init()
lock timer->lock
read timer->timer as NULL
read map->usercnt != 0
process Y:
close(map_fd)
// put last uref
bpf_map_put_uref()
atomic_dec_and_test(map->usercnt)
array_map_free_timers()
bpf_timer_cancel_and_free()
// just return
read timer->timer is NULL
t = bpf_map_kmalloc_node()
timer->timer = t
unlock timer->lock
Fix the problem by checking map->usercnt after timer->timer is assigned,
so when there are concurrent uref release and bpf timer init, either
bpf_timer_cancel_and_free() from uref release reads a no-NULL timer
or the newly-added atomic64_read() returns a zero usercnt.
Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
in bpf_timer_cancel_and_free() are not protected by a lock, so add
a memory barrier to guarantee the order between map->usercnt and
timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless
read of timer->timer in bpf_timer_cancel_and_free().
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 67e18e132f0fd738f8c8cac3aa1420312073f795 ]
Without the newline character, the log may not be printed immediately
after the error occurs.
Fixes: ca376a9374 ("livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230914072644.4098857-1-zhengyejian1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7ddc21e317b360c3444de3023bcc83b85fabae2f ]
In a high-load arm64 environment, the pcrypt_aead01 test in LTP can lead
to system UAF (Use-After-Free) issues. Due to the lengthy analysis of
the pcrypt_aead01 function call, I'll describe the problem scenario
using a simplified model:
Suppose there's a user of padata named `user_function` that adheres to
the padata requirement of calling `padata_free_shell` after `serial()`
has been invoked, as demonstrated in the following code:
```c
struct request {
struct padata_priv padata;
struct completion *done;
};
void parallel(struct padata_priv *padata) {
do_something();
}
void serial(struct padata_priv *padata) {
struct request *request = container_of(padata,
struct request,
padata);
complete(request->done);
}
void user_function() {
DECLARE_COMPLETION(done)
padata->parallel = parallel;
padata->serial = serial;
padata_do_parallel();
wait_for_completion(&done);
padata_free_shell();
}
```
In the corresponding padata.c file, there's the following code:
```c
static void padata_serial_worker(struct work_struct *serial_work) {
...
cnt = 0;
while (!list_empty(&local_list)) {
...
padata->serial(padata);
cnt++;
}
local_bh_enable();
if (refcount_sub_and_test(cnt, &pd->refcnt))
padata_free_pd(pd);
}
```
Because of the high system load and the accumulation of unexecuted
softirq at this moment, `local_bh_enable()` in padata takes longer
to execute than usual. Subsequently, when accessing `pd->refcnt`,
`pd` has already been released by `padata_free_shell()`, resulting
in a UAF issue with `pd->refcnt`.
The fix is straightforward: add `refcount_dec_and_test` before calling
`padata_free_pd` in `padata_free_shell`.
Fixes: 07928d9bfc ("padata: Remove broken queue flushing")
Signed-off-by: WangJinchao <wangjinchao@xfusion.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3737df782c740b944912ed93420c57344b1cf864 ]
Use a similar approach as commit a419beac4a ("module/decompress: use
vmalloc() for zstd decompression workspace") and replace kmalloc() with
vmalloc() also for the gzip module decompression workspace.
In this case the workspace is represented by struct inflate_workspace
that can be fairly large for kmalloc() and it can potentially lead to
allocation errors on certain systems:
$ pahole inflate_workspace
struct inflate_workspace {
struct inflate_state inflate_state; /* 0 9544 */
/* --- cacheline 149 boundary (9536 bytes) was 8 bytes ago --- */
unsigned char working_window[32768]; /* 9544 32768 */
/* size: 42312, cachelines: 662, members: 2 */
/* last cacheline: 8 bytes */
};
Considering that there is no need to use continuous physical memory,
simply switch to vmalloc() to provide a more reliable in-kernel module
decompression.
Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d35381aa73f7e1e8b25f3ed5283287a64d9ddff5 ]
htab_lock_bucket uses the following logic to avoid recursion:
1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
2.1. if map_lock[hash] is already taken, return -BUSY;
3. raw_spin_lock_irqsave();
However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not able to access the same hash of the hashtab and get -EBUSY.
This -EBUSY is not really necessary. Fix it by disabling IRQ before
checking map_locked:
1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
3.1. if map_lock[hash] is already taken, return -BUSY;
4. raw_spin_lock().
Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().
Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 29a7e00ffadddd8d68eff311de1bf12ae10687bb ]
When employed within a sleepable program not under RCU protection, the
use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log,
particularly when CONFIG_PROVE_RCU is enabled:
[ 1259.662357] WARNING: suspicious RCU usage
[ 1259.662358] 6.5.0+ #33 Not tainted
[ 1259.662360] -----------------------------
[ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
Other info that might help to debug this:
[ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1
[ 1259.662368] 1 lock held by trace/72954:
[ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0
Stack backtrace:
[ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33
[ 1259.662391] Call Trace:
[ 1259.662393] <TASK>
[ 1259.662395] dump_stack_lvl+0x6e/0x90
[ 1259.662401] dump_stack+0x10/0x20
[ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0
[ 1259.662412] task_css_set.part.0+0x23/0x30
[ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0
[ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93
[ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000
[ 1259.662439] bpf_lsm_bpf+0x5/0x20
[ 1259.662443] ? security_bpf+0x32/0x50
[ 1259.662452] __sys_bpf+0xe6/0xdd0
[ 1259.662463] __x64_sys_bpf+0x1a/0x30
[ 1259.662467] do_syscall_64+0x38/0x90
[ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 1259.662479] RIP: 0033:0x7f487baf8e29
[...]
[ 1259.662504] </TASK>
This issue can be reproduced by executing a straightforward program, as
demonstrated below:
SEC("lsm.s/bpf")
int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size)
{
struct cgroup *cgrp = NULL;
struct task_struct *task;
int ret = 0;
if (cmd != BPF_LINK_CREATE)
return 0;
// The cgroup2 should be mounted first
cgrp = bpf_cgroup_from_id(1);
if (!cgrp)
goto out;
task = bpf_get_current_task_btf();
if (bpf_task_under_cgroup(task, cgrp))
ret = -1;
bpf_cgroup_release(cgrp);
out:
return ret;
}
After running the program, if you subsequently execute another BPF program,
you will encounter the warning.
It's worth noting that task_under_cgroup_hierarchy() is also utilized by
bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup()
doesn't exhibit this issue because it cannot be used in sleepable BPF
programs.
Fixes: b5ad4cdc46 ("bpf: Add bpf_task_under_cgroup() kfunc")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 06d686f771ddc27a8554cd8f5b22e071040dc90e ]
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg
type before using reg->subprogno. This can accidently permit invalid
pointers from being passed into callback helpers (e.g. silently from
different paths). Likewise, reg->subprogno from the per-register type
union may not be meaningful either. We need to reject any other type
except PTR_TO_FUNC.
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Fixes: 5d92ddc3de ("bpf: Add callback validation to kfunc verifier logic")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2b5dcb31a19a2e0acd869b12c9db9b2d696ef544 ]
From commit ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), the tailcall on x64 works better than before.
From commit e411901c0b ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
From commit 5b92a28aae ("bpf: Support attaching tracing BPF program
to other BPF programs"), BPF program is able to trace other BPF programs.
How about combining them all together?
1. FENTRY/FEXIT on a BPF subprogram.
2. A tailcall runs in the BPF subprogram.
3. The tailcall calls the subprogram's caller.
As a result, a tailcall infinite loop comes up. And the loop would halt
the machine.
As we know, in tail call context, the tail_call_cnt propagates by stack
and rax register between BPF subprograms. So do in trampolines.
Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a0b0bad10587ae2948a7c36ca4ffc206007fbcf3 ]
When a CPU is about to be offlined, x86 validates that all active
interrupts which are targeted to this CPU can be migrated to the remaining
online CPUs. If not, the offline operation is aborted.
The validation uses irq_matrix_allocated() to retrieve the number of
vectors which are allocated on the outgoing CPU. The returned number of
allocated vectors includes also vectors which are associated to managed
interrupts.
That's overaccounting because managed interrupts are:
- not migrated when the affinity mask of the interrupt targets only
the outgoing CPU
- migrated to another CPU, but in that case the vector is already
pre-allocated on the potential target CPUs and must not be taken into
account.
As a consequence the check whether the remaining online CPUs have enough
capacity for migrating the allocated vectors from the outgoing CPU might
fail incorrectly.
Let irq_matrix_allocated() return only the number of allocated non-managed
interrupts to make this validation check correct.
[ tglx: Amend changelog and fixup kernel-doc comment ]
Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231020072522.557846-1-yu.c.chen@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f06cc667f79909e9175460b167c277b7c64d3df0 ]
Namhyung reported that bd27568117 ("perf: Rewrite core context handling")
regresses context switch overhead when perf-cgroup is in use together
with 'slow' PMUs like uncore.
Specifically, perf_cgroup_switch()'s perf_ctx_disable() /
ctx_sched_out() etc.. all iterate the full list of active PMUs for
that CPU, even if they don't have cgroup events.
Previously there was cgrp_cpuctx_list which linked the relevant PMUs
together, but that got lost in the rework. Instead of re-instruducing
a similar list, let the perf_event_pmu_context iteration skip those
that do not have cgroup events. This avoids growing multiple versions
of the perf_event_pmu_context iteration.
Measured performance (on a slightly different patch):
Before)
$ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 10000 pipe operations between two processes
Total time: 0.901 [sec]
90.128700 usecs/op
11095 ops/sec
After)
$ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 10000 pipe operations between two processes
Total time: 0.065 [sec]
6.560100 usecs/op
152436 ops/sec
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Namhyung Kim <namhyung@kernel.org>
Debugged-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231009210425.GC6307@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4a8e65b0c348e42107c64381e692e282900be361 ]
SRCU callbacks acceleration might fail if the preceding callbacks
advance also fails. This can happen when the following steps are met:
1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the
RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12).
2) The grace period for RCU_WAIT_TAIL is observed as started but not yet
completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5.
3) This value is passed to rcu_segcblist_advance() which can't move
any segment forward and fails.
4) srcu_gp_start_if_needed() still proceeds with callback acceleration.
But then the call to rcu_seq_snap() observes the grace period for the
RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one
for the RCU_NEXT_READY_TAIL segment as started
(ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the
next grace period, which is 16.
5) The value of 16 is passed to rcu_segcblist_accelerate() but the
freshly enqueued callback in RCU_NEXT_TAIL can't move to
RCU_NEXT_READY_TAIL which already has callbacks for a previous grace
period (gp_num = 12). So acceleration fails.
6) Note in all these steps, srcu_invoke_callbacks() hadn't had a chance
to run srcu_invoke_callbacks().
Then some very bad outcome may happen if the following happens:
7) Some other CPU races and starts the grace period number 16 before the
CPU handling previous steps had a chance. Therefore srcu_gp_start()
isn't called on the latter sdp to fix the acceleration leak from
previous steps with a new pair of call to advance/accelerate.
8) The grace period 16 completes and srcu_invoke_callbacks() is finally
called. All the callbacks from previous grace periods (8 and 12) are
correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL
still remain. Then rcu_segcblist_accelerate() is called with a
snaphot of 20.
9) Since nothing started the grace period number 20, callbacks stay
unhandled.
This has been reported in real load:
[3144162.608392] INFO: task kworker/136:12:252684 blocked for more
than 122 seconds.
[3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1
[3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[3144162.631162] kworker/136:12 D 0 252684 2 0x90004000
[3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
[3144162.631192] Call Trace:
[3144162.631202] __schedule+0x2ee/0x660
[3144162.631206] schedule+0x33/0xa0
[3144162.631209] schedule_timeout+0x1c4/0x340
[3144162.631214] ? update_load_avg+0x82/0x660
[3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30
[3144162.631218] wait_for_completion+0x119/0x180
[3144162.631220] ? wake_up_q+0x80/0x80
[3144162.631224] __synchronize_srcu.part.19+0x81/0xb0
[3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10
[3144162.631227] synchronize_srcu+0x5f/0xc0
[3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm]
[3144162.631239] ? __schedule+0x2f6/0x660
[3144162.631243] process_one_work+0x19a/0x3a0
[3144162.631244] worker_thread+0x37/0x3a0
[3144162.631247] kthread+0x117/0x140
[3144162.631247] ? process_one_work+0x3a0/0x3a0
[3144162.631248] ? __kthread_cancel_work+0x40/0x40
[3144162.631250] ret_from_fork+0x1f/0x30
Fix this with taking the snapshot for acceleration _before_ the read
of the current grace period number.
The only side effect of this solution is that callbacks advancing happen
then _after_ the full barrier in rcu_seq_snap(). This is not a problem
because that barrier only cares about:
1) Ordering accesses of the update side before call_srcu() so they don't
bleed.
2) See all the accesses prior to the grace period of the current gp_num
The only things callbacks advancing need to be ordered against are
carried by snp locking.
Reported-by: Yong He <alexyonghe@tencent.com>
Co-developed-by:: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
Fixes: da915ad5cf ("srcu: Parallelize callback handling")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d91bdd96b55cc3ce98d883a60f133713821b80a6 ]
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.
This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.
This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So smt_hotplug_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.
This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the core CPU hot-plug core to
refuse to bring up the APs.
This happens because smt_hotplug_control is set to CPU_SMT_NOT_SUPPORTED
which causes cpu_smt_allowed() to evaluate the unpopulated primary thread
mask with the conclusion that all non-boot CPUs are not valid to be
plugged.
Make cpu_smt_allowed() more robust and take CPU_SMT_NOT_SUPPORTED and
CPU_SMT_NOT_IMPLEMENTED into account. Rename it to cpu_bootable() while at
it as that makes it more clear what the function is about.
The primary mask issue on x86 XEN/PV needs to be addressed separately as
there are users outside of the CPU hotplug code too.
Fixes: 05736e4ac1 ("cpu/hotplug: Provide knobs to control SMT")
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.149440843@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6fcdb0183bf024a70abccb0439321c25891c708d ]
Commit a86ce68078 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE
& CS_SCHED_LOAD_BALANCE handling") adds a new helper function
update_partition_sd_lb() to update the load balance state of the
cpuset. However the new load balance is determined by just looking at
whether the cpuset is a valid isolated partition root or not. That is
not enough if the cpuset is not a valid partition root but its parent
is in the isolated state (load balance off). Update the function to
set the new state to be the same as its parent in this case like what
has been done in commit c8c926200c ("cgroup/cpuset: Inherit parent's
load balance state in v2").
Fixes: a86ce68078 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c73801ae4f22b390228ebf471d55668e824198b6 ]
On no-MMU, all futexes are treated as private because there is no need
to map a virtual address to physical to match the futex across
processes. This doesn't quite work though, because private futexes
include the current process's mm_struct as part of their key. This makes
it impossible for one process to wake up a shared futex being waited on
in another process.
Fix this bug by excluding the mm_struct from the key. With
a single address space, the futex address is already a unique key.
Fixes: 784bdf3bb6 ("futex: Assume all mappings are private on !MMU systems")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20231019204548.1236437-2-ben.wolsieffer@hefring.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 ]
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
hotplug stress-test -- notably affine_move_task() remains stuck in
wait_for_completion(), leading to a hung-task detector warning.
Specifically, it was reported that stop_one_cpu_nowait(.fn =
migration_cpu_stop) returns false -- this stopper is responsible for
the matching complete().
The race scenario is:
CPU0 CPU1
// doing _cpu_down()
__set_cpus_allowed_ptr()
task_rq_lock();
takedown_cpu()
stop_machine_cpuslocked(take_cpu_down..)
<PREEMPT: cpu_stopper_thread()
MULTI_STOP_PREPARE
...
__set_cpus_allowed_ptr_locked()
affine_move_task()
task_rq_unlock();
<PREEMPT: cpu_stopper_thread()\>
ack_state()
MULTI_STOP_RUN
take_cpu_down()
__cpu_disable();
stop_machine_park();
stopper->enabled = false;
/>
/>
stop_one_cpu_nowait(.fn = migration_cpu_stop);
if (stopper->enabled) // false!!!
That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
stopper thread gets a chance to preempt and allows the cpu-down for
the target CPU to complete.
OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
issue a wakeup, it must not be ran under the scheduler locks.
Solve this apparent contradiction by keeping preemption disabled over
the unlock + queue_stopper combination:
preempt_disable();
task_rq_unlock(...);
if (!stop_pending)
stop_one_cpu_nowait(...)
preempt_enable();
This respects the lock ordering contraints while still avoiding the
above race. That is, if we find the CPU is online under rq-lock, the
targeted stop_one_cpu_nowait() must succeed.
Apply this pattern to all similar stop_one_cpu_nowait() invocations.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 23c9519def98ee0fa97ea5871535e9b136f522fc ]
find_energy_efficient_cpu() bails out early if effective util of the
task is 0 as the delta at this point will be zero and there's nothing
for EAS to do. When uclamp is being used, this could lead to wrong
decisions when uclamp_max is set to 0. In this case the task is capped
to performance point 0, but it is actually running and consuming energy
and we can benefit from EAS energy calculations.
Rework the condition so that it bails out when both util and uclamp_min
are 0.
We can do that without needing to use uclamp_task_util(); remove it.
Fixes: d81304bc61 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-3-qyousef@layalina.io
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6b00a40147653c8ea748e8f4396510f252763364 ]
When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to uclamp_max value we force-fit
it there.
The way the condition for checking for max_spare_cap in
find_energy_efficient_cpu() was constructed; it ignored any CPU that has
its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
hence ending up never performing compute_energy() for this cluster and
missing an opportunity for a better energy efficient placement to honour
uclamp_max setting.
max_spare_cap = 0;
cpu_cap = capacity_of(cpu) - cpu_util(p); // 0 if cpu_util(p) is high
...
util_fits_cpu(...); // will return true if uclamp_max forces it to fit
...
// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
if (cpu_cap > max_spare_cap) {
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
}
prev_spare_cap suffers from a similar problem.
Fix the logic by converting the variables into long and treating -1
value as 'not populated' instead of 0 which is a viable and correct
spare capacity value. We need to be careful signed comparison is used
when comparing with cpu_cap in one of the conditions.
Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-2-qyousef@layalina.io
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c0490bc9bb62d9376f3dd4ec28e03ca0fef97152 ]
We don't need to maintain per-queue leaf_cfs_rq_list on !SMP, since
it's used for cfs_rq load tracking & balancing on SMP.
But sched debug interface uses it to print per-cfs_rq stats.
This patch fixes the !SMP version of cfs_rq_is_decayed(), so the
per-queue leaf_cfs_rq_list is also maintained correctly on !SMP,
to fix the warning in assert_list_leaf_cfs_rq().
Fixes: 0a00a35464 ("sched/fair: Delete useless condition in tg_unthrottle_up()")
Reported-by: Leo Yu-Chi Liang <ycliang@andestech.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Leo Yu-Chi Liang <ycliang@andestech.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Closes: https://lore.kernel.org/all/ZN87UsqkWcFLDxea@swlinux02/
Link: https://lore.kernel.org/r/20230913132031.2242151-1-chengming.zhou@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit bb32500fb9b78215e4ef6ee8b4345c5f5d7eafb4 upstream
The following can crash the kernel:
# cd /sys/kernel/tracing
# echo 'p:sched schedule' > kprobe_events
# exec 5>>events/kprobes/sched/enable
# > kprobe_events
# exec 5>&-
The above commands:
1. Change directory to the tracefs directory
2. Create a kprobe event (doesn't matter what one)
3. Open bash file descriptor 5 on the enable file of the kprobe event
4. Delete the kprobe event (removes the files too)
5. Close the bash file descriptor 5
The above causes a crash!
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:tracing_release_file_tr+0xc/0x50
What happens here is that the kprobe event creates a trace_event_file
"file" descriptor that represents the file in tracefs to the event. It
maintains state of the event (is it enabled for the given instance?).
Opening the "enable" file gets a reference to the event "file" descriptor
via the open file descriptor. When the kprobe event is deleted, the file is
also deleted from the tracefs system which also frees the event "file"
descriptor.
But as the tracefs file is still opened by user space, it will not be
totally removed until the final dput() is called on it. But this is not
true with the event "file" descriptor that is already freed. If the user
does a write to or simply closes the file descriptor it will reference the
event "file" descriptor that was just freed, causing a use-after-free bug.
To solve this, add a ref count to the event "file" descriptor as well as a
new flag called "FREED". The "file" will not be freed until the last
reference is released. But the FREE flag will be set when the event is
removed to prevent any more modifications to that event from happening,
even if there's still a reference to the event "file" descriptor.
Link: https://lore.kernel.org/linux-trace-kernel/20231031000031.1e705592@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231031122453.7a48b923@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: f5ca233e2e ("tracing: Increase trace array ref count on enable and filter files")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments.
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmU82MUbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMZ0H+wZHWVUsmqGLGNCt3gfi
m2EJX83VMwY8PzpwZ5ezrx4ibAcUyo7Dhh8OniGgEazC3BNeggoUu/HwpirS22gI
Tx0EMlgLOJQykauiUe6FPem0IbrlbQMI1gLplx6cVd8lgIYZQfMIM5gI0kuCywT3
Ka9sCgp6y3UKQNtHKFwtPRLYFTF3Afyy2C01wdsa800SEqeOAeTD9+8yz7ZnuFt+
bNgu6vJGFfJHkEkvYCwFFqZ1eIfXON6lUFpijNpCGvMN2h1XArLexSk8JRBf6j2+
8+1FrRQsTXRk3G6v9uQABeK7z5W2F8gufmSFyBlXajbZp2HT6j4s2S86u5lP9P9J
l1U=
=etyx
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name
* tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix symbol counting logic by looking at modules as well
tracing/kprobes: Fix the description of variable length arguments
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.
Fix this my calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.
Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/
Cc: Francis Laniel <flaniel@linux.microsoft.com>
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes: b022f0c7e4 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fix the following kernel-doc warnings:
kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start'
kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields'
Refer to the usage of variable length arguments elsewhere in the kernel
code, "@..." is the proper way to express it in the description.
Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310190437.paI6LYJF-lkp@intel.com/
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When allocating a new pool at runtime, reduce the number of slabs so
that the allocation order is at most MAX_ORDER. This avoids a kernel
warning in __alloc_pages().
The warning is relatively benign, because the pool size is subsequently
reduced when allocation fails, but it is silly to start with a request
that is known to fail, especially since this is the default behavior if
the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any
swiotlb= parameter.
Reported-by: Ben Greear <greearb@candelatech.com>
Closes: https://lore.kernel.org/netdev/4f173dd2-324a-0240-ff8d-abf5c191be18@candelatech.com/
Fixes: 1aaa736815 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- kprobe-events: Fix kprobe events to reject if the attached symbol
is not unique name because it may not the function which the user
want to attach to. (User can attach a probe to such symbol using
the nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmUzdQobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMNAH/inFWv8e+rMm8F5Po6ZI
CmBxuZbxy2l+KfYDjXqSHu7TLKngVd6Bhdb5H2K7fgdwiZxrS0i6qvdppo+Cxgop
Yod06peDTM80IKavioCcOJOwLPGXXpZkMlK5fdC48HN6vrf9km4vws5ZAagfc1ng
YhnYm1HHeXcIYwtLkE2dCr6HkwkaOebWTLdZ8c70d1OPw0L9rzxH+edjhKCq8uIw
6WUg9ERxJYPUuCkQxOxVJrTdzNMRXsgf28FHc0LyYRm8kDpECT2BP6e/Y+TBbsX5
2pN5cUY5qfI6t3Pc1HDs2KX8ui2QCmj0mCvT0VixhdjThdHpRf0VjIFFAANf3LNO
XVA=
=O1Aa
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobe-events: Fix kprobe events to reject if the attached symbol is
not unique name because it may not the function which the user want
to attach to. (User can attach a probe to such symbol using the
nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
* tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Add new test case which checks non unique symbol
tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols
When a kprobe is attached to a function that's name is not unique (is
static and shares the name with other functions in the kernel), the
kprobe is attached to the first function it finds. This is a bug as the
function that it is attaching to is not necessarily the one that the
user wants to attach to.
Instead of blindly picking a function to attach to what is ambiguous,
error with EADDRNOTAVAIL to let the user know that this function is not
unique, and that the user must use another unique function with an
address offset to get to the function they want to attach to.
Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/
Cc: stable@vger.kernel.org
Fixes: 413d37d1eb ("tracing: Add kprobe-based event tracer")
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
=Dqrj
-----END PGP SIGNATURE-----
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
Because group consistency is non-atomic between parent (filedesc) and children
(inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum
non-matching counter groups -- with non-sensical results.
Add group_generation to distinguish the case where a parent group removes and
adds an event and thus has the same number, but a different configuration of
events as inherited groups.
This became a problem when commit fa8c269353 ("perf/core: Invert
perf_read_group() loops") flipped the order of child_list and sibling_list.
Previously it would iterate the group (sibling_list) first, and for each
sibling traverse the child_list. In this order, only the group composition of
the parent is relevant. By flipping the order the group composition of the
child (inherited) events becomes an issue and the mis-match in group
composition becomes evident.
That said; even prior to this commit, while reading of a group that is not
equally inherited was not broken, it still made no sense.
(Ab)use ECHILD as error return to indicate issues with child process group
composition.
Fixes: fa8c269353 ("perf/core: Invert perf_read_group() loops")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- In cgroup1, the `tasks` file could have duplicate pids which can trigger a
warning in seq_file. Fix it by removing duplicate items after sorting.
- Comment update.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh+2A4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfASAP9dgEe1Ay6jkJoCCGROnjPRDj2j7Cm9WWcHV79X
0Pr3zQEA/vFIpUzRZGbisrvnyXwNNLX12Hq/nwRX6DzN4UIkDgE=
=tPdI
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- In cgroup1, the `tasks` file could have duplicate pids which can
trigger a warning in seq_file. Fix it by removing duplicate items
after sorting
- Comment update
* tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix incorrect css_set_rwsem reference in comment
cgroup: Remove duplicates in cgroup v1 tasks file
* Fix access-after-free in pwq allocation error path.
* Implicitly ordered unbound workqueues should lose the implicit ordering if
an attribute change which isn't compatible with ordered operation is
requested. However, attribute changes requested through the sysfs
interface weren't doing that leaving no way to override the implicit
ordering through the sysfs interface. Fix it.
* Other doc and misc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh9vQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGTG4AQCklH7aGqSbzGBPuV19gN6q+BPjkNNLTkEtOzW7
3t1gewEAuwGiGr5FwuxCuGhzTUm5dkELFFsKdzYk+Pt7B2M5Pg0=
=YQtE
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Fix access-after-free in pwq allocation error path
- Implicitly ordered unbound workqueues should lose the implicit
ordering if an attribute change which isn't compatible with ordered
operation is requested. However, attribute changes requested through
the sysfs interface weren't doing that leaving no way to override the
implicit ordering through the sysfs interface. Fix it.
- Other doc and misc updates
* tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix -Wformat-truncation in create_worker
workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
workqueue: doc: Fix function and sysfs path errors
workqueue: Fix UAF report by KASAN in pwq_release_workfn()
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmUnw0USHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkN0EP/RKl317fLqlm6ZzRUMVP169CNRAgMaBG
7FIwxlCv4hfO2Rx09Mxu2wjDp+tBQKqBKaxfcwh8tEdLMqqCymOW2K5+tWVty8C8
TJJS+zggqLAo7DjXbnT8GBm5owHPLKGNxW6vRmnw9xraCD/nuV1wqolI2+l4IxB+
kqfliltepnJSakg0uXg7/uwAE87slBzX5VgB6K5JKLiiDMD8tYoAUmZzH8bMJd0l
Cl7+L+ucRfQkj0DPfuZM/FncM0el7oFB6imnKd36hD6vfDfCNxpyNBYG1yZ/61/N
7H3E595Hr9PA+YBZjja3UvQGbFXkyMHloQdYxmq4s0T2WHqKwRyjLlwPayMXvavn
OTJh2VAs68ivtti0ry5Nbgz4viiNfr32PLyZr6XySwCZ1/TCLjV4Cq9IYnaP3YeM
KA+CIl3d0asQdZuMXTBivmtF65Buawt9UX/gJzUst2mNdcqhV1RTNWDNWoFLQ0qW
gz8XN68V5LhbaaOq/Lat80krWgNLNZIlTNmSsE/Ie799w7dAHn/xvT6h+h5pF1XX
dhng9NK7RL7KVcI/9walArOnhz9ksGWc2+JPMQohuPM/ITMHW11oOUOX6NwAre5m
hBJKh+Rz7ylLDLn33C4qowUhxnJlqqm+rDCVDTmoYngEFQvhEl19mfndSsC8P/K/
xXQJ+diS/Jug
=orAS
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN and BPF.
We have a regression in TC currently under investigation, otherwise
the things that stand off most are probably the TCP and AF_PACKET
fixes, with both issues coming from 6.5.
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list"
* tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
rswitch: Fix imbalance phy_power_off() calling
rswitch: Fix renesas_eth_sw_remove() implementation
octeontx2-pf: Fix page pool frag allocation warning
nfc: nci: assert requested protocol is valid
af_packet: Fix fortified memcpy() without flex array.
net: tcp: fix crashes trying to free half-baked MTU probes
net/smc: Fix pos miscalculation in statistics
nfp: flower: avoid rmmod nfp crash issues
net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
ethtool: Fix mod state of verbose no_mask bitset
net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
mctp: perform route lookups under a RCU read-side lock
net: skbuff: fix kernel-doc typos
s390/bpf: Fix unwinding past the trampoline
s390/bpf: Fix clobbering the caller's backchain in the trampoline
net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
net/smc: Fix dependency of SMC on ISM
ixgbe: fix crash with empty VF macvlan list
net/mlx5e: macsec: use update_pn flag instead of PN comparation
net: phy: mscc: macsec: reject PN update requests
...
Compiling with W=1 emitted the following warning
(Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
"Treat warnings as errors" turned off):
kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
truncated writing between 1 and 10 bytes into a region of size
between 5 and 14 [-Wformat-truncation=]
kernel/workqueue.c:2188:50: note: directive argument in the range
[0, 2147483647]
kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
into a destination of size 16
setting "id_buf" to size 23 will silence the warning, since GCC
determines snprintf's output to be max. 23 bytes in line 2188.
Please let me know if there are any mistakes in my patch!
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
to be ordered") enabled implicit ordered attribute to be added to
WQ_UNBOUND workqueues with max_active of 1. This prevented the changing
of attributes to these workqueues leading to fix commit 0a94efb5ac
("workqueue: implicit ordered attribute should be overridable").
However, workqueue_apply_unbound_cpumask() was not updated at that time.
So sysfs changes to wq_unbound_cpumask has no effect on WQ_UNBOUND
workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
workqueues are visible on sysfs, we are not able to make all the
necessary cpumask changes even if we iterates all the workqueue cpumasks
in sysfs and changing them one by one.
Fix this problem by applying the corresponding change made
to apply_workqueue_attrs_locked() in the fix commit to
workqueue_apply_unbound_cpumask().
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, the kfree() be used for pwq objects allocated with
kmem_cache_alloc() in alloc_and_link_pwqs(), this isn't wrong.
but usually, use "trace_kmem_cache_alloc/trace_kmem_cache_free"
to track memory allocation and free. this commit therefore use
kmem_cache_free() instead of kfree() in alloc_and_link_pwqs()
and also consistent with release of the pwq in rcu_free_pwq().
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The verifier, as part of check_return_code(), verifies that async
callbacks such as from e.g. timers, will return 0. It does this by
correctly checking that R0->var_off is in tnum_const(0), which
effectively checks that it's in a range of 0. If this condition fails,
however, it prints an error message which says that the value should
have been in (0x0; 0x1). This results in possibly confusing output such
as the following in which an async callback returns 1:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1)
The fix is easy -- we should just pass the tnum_const(0) as the correct
range to verbose_invalid_scalar(), which will then print the following:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0)
Fixes: bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
One PID may appear multiple times in a preloaded pidlist.
(Possibly due to PID recycling but we have reports of the same
task_struct appearing with different PIDs, thus possibly involving
transfer of PID via de_thread().)
Because v1 seq_file iterator uses PIDs as position, it leads to
a message:
> seq_file: buggy .next function kernfs_seq_next did not update position index
Conservative and quick fix consists of removing duplicates from `tasks`
file (as opposed to removing pidlists altogether). It doesn't affect
correctness (it's sufficient to show a PID once), performance impact
would be hidden by unconditional sorting of the pidlist already in place
(asymptotically).
Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/
Suggested-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Commit 9e70a5e109 ("printk: Add per-console suspended state")
removed console lock usage during resume and replaced it with
the clearly defined console_list_lock and srcu mechanisms.
However, the console lock usage had an important side-effect
of flushing the consoles. After its removal, consoles were no
longer flushed before checking their progress.
Add the console_lock/console_unlock dance to the beginning
of __pr_flush() to actually flush the consoles before checking
their progress. Also add comments to clarify this additional
usage of the console lock.
Note that console_unlock() does not guarantee flushing all messages
since the commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes").
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955
Fixes: 9e70a5e109 ("printk: Add per-console suspended state")
Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20231006082151.6969-2-pmladek@suse.com