- Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM enabled
- Fix allocation leaks in bootconfig tool
- Fix a double initialization of a variable
- Fix API bootconfig usage from kprobe boot time events
- Reject NULL location for kprobes
- Fix crash caused by preempt delay module not cleaning up kthread
correctly
- Add vmalloc_sync_mappings() to prevent x86_64 page faults from
recursively faulting from tracing page faults
- Fix comment in gpu/trace kerneldoc header
- Fix documentation of how to create a trace event class
- Make the local tracing_snapshot_instance_cond() function static
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXrRUBhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qveTAP4iNCnMeS/Isb+MXQx2Pnu7OP+0BeRP
2ahlKG2sBgEdnwEAoUzxQoYWtfC6xoM38YwLuZPRlcScRea/5CRHyW8BFQc=
=o3pV
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM
enabled
- Fix allocation leaks in bootconfig tool
- Fix a double initialization of a variable
- Fix API bootconfig usage from kprobe boot time events
- Reject NULL location for kprobes
- Fix crash caused by preempt delay module not cleaning up kthread
correctly
- Add vmalloc_sync_mappings() to prevent x86_64 page faults from
recursively faulting from tracing page faults
- Fix comment in gpu/trace kerneldoc header
- Fix documentation of how to create a trace event class
- Make the local tracing_snapshot_instance_cond() function static
* tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tools/bootconfig: Fix resource leak in apply_xbc()
tracing: Make tracing_snapshot_instance_cond() static
tracing: Fix doc mistakes in trace sample
gpu/trace: Minor comment updates for gpu_mem_total tracepoint
tracing: Add a vmalloc_sync_mappings() for safe measure
tracing: Wait for preempt irq delay thread to finish
tracing/kprobes: Reject new event if loc is NULL
tracing/boottime: Fix kprobe event API usage
tracing/kprobes: Fix a double initialization typo
bootconfig: Fix to remove bootconfig data from initrd while boot
Fix the following sparse warning:
kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond'
was not declared. Should it be static?
Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu
areas can be complex, to say the least. Mappings may happen at boot up, and
if nothing synchronizes the page tables, those page mappings may not be
synced till they are used. This causes issues for anything that might touch
one of those mappings in the path of the page fault handler. When one of
those unmapped mappings is touched in the page fault handler, it will cause
another page fault, which in turn will cause a page fault, and leave us in
a loop of page faults.
Commit 763802b53a ("x86/mm: split vmalloc_sync_all()") split
vmalloc_sync_all() into vmalloc_sync_unmappings() and
vmalloc_sync_mappings(), as on system exit, it did not need to do a full
sync on x86_64 (although it still needed to be done on x86_32). By chance,
the vmalloc_sync_all() would synchronize the page mappings done at boot up
and prevent the per cpu area from being a problem for tracing in the page
fault handler. But when that synchronization in the exit of a task became a
nop, it caused the problem to appear.
Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Suggested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running on a slower machine, it is possible that the preempt delay kernel
thread may still be executing if the module was immediately removed after
added, and this can cause the kernel to crash as the kernel thread might be
executing after its code has been removed.
There's no reason that the caller of the code shouldn't just wait for the
delay thread to finish, as the thread can also be created by a trigger in
the sysfs code, which also has the same issues.
Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com
Cc: stable@vger.kernel.org
Fixes: 793937236d ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Reported-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
fixes.2020.04.27a: Miscellaneous fixes.
kfree_rcu.2020.04.27a: Changes related to kfree_rcu().
rcu-tasks.2020.04.27a: Addition of new RCU-tasks flavors.
stall.2020.04.27a: RCU CPU stall-warning updates.
torture.2020.05.07a: Torture-test updates.
This commit converts three ULONG_CMP_LT() invocations in rcutorture to
time_before() to reflect the fact that they are comparing timestamps to
the jiffies counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit fixes the following sparse warning:
kernel/rcu/rcutorture.c:1695:16: warning: symbol 'rcu_fwds' was not declared. Should it be static?
kernel/rcu/rcutorture.c:1696:6: warning: symbol 'rcu_fwd_emergency_stop' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit provides an rcutorture.stall_gp_kthread module parameter
to allow rcutorture to starve the grace-period kthread. This allows
testing the code that detects such starvation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit aids testing of RCU task stall warning messages by adding
an rcutorture.stall_cpu_block module parameter that results in the
induced stall sleeping within the RCU read-side critical section.
Spinning with interrupts disabled is still available via the
rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
of these two module parameters will spin with preemption disabled.
Note that sleeping (as opposed to preemption) results in additional
complaints from RCU at context-switch time, so yet more testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The refactored function is no longer required as the codepaths that call
freeze_secondary_cpus() are all suspend/resume related now.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-2-qais.yousef@arm.com
The single user could have called freeze_secondary_cpus() directly.
Since this function was a source of confusion, remove it as it's
just a pointless wrapper.
While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to
preserve the naming symmetry.
Done automatically via:
git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
Reject the new event which has NULL location for kprobes.
For kprobes, user must specify at least the location.
Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix boottime kprobe events to use API correctly for
multiple events.
For example, when we set a multiprobe kprobe events in
bootconfig like below,
ftrace.event.kprobes.myevent {
probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2"
}
This cause an error;
trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2
This shows the 1st argument becomes NULL and multiprobes
are merged to 1 probe.
Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 29a1548105 ("tracing: Change trace_boot to use kprobe_event interface")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix a typo that resulted in an unnecessary double
initialization to addr.
Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: c7411a1a12 ("tracing/kprobe: Check whether the non-suffixed symbol is notrace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Factor out a copy_siginfo_to_external32 helper from
copy_siginfo_to_user32 that fills out the compat_siginfo, but does so
on a kernel space data structure. With that we can let architectures
override copy_siginfo_to_user32 with their own implementations using
copy_siginfo_to_external32. That allows moving the x32 SIGCHLD purely
to x86 architecture code.
As a nice side effect copy_siginfo_to_external32 also comes in handy
for avoiding a set_fs() call in the coredump code later on.
Contains improvements from Eric W. Biederman <ebiederm@xmission.com>
and Arnd Bergmann <arnd@arndb.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Under rare circumstances, task_function_call() can repeatedly fail and
cause a soft lockup.
There is a slight race where the process is no longer running on the cpu
we targeted by the time remote_function() runs. The code will simply
try again. If we are very unlucky, this will continue to fail, until a
watchdog fires. This can happen in a heavily loaded, multi-core virtual
machine.
Reported-by: syzbot+bb4935a5c09b5ff79940@syzkaller.appspotmail.com
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200414222920.121401-1-brho@google.com
Fix to return negative error code -EFAULT from the copy_to_user() error
handling case instead of 0, as done elsewhere in this function.
Fixes: bd513cd08f ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com
Removing the pcrypt module triggers this:
general protection fault, probably for non-canonical
address 0xdead000000000122
CPU: 5 PID: 264 Comm: modprobe Not tainted 5.6.0+ #2
Hardware name: QEMU Standard PC
RIP: 0010:__cpuhp_state_remove_instance+0xcc/0x120
Call Trace:
padata_sysfs_release+0x74/0xce
kobject_put+0x81/0xd0
padata_free+0x12/0x20
pcrypt_exit+0x43/0x8ee [pcrypt]
padata instances wrongly use the same hlist node for the online and dead
states, so __padata_free()'s second cpuhp remove call chokes on the node
that the first poisoned.
cpuhp multi-instance callbacks only walk forward in cpuhp_step->list and
the same node is linked in both the online and dead lists, so the list
corruption that results from padata_alloc() adding the node to a second
list without removing it from the first doesn't cause problems as long
as no instances are freed.
Avoid the issue by giving each state its own node.
Fixes: 894c9ef978 ("padata: validate cpumask without removed CPU during offline")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The n_barrier_successes, n_barrier_attempts, and
n_rcu_torture_barrier_error variables are updated (without access
markings) by the main rcu_barrier() test kthread, and accessed (also
without access markings) by the rcu_torture_stats() kthread. This of
course can result in KCSAN complaints.
Because the accesses are in diagnostic prints, this commit uses
data_race() to excuse the diagnostic prints from the data race. If this
were to ever cause bogus statistics prints (for example, due to store
tearing), any misleading information would be disambiguated by the
presence or absence of an rcutorture splat.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely and due to the mild consequences of the
failure, namely a confusing rcutorture console message.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When all quiescent states have been seen, it is normally the grace-period
kthread that is in trouble. Although the existing stack trace from
the current CPU might possibly provide useful information, experience
indicates that there is too much noise for this to be worthwhile.
This commit therefore removes this stack trace from the output.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the grace-period kthread is starved, idle threads' extended quiescent
states are not reported. These idle threads thus wrongly appear to
be blocking the current grace period. This commit therefore tags such
idle threads as probable false positives when the grace-period kthread
is being starved.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Although the accesses used to determine whether or not an expedited
stall should be printed are an integral part of the concurrency algorithm
governing use of the corresponding variables, the values that are simply
printed are ancillary. As such, it is best to use data_race() for these
accesses in order to provide the greatest latitude in the use of KCSAN
for the other accesses that are an integral part of the algorithm. This
commit therefore changes the relevant uses of READ_ONCE() to data_race().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the schedule_on_each_cpu(ftrace_sync) instances
with synchronize_rcu_tasks_rude().
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
[ paulmck: Make Kconfig adjustments noted by kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit allows TASKS_TRACE_RCU to be used independently of TASKS_RCU
and vice versa.
[ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a failure-return count for smp_call_function_single(),
and adds this to the console messages for rcutorture writer stalls and at
the end of rcutorture testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a counter for the number of times the quiescent state
was an idle task associated with an offline CPU, and prints this count
at the end of rcutorture runs and at stall time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds counts of the number of calls and number of successful
calls to rcu_dynticks_zero_in_eqs(), which are printed at the end
of rcutorture runs and at stall time. This allows evaluation of the
effectiveness of rcu_dynticks_zero_in_eqs().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit scans the CPUs, adding each CPU's idle task to the list of
tasks that need quiescent states.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The idle task corresponding to an offline CPU can appear to be running
while that CPU is offline. This commit therefore adds checks for this
situation, treating it as a quiescent state. Because the tasklist scan
and the holdout-list scan now exclude CPU-hotplug operations, readers
on the CPU-hotplug paths are still waited for.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit disables CPU hotplug across RCU tasks trace scans, which
is a first step towards correctly recognizing idle tasks "running" on
offline CPUs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_read_unlock_trace() can invoke rcu_read_unlock_trace_special(),
which in turn can call wake_up(). Therefore, if any scheduler lock is
held across a call to rcu_read_unlock_trace(), self-deadlock can occur.
This commit therefore uses the irq_work facility to defer the wake_up()
to a clean environment where no scheduler locks will be held.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Update #includes for m68k per kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit provides a new TASKS_TRACE_RCU_READ_MB Kconfig option that
enables use of read-side memory barriers by both rcu_read_lock_trace()
and rcu_read_unlock_trace() when the are executed with the
current->trc_reader_special.b.need_mb flag set. This flag is currently
never set. Doing that is the subject of a later commit.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a grace-period count and a count of IPIs sent since
boot, which is printed in response to rcutorture writer stalls and at
the end of rcutorture testing. These counts will be used to evaluate
various schemes to reduce the number of IPIs sent.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit splits ->trc_reader_need_end by using the rcu_special union.
This change permits readers to check to see if a memory barrier is
required without any added overhead in the common case where no such
barrier is required. This commit also adds the read-side checking.
Later commits will add the machinery to properly set the new
->trc_reader_special.b.need_mb field.
This commit also makes rcu_read_unlock_trace_special() tolerate nested
read-side critical sections within interrupt and NMI handlers.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit provides a rcupdate.rcu_task_ipi_delay kernel boot parameter
that specifies how old the RCU tasks trace grace period must be before
the grace-period kthread starts sending IPIs. This delay allows more
tasks to pass through rcu_tasks_qs() quiescent states, thus reducing
(or even eliminating) the number of IPIs that must be sent.
On a short rcutorture test setting this kernel boot parameter to HZ/2
resulted in zero IPIs for all 877 RCU-tasks trace grace periods that
elapsed during that test.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a place to record the grace-period start in jiffies.
This will be used by later commits for debugging purposes and to throttle
IPIs early in the grace period.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit makes the calls to rcu_tasks_qs() detect and report
quiescent states for RCU tasks trace. If the task is in a quiescent
state and if ->trc_reader_checked is not yet set, the task sets its own
->trc_reader_checked. This will cause the grace-period kthread to
remove it from the holdout list if it still remains there.
[ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds state for each RCU-tasks flavor to the rcutorture
writer stall output. The initial state is minimal, but you have to
start somewhere.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fixes based on feedback from kbuild test robot. ]
This commit pushes the #ifdef CONFIG_TASKS_RCU_GENERIC from
kernel/rcu/update.c to kernel/rcu/tasks.h in order to improve
readability as more APIs are added.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds RCU CPU stall warnings for RCU Tasks Trace. These
dump out any tasks blocking the current grace period, as well as any
CPUs that have not responded to an IPI request. This happens in two
phases, when initially extracting state from the tasks and later when
waiting for any holdout tasks to check in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated. This commit therefore adds a variant of
Tasks RCU that:
o Has explicit read-side markers to allow finite grace periods in
the face of in-kernel loops for PREEMPT=n builds. These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace().
o Protects code in the idle loop, exception entry/exit, and
CPU-hotplug code paths. In this respect, RCU-tasks trace is
similar to SRCU, but with lighter-weight readers.
o Avoids expensive read-side instruction, having overhead similar
to that of Preemptible RCU.
There are of course downsides:
o The grace-period code can send IPIs to CPUs, even when those
CPUs are in the idle loop or in nohz_full userspace. This is
mitigated by later commits.
o It is necessary to scan the full tasklist, much as for Tasks RCU.
o There is a single callback queue guarded by a single lock,
again, much as for Tasks RCU. However, those early use cases
that request multiple grace periods in quick succession are
expected to do so from a single task, which makes the single
lock almost irrelevant. If needed, multiple callback queues
can be provided using any number of schemes.
Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.
The memory ordering was outlined here:
https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko. At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.
Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
[ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
[ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
[ paulmck: Fix locking issue reported by rcutorture. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit refactors RCU tasks to allow variants to be added. These
variants will share the current Tasks-RCU tasklist scan and the holdout
list processing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit causes the flavors of RCU Tasks to use different names
for their kthreads and in their console messages.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a "rude" variant of RCU-tasks that has as quiescent
states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
and (in theory, anyway) cond_resched(). In other words, RCU-tasks rude
readers are regions of code with preemption disabled, but excluding code
early in the CPU-online sequence and late in the CPU-offline sequence.
Updates make use of IPIs and force an IPI and a context switch on each
online CPU. This variant is useful in some situations in tracing.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply review feedback from Steve Rostedt. ]
This commit splits out generic processing from RCU-tasks-specific
processing in order to allow additional flavors to be added. It also
adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
infrastructure code.
This is primarily, but not entirely, a code-movement commit.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a crude test for synchronize_rcu_mult(). This is
currently a smoke test rather than a high-quality stress test.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit creates an rcu_tasks struct to hold state information for
RCU Tasks. This is a preparation commit for adding additional flavors
of Tasks RCU, each of which would have its own rcu_tasks struct.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This code-movement-only commit is in preparation for adding an additional
flavor of Tasks RCU, which relies on workqueues to detect grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
those tasks holding up the current grace period. This can be helpful,
but more can be even more helpful.
To this end, this commit adds the nesting level, whether the task
thinks it was preempted in its current RCU read-side critical section,
whether RCU core has asked this task for a quiescent state, whether the
expedited-grace-period hint is set, and whether the task believes that
it is on the blocked-tasks list (it must be, or it would not be printed,
but if things are broken, best not to take too much for granted).
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be done
immediately and directly. In practice, the task might start running at
any time, including during the sampling period. Gaining a consistent
sample of a not-running task therefore requires that something be done
to lock down the target task's state.
This commit therefore adds a try_invoke_on_locked_down_task() function
that invokes a specified function if the specified task can be locked
down, returning true if successful and if the specified function returns
true. Otherwise this function simply returns false. Given that the
function passed to try_invoke_on_nonrunning_task() might be invoked with
a runqueue lock held, that function had better be quite lightweight.
The function is passed the target task's task_struct pointer and the
argument passed to try_invoke_on_locked_down_task(), allowing easy access
to task state and to a location for further variables to be passed in
and out.
Note that the specified function will be called even if the specified
task is currently running. The function can use ->on_rq and task_curr()
to quickly and easily determine the task's state, and can return false
if this state is not to the function's liking. The caller of the
try_invoke_on_locked_down_task() would then see the false return value,
and could take appropriate action, for example, trying again later or
sending an IPI if matters are more urgent.
It is expected that use cases such as the RCU CPU stall warning code will
simply return false if the task is currently running. However, there are
use cases involving nohz_full CPUs where the specified function might
instead fall back to an alternative sampling scheme that relies on heavier
synchronization (such as memory barriers) in the target task.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
[ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the PREEMPT=y version of rcu_note_context_switch() does not
invoke rcu_tasks_qs(), and we need it to in order to keep RCU Tasks
Trace's IPIs down to a dull roar. This commit therefore enables this
hook.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
It is not as clear as it might be just where in RCU's idle entry/exit
code RCU stops and starts watching the current CPU. This commit therefore
adds comments calling out the transitions.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that it should be safe to hold scheduler locks across
rcu_read_unlock(), even in cases where the corresponding RCU read-side
critical section might have been preempted and boosted, the commit adds
a test of this capability to rcutorture. This has been tested on current
mainline (which can deadlock in this situation), and lockdep duly reported
the expected deadlock. On -rcu, lockdep is silent, thus far, anyway.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section. Consider the old vulnerability
involving rcu_read_unlock() being invoked within such a handler that
interrupted an __rcu_read_unlock_special(), in which a wakeup might be
invoked with a scheduler lock held. Because rcu_read_unlock_special()
no longer does wakeups in such situations, it is no longer necessary
for __rcu_read_unlock() to set the nesting level negative.
This commit therefore removes this recursion-protection code from
__rcu_read_unlock().
[ paulmck: Let rcu_exp_handler() continue to call rcu_report_exp_rdp(). ]
[ paulmck: Adjust other checks given no more negative nesting. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
rcu_read_unlock_special() but never set to false. This is not
particularly useful, so this commit removes this field.
The only possible justification for this field is to ease debugging
of RCU deferred quiscent states, but the combination of the other
->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
->rcu_read_lock_nesting should cover debugging needs. And if this last
proves incorrect, this patch can always be reverted, along with the
required setting of ->rcu_read_unlock_special.b.deferred_qs to false
in rcu_preempt_deferred_qs_irqrestore().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section. Consider the old vulnerability
involving rcu_preempt_deferred_qs() being invoked within such a handler
that interrupted an extended RCU read-side critical section, in which
a wakeup might be invoked with a scheduler lock held. Because
rcu_read_unlock_special() no longer does wakeups in such situations,
it is no longer necessary for rcu_preempt_deferred_qs() to set the
nesting level negative.
This commit therefore removes this recursion-protection code from
rcu_preempt_deferred_qs().
[ paulmck: Fix typo in commit log per Steve Rostedt. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The scheduler is currently required to hold rq/pi locks across the entire
RCU read-side critical section or not at all. This is inconvenient and
leaves traps for the unwary, including the author of this commit.
But now that excessively long grace periods enable scheduling-clock
interrupts for holdout nohz_full CPUs, the nohz_full rescue logic in
rcu_read_unlock_special() can be dispensed with. In other words, the
rcu_read_unlock_special() function can refrain from doing wakeups unless
such wakeups are guaranteed safe.
This commit therefore avoids unsafe wakeups, freeing the scheduler to
hold rq/pi locks across rcu_read_unlock() even if the corresponding RCU
read-side critical section might have been preempted. This commit also
updates RCU's requirements documentation.
This commit is inspired by a patch from Lai Jiangshan:
https://lore.kernel.org/lkml/20191102124559.1135-2-laijs@linux.alibaba.com
This commit is further intended to be a step towards his goal of permitting
the inlining of RCU-preempt's rcu_read_lock() and rcu_read_unlock().
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros
to move ahead.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds rcu_gp_might_be_stalled(), which returns true if there
is some reason to believe that the RCU grace period is stalled. The use
case is where an RCU free-memory path needs to allocate memory in order
to free it, a situation that should be avoided where possible.
But where it is necessary, there is always the alternative of using
synchronize_rcu() to wait for a grace period in order to avoid the
allocation. And if the grace period is stalled, allocating memory to
asynchronously wait for it is a bad idea of epic proportions: Far better
to let others use the memory, because these others might actually be
able to free that memory before the grace period ends.
Thus, rcu_gp_might_be_stalled() can be used to help decide whether
allocating memory on an RCU free path is a semi-reasonable course
of action.
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
We can relax the correctness of counting of number of queued objects in
favor of not hurting performance, by locklessly sampling per-cpu
counters. This should be Ok since under high memory pressure, it should not
matter if we are off by a few objects while counting. The shrinker will
still do the reclaim.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Remove unused "flags" variable. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
To reduce grace periods and improve kfree() performance, we have done
batching recently dramatically bringing down the number of grace periods
while giving us the ability to use kfree_bulk() for efficient kfree'ing.
However, this has increased the likelihood of OOM condition under heavy
kfree_rcu() flood on small memory systems. This patch introduces a
shrinker which starts grace periods right away if the system is under
memory pressure due to existence of objects that have still not started
a grace period.
With this patch, I do not observe an OOM anymore on a system with 512MB
RAM and 8 CPUs, with the following rcuperf options:
rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
rcuperf.kfree_rcu_test=1 rcuperf.kfree_mult=2
Otherwise it easily OOMs with the above parameters.
NOTE:
1. On systems with no memory pressure, the patch has no effect as intended.
2. In the future, we can use this same mechanism to prevent grace periods
from happening even more, by relying on shrinkers carefully.
Cc: urezki@gmail.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This allows us to increase memory pressure dynamically using a new
rcuperf boot command line parameter called 'rcumult'.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the ULONG_CMP_LT() in rcu_nohz_full_cpu() to
time_before() to reflect the fact that it is comparing a timestamp to
the jiffies counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the ULONG_CMP_GE() in rcu_initiate_boost() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the ULONG_CMP_GE() in rcu_gp_fqs_loop() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Coccinelle reports a warning at use_softirq declaration
WARNING: Assignment of 0/1 to bool variable
The root cause is
use_softirq a variable of bool type is initialised with the integer 1
Replacing 1 with value true solve the issue.
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Coccinelle reports warnings at rcu_read_lock_held_common()
WARNING: Assignment of 0/1 to bool variable
To fix this,
the assigned pointer ret values are replaced by corresponding boolean value.
Given that ret is a pointer of bool type
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_state structure's gp_seq field is only to be modified by the RCU
grace-period kthread, which is single-threaded. This commit therefore
enlists KCSAN's help in enforcing this restriction. This commit applies
KCSAN-specific primitives, so cannot go upstream until KCSAN does.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit escapes *ret, because otherwise the documentation system
thinks that this is an incomplete emphasis block:
./kernel/rcu/update.c:65: WARNING: Inline emphasis start-string without end-string.
./kernel/rcu/update.c:65: WARNING: Inline emphasis start-string without end-string.
./kernel/rcu/update.c:70: WARNING: Inline emphasis start-string without end-string.
./kernel/rcu/update.c:82: WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
It is possible that an over-long grace period will end while the RCU
CPU stall warning message is printing. In this case, the estimate of
the offending grace period's duration can be erroneous due to refetching
of rcu_state.gp_start, which will now be the time of the newly started
grace period. Computation of this duration clearly needs to use the
start time for the old over-long grace period, not the fresh new one.
This commit avoids such errors by causing both print_other_cpu_stall() and
print_cpu_stall() to reuse the value previously fetched by their caller.
Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Even if some CPUs have excessive numbers of callbacks, RCU's grace-period
kthread will still wait normally between successive force-quiescent-state
scans. The first two are the most important, as they are the ones that
enlist aid from the scheduler when overloaded. This commit therefore
omits the wait before the first and the second force-quiescent-state
scan under callback-overload conditions.
This approach was inspired by a discussion with Jeff Roberson.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Although the accesses used to determine whether or not a stall should
be printed are an integral part of the concurrency algorithm governing
use of the corresponding variables, the values that are simply printed
are ancillary. As such, it is best to use data_race() for these accesses
in order to provide the greatest latitude in the use of KCSAN for the
other accesses that are an integral part of the algorithm. This commit
therefore changes the relevant uses of READ_ONCE() to data_race().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The srcu_data structure's ->srcu_lock_count and ->srcu_unlock_count arrays
are read and written locklessly, so this commit adds the data_race()
to the diagnostic-print loads from these arrays in order mark them as
known and approved data-racy accesses.
This data race was reported by KCSAN. Not appropriate for backporting due
to failure being unlikely and due to this being used only by rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the READ_ONCE() to one load in order to avoid destructive
compiler optimizations. The other load is from a diagnostic print,
so data_race() suffices.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are lockless loads from the rcu_node structure's ->exp_tasks field,
so this commit causes all stores to use WRITE_ONCE() and all lockless
loads to use READ_ONCE() or data_race(), with the latter for debug
prints. This code also did a unprotected traversal of the linked list
pointed into by ->exp_tasks, so this commit also acquires the rcu_node
structure's ->lock to properly protect this traversal. This list was
traversed unprotected only when printing an RCU CPU stall warning for
an expedited grace period, so the odds of seeing this in production are
not all that high.
This data race was reported by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_state structure's ncpus field is only to be modified by the
CPU-hotplug CPU-online code path, which is single-threaded. This
commit therefore enlists KCSAN's help in enforcing this restriction.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently the kernel threads are not frozen in software_resume(), so
between dpm_suspend_start(PMSG_QUIESCE) and resume_target_kernel(),
system_freezable_power_efficient_wq can still try to submit SCSI
commands and this can cause a panic since the low level SCSI driver
(e.g. hv_storvsc) has quiesced the SCSI adapter and can not accept
any SCSI commands: https://lkml.org/lkml/2020/4/10/47
At first I posted a fix (https://lkml.org/lkml/2020/4/21/1318) trying
to resolve the issue from hv_storvsc, but with the help of
Bart Van Assche, I realized it's better to fix software_resume(),
since this looks like a generic issue, not only pertaining to SCSI.
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull pid leak fix from Eric Biederman:
"Oleg noticed that put_pid(thread_pid) was not getting called when proc
was not compiled in.
Let's get that fixed before 5.7 is released and causes problems for
anyone"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Put thread_pid in release_task not proc_flush_pid
- an uclamp accounting fix
- three frequency invariance fixes and a readability improvement
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl6kAS8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g0jg//Su8CnWMzOvs4mzqUvzb3OHV6PeQ8BZva
yI0h8z8V3S33LchwXjb6FI4VulGaGPwLD7tPZtdAdz+wTCnrhw5Rlgfk2thjROW2
XnWxUACZrAbcH5H88PF0rVp3Z8oaygwlFZCUhvJxLUOgVi4oipNr+0ZNZVATGwO+
wpCheVKIty6hlMRmNamUDNOB15xFRvXGSQ+kn0N2h/XIvke5f89AJ7uOgTuZ42Ne
m1vkgX7J29yieLt4yY6odgxlcqlFNAKegpzaadWkEPNOyqkG29pOKwBX7LpuDRVS
8jd5vo65snKDEQuBkG/CfActJR3GpNT5CVx8wzft7nDb81sEPEPb5sCCURMv5Ig3
UpEpvzCqYokC2z+ourjtugvilmHq6odwW9XbD/a8i24X+fo13oPg72EVMF7+PLuL
xZZfhxuQ3hcUGB+H83COEiA8XsNcFxCk7hcHhPyPZeagbGWUTozrwRn/JItAp5Bv
xkKmqefOu0bZYDGKwP5fMJZ5BmNiWKNw6PyH7lfzL8Ve6dKXlMWMre5cwEUJXPUe
scpPjvokvZo7C4FTy8U/7cAZlmVy27Y9Ljyf9nROWLp4KewznP3FBdGEv2+IE8uu
m90vpYXv3Y/4JsLyxg+MMkhnOb26e8vFh6roWQxtPluhWyFlTilmBYovLmFz2RBO
eqpS9MVBuRs=
=5Wor
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes:
- an uclamp accounting fix
- three frequency invariance fixes and a readability improvement"
* tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix reset-on-fork from RT with uclamp
x86, sched: Move check for CPU type to caller function
x86, sched: Don't enable static key when starting secondary CPUs
x86, sched: Account for CPUs with less than 4 cores in freq. invariance
x86, sched: Bail out of frequency invariance if base frequency is unknown
- fix exit event records
- extend x86 PMU driver enumeration to add Intel Jasper Lake CPU support.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl6j/jURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hpjg/+L0FTvxUj42qaknksSGhPrIa6aKNSfsHe
zSG4WsN59E8XUOuB9SV3D/VDJ7rX7w8jEub/TpWqLJll+UiPE3B0oLOwlVhX+8nc
fDfIqONA6ZoKvfMG9HUXPUo1Lr2+RtcF06hZFWx56g2ijqe5da1CdniJUv1YPb5s
K4HFU6Xuitih5QrYoLOiATQdp0/W1cjG36irF4svmjrxXotmeWsp7SDlAsAP2hgw
D7VchYMVs2bWtVL4GyBkq/+EOhvHwp5PTjF3yz7Scy9+CFitL9Bp5DGkaBolpszK
mUKwstXjbePw28r0jlfLJaptti2ZTFd5r4Ywqyh374ct8JbsdLR77v9Uo6M9VHu8
9la9cmz+KUnTtz2Bl2dkj2DClh+p4k9VbRjTyrAIobo6WbX8hTYKJmLn/ehvOWqU
nPJL1bRtkx4s4tIS+oXnVOOSYdcWEvcbduxmuh1eRP3wlDb06AUZCR6JaUErd6bd
oYFwrZg9vncscKEQM7boQI8f6PhwloZFnGPbPdXgCNarFdCEugCc57s2NvG4BUFo
WykXYcOGxoANO3F478e/h52cjOoZNk0trdlAk9z957GcI+g+xwoMXWZf8nMUrlWx
qrtJmEJlde5sN+5j1X75wjJovOIXBmZH9yXWYgZ+kkmASpqeKQvRkbH+eZBBDBuU
EHNMzUenU0o=
=pEo5
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Two changes:
- fix exit event records
- extend x86 PMU driver enumeration to add Intel Jasper Lake CPU
support"
* tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: fix parent pid/tid in task exit events
perf/x86/cstate: Add Jasper Lake CPU support
Pull networking fixes from David Miller:
1) Fix memory leak in netfilter flowtable, from Roi Dayan.
2) Ref-count leaks in netrom and tipc, from Xiyu Yang.
3) Fix warning when mptcp socket is never accepted before close, from
Florian Westphal.
4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.
5) Fix large delays during PTP synchornization in cxgb4, from Rahul
Lakkireddy.
6) team_mode_get() can hang, from Taehee Yoo.
7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
from Niklas Schnelle.
8) Fix handling of bpf XADD on BTF memory, from Jann Horn.
9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.
10) Missing queue memory release in iwlwifi pcie code, from Johannes
Berg.
11) Fix NULL deref in macvlan device event, from Taehee Yoo.
12) Initialize lan87xx phy correctly, from Yuiko Oshino.
13) Fix looping between VRF and XFRM lookups, from David Ahern.
14) etf packet scheduler assumes all sockets are full sockets, which is
not necessarily true. From Eric Dumazet.
15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.
16) fib_select_default() needs to handle nexthop objects, from David
Ahern.
17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.
18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.
19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.
20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
from Luke Nelson.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
selftests/bpf: Fix a couple of broken test_btf cases
tools/runqslower: Ensure own vmlinux.h is picked up first
bpf: Make bpf_link_fops static
bpftool: Respect the -d option in struct_ops cmd
selftests/bpf: Add test for freplace program with expected_attach_type
bpf: Propagate expected_attach_type when verifying freplace programs
bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
bpf, x86_32: Fix logic error in BPF_LDX zero-extension
bpf, x86_32: Fix clobbering of dst for BPF_JSET
bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
bpf: Fix reStructuredText markup
net: systemport: suppress warnings on failed Rx SKB allocations
net: bcmgenet: suppress warnings on failed Rx SKB allocations
macsec: avoid to set wrong mtu
mac80211: sta_info: Add lockdep condition for RCU list usage
mac80211: populate debugfs only after cfg80211 init
net: bcmgenet: correct per TX/RX ring statistics
net: meth: remove spurious copyright text
net: phy: bcm84881: clear settings on link down
chcr: Fix CPU hard lockup
...
Fix the following sparse warning:
kernel/bpf/syscall.c:2289:30: warning: symbol 'bpf_link_fops' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1587609160-117806-1-git-send-email-zou_wei@huawei.com
For some program types, the verifier relies on the expected_attach_type of
the program being verified in the verification process. However, for
freplace programs, the attach type was not propagated along with the
verifier ops, so the expected_attach_type would always be zero for freplace
programs.
This in turn caused the verifier to sometimes make the wrong call for
freplace programs. For all existing uses of expected_attach_type for this
purpose, the result of this was only false negatives (i.e., freplace
functions would be rejected by the verifier even though they were valid
programs for the target they were replacing). However, should a false
positive be introduced, this can lead to out-of-bounds accesses and/or
crashes.
The fix introduced in this patch is to propagate the expected_attach_type
to the freplace program during verification, and reset it after that is
done.
Fixes: be8704ff07 ("bpf: Introduce dynamic program extensions")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk
Fix bug of not putting bpf_link in LINK_UPDATE command.
Also enforce zeroed old_prog_fd if no BPF_F_REPLACE flag is specified.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200424052045.4002963-1-andriin@fb.com
Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.
Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.
When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Two fixes that fix memory leaks detected by kmemleak
- Removal of some dead code
- A few local functions turned to static
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXqIivBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qswYAQDEH+T80JHD1XBPpqWw6JBKvPph7moz
AsjasFiX3d5T2AD+JvNMpZntTtZPWz8+V+RqbU7EcBFD9qCNIxaZXaECOAw=
=Cm2g
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few tracing fixes:
- Two fixes for memory leaks detected by kmemleak
- Removal of some dead code
- A few local functions turned static"
* tag 'trace-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Convert local functions in tracing_map.c to static
tracing: Remove DECLARE_TRACE_NOARGS
ftrace: Fix memory leak caused by not freeing entry in unregister_ftrace_direct()
tracing: Fix memory leaks in trace_events_hist.c
Pull SIGCHLD fix from Eric Biederman:
"Christof Meerwald reported that do_notify_parent has not been
successfully populating si_pid and si_uid for multi-threaded
processes.
This is the one-liner fix. Strictly speaking a one-liner plus
comment"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Avoid corrupting si_pid and si_uid in do_notify_parent
Fix the following sparse warning:
kernel/trace/tracing_map.c:286:6: warning: symbol
'tracing_map_array_clear' was not declared. Should it be static?
kernel/trace/tracing_map.c:297:6: warning: symbol
'tracing_map_array_free' was not declared. Should it be static?
kernel/trace/tracing_map.c:319:26: warning: symbol
'tracing_map_array_alloc' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20200410073312.38855-1-yanaijie@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
uclamp_fork() resets the uclamp values to their default when the
reset-on-fork flag is set. It also checks whether the task has a RT
policy, and sets its uclamp.min to 1024 accordingly. However, during
reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
hence leading to an erroneous uclamp.min setting for the new task if it
was forked from RT.
Fix this by removing the unnecessary check on rt_task() in
uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
set.
Fixes: 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks")
Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
kernel + tools/perf:
Alexey Budankov:
- Introduce CAP_PERFMON to kernel and user space.
callchains:
Adrian Hunter:
- Allow using Intel PT to synthesize callchains for regular events.
Kan Liang:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
perf script:
Andreas Gerstmayr:
- Add flamegraph.py script
BPF:
Jiri Olsa:
- Synthesize bpf_trampoline/dispatcher ksymbol events.
perf stat:
Arnaldo Carvalho de Melo:
- Honour --timeout for forked workloads.
Stephane Eranian:
- Force error in fallback on :k events, to avoid counting nothing when
the user asks for kernel events but is not allowed to.
perf bench:
Ian Rogers:
- Add event synthesis benchmark.
tools api fs:
Stephane Eranian:
- Make xxx__mountpoint() more scalable
libtraceevent:
He Zhe:
- Handle return value of asprintf.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXp2LlQAKCRCyPKLppCJ+
J95oAP0ZihVUhESv/gdeX0IDE5g6Rd2V6LNcRj+jb7gX9NlQkwD/UfS454WV1ftQ
qTwrkKPzY/5Tm2cLuVE7r7fJ6naDHgU=
=FHm4
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-5.8-20200420' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core fixes and improvements from Arnaldo Carvalho de Melo:
kernel + tools/perf:
Alexey Budankov:
- Introduce CAP_PERFMON to kernel and user space.
callchains:
Adrian Hunter:
- Allow using Intel PT to synthesize callchains for regular events.
Kan Liang:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
perf script:
Andreas Gerstmayr:
- Add flamegraph.py script
BPF:
Jiri Olsa:
- Synthesize bpf_trampoline/dispatcher ksymbol events.
perf stat:
Arnaldo Carvalho de Melo:
- Honour --timeout for forked workloads.
Stephane Eranian:
- Force error in fallback on :k events, to avoid counting nothing when
the user asks for kernel events but is not allowed to.
perf bench:
Ian Rogers:
- Add event synthesis benchmark.
tools api fs:
Stephane Eranian:
- Make xxx__mountpoint() more scalable
libtraceevent:
He Zhe:
- Handle return value of asprintf.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Christof Meerwald <cmeerw@cmeerw.org> writes:
> Hi,
>
> this is probably related to commit
> 7a0cf09494 (signal: Correct namespace
> fixups of si_pid and si_uid).
>
> With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
> properly set si_pid field - this seems to happen for multi-threaded
> child processes.
>
> A simple test program (based on the sample from the signalfd man page):
>
> #include <sys/signalfd.h>
> #include <signal.h>
> #include <unistd.h>
> #include <spawn.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #define handle_error(msg) \
> do { perror(msg); exit(EXIT_FAILURE); } while (0)
>
> int main(int argc, char *argv[])
> {
> sigset_t mask;
> int sfd;
> struct signalfd_siginfo fdsi;
> ssize_t s;
>
> sigemptyset(&mask);
> sigaddset(&mask, SIGCHLD);
>
> if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
> handle_error("sigprocmask");
>
> pid_t chldpid;
> char *chldargv[] = { "./sfdclient", NULL };
> posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
>
> sfd = signalfd(-1, &mask, 0);
> if (sfd == -1)
> handle_error("signalfd");
>
> for (;;) {
> s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
> if (s != sizeof(struct signalfd_siginfo))
> handle_error("read");
>
> if (fdsi.ssi_signo == SIGCHLD) {
> printf("Got SIGCHLD %d %d %d %d\n",
> fdsi.ssi_status, fdsi.ssi_code,
> fdsi.ssi_uid, fdsi.ssi_pid);
> return 0;
> } else {
> printf("Read unexpected signal\n");
> }
> }
> }
>
>
> and a multi-threaded client to test with:
>
> #include <unistd.h>
> #include <pthread.h>
>
> void *f(void *arg)
> {
> sleep(100);
> }
>
> int main()
> {
> pthread_t t[8];
>
> for (int i = 0; i != 8; ++i)
> {
> pthread_create(&t[i], NULL, f, NULL);
> }
> }
>
> I tried to do a bit of debugging and what seems to be happening is
> that
>
> /* From an ancestor pid namespace? */
> if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
>
> fails inside task_pid_nr_ns because the check for "pid_alive" fails.
>
> This code seems to be called from do_notify_parent and there we
> actually have "tsk != current" (I am assuming both are threads of the
> current process?)
I instrumented the code with a warning and received the following backtrace:
> WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
> Modules linked in:
> CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
> Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd
> RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
> RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
> RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
> R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
> FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
> Call Trace:
> send_signal+0x1c8/0x310
> do_notify_parent+0x50f/0x550
> release_task.part.21+0x4fd/0x620
> do_exit+0x6f6/0xaf0
> do_group_exit+0x42/0xb0
> get_signal+0x13b/0xbb0
> do_signal+0x2b/0x670
> ? __audit_syscall_exit+0x24d/0x2b0
> ? rcu_read_lock_sched_held+0x4d/0x60
> ? kfree+0x24c/0x2b0
> do_syscall_64+0x176/0x640
> ? trace_hardirqs_off_thunk+0x1a/0x1c
> entry_SYSCALL_64_after_hwframe+0x49/0xb3
The immediate problem is as Christof noticed that "pid_alive(current) == false".
This happens because do_notify_parent is called from the last thread to exit
in a process after that thread has been reaped.
The bigger issue is that do_notify_parent can be called from any
process that manages to wait on a thread of a multi-threaded process
from wait_task_zombie. So any logic based upon current for
do_notify_parent is just nonsense, as current can be pretty much
anything.
So change do_notify_parent to call __send_signal directly.
Inspecting the code it appears this problem has existed since the pid
namespace support started handling this case in 2.6.30. This fix only
backports to 7a0cf09494 ("signal: Correct namespace fixups of si_pid and si_uid")
where the problem logic was moved out of __send_signal and into send_signal.
Cc: stable@vger.kernel.org
Fixes: 6588c1e3ff ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
Ref: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
Reported-by: Christof Meerwald <cmeerw@cmeerw.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
check_xadd() can cause check_ptr_to_btf_access() to be executed with
atype==BPF_READ and value_regno==-1 (meaning "just check whether the access
is okay, don't tell me what type it will result in").
Handle that case properly and skip writing type information, instead of
indexing into the registers at index -1 and writing into out-of-bounds
memory.
Note that at least at the moment, you can't actually write through a BTF
pointer, so check_xadd() will reject the program after calling
check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the
verifier has already corrupted memory.
This patch assumes that BTF pointers are not available in unprivileged
programs.
Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com
When check_xadd() verifies an XADD operation on a pointer to a stack slot
containing a spilled pointer, check_stack_read() verifies that the read,
which is part of XADD, is valid. However, since the placeholder value -1 is
passed as `value_regno`, check_stack_read() can only return a binary
decision and can't return the type of the value that was read. The intent
here is to verify whether the value read from the stack slot may be used as
a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and
the type information is lost when check_stack_read() returns, this is not
enforced, and a malicious user can abuse XADD to leak spilled kernel
pointers.
Fix it by letting check_stack_read() verify that the value is usable as a
SCALAR_VALUE if no type information is passed to the caller.
To be able to use __is_pointer_value() in check_stack_read(), move it up.
Fix up the expected unprivileged error message for a BPF selftest that,
until now, assumed that unprivileged users can use XADD on stack-spilled
pointers. This also gives us a test for the behavior introduced in this
patch for free.
In theory, this could also be fixed by forbidding XADD on stack spills
entirely, since XADD is a locked operation (for operations on memory with
concurrency) and there can't be any concurrency on the BPF stack; but
Alexei has said that he wants to keep XADD on stack slots working to avoid
changes to the test suite [1].
The following BPF program demonstrates how to leak a BPF map pointer as an
unprivileged user using this bug:
// r7 = map_pointer
BPF_LD_MAP_FD(BPF_REG_7, small_map),
// r8 = launder(map_pointer)
BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_1, 0),
((struct bpf_insn) {
.code = BPF_STX | BPF_DW | BPF_XADD,
.dst_reg = BPF_REG_FP,
.src_reg = BPF_REG_1,
.off = -8
}),
BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8),
// store r8 into map
BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7),
BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN()
[1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code
can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This
happens because in this configuration, NR_CPUS can be larger than
nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient
to guard against hitting the warning in cpumask_check().
Fix this by explicitly checking the supplied key against the
nr_cpumask_bits variable before calling cpu_possible().
Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Xiumei Mu <xmu@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com
Commit 7561252892 ("audit: always check the netlink payload length
in audit_receive_msg()") fixed a number of missing message length
checks, but forgot to check the length of userspace generated audit
records. The good news is that you need CAP_AUDIT_WRITE to submit
userspace audit records, which is generally only given to trusted
processes, so the impact should be limited.
Cc: stable@vger.kernel.org
Fixes: 7561252892 ("audit: always check the netlink payload length in audit_receive_msg()")
Reported-by: syzbot+49e69b4d71a420ceda3e@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
instead of clockid numbers. The usability nuisance of numbers was noticed
by Michael when polishing the man page.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cVQsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWBjEAC0dCUHKDLoG0FeyG4tb4FEBW2iTqM8
UFirH26K18s8QSePdvfJlaxtN2SdfNZG7UgYN7wz1fDFQy05zTz7Rek8UrDuu3rh
mVph/UZtUJl+6ypW2Lw9x5RWpT5yzay2iowUyBPnNxU9F/0uRKvXQFju3L83Lo/z
Z4ni7gVEw87dQi5E74tEv6iaydgPuCBpGxoMahotnHyclqMjA0QuAK6nhN5ZTcAn
senoorS/VqkSF5qEvIUwe7+F+kkMbwQryT7merJyNwh/F49xTTXRyBmiys1MF8Og
MTEvldXKy2pCh2UfRa/x84WWwOUVNivTXdIXjhalsblczL0j1z9MsQ8b3AOXOiLf
S+/Ntbb2dGo4qE22jekMwZ54Pm4x5NzChCU8+3pvd6IrPWZKi6vue74Kd0RNHQg/
0kWOlZnIP2ArVW0bFqV6jhMYkjmVdK6gm7cUpFV66L2H8zbfFuc4OlxJYEFYivye
9Yck+rFQmMwA15ZXYIpggkd7Rf/5CGF1CiMBAvP/ILubpgbJqnn6/tGByq8tDKdy
mqXX+NHF0M/7rJd5vr7wP6p3E5nQ9l/41rh9ii9EDLXf4jsWVO3EyobJ7fFHwprs
5tTWGxVJymUQLq/LQPXOVVENGK+ZsXXNGn/4n8IOVroeypxADTGyhtSh122kFFhv
jPcVHqpBUd0g4Q==
=slEk
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time namespace fix from Thomas Gleixner:
"An update for the proc interface of time namespaces: Use symbolic
names instead of clockid numbers. The usability nuisance of numbers
was noticed by Michael when polishing the man page"
* tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
- Remove setup_irq() and remove_irq(). All users have been converted so
remove them before new users surface.
- A set of bugfixes for various interrupt chip drivers
- Add a few missing static attributes to address sparse warnings.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cUuMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYi7EACOFPrwdOlKqDdgU1FGReEzhJeNSSyH
yUp1m2nNckz8Y2B+ihnLsfvcktZSXYRuDTZ/u/rmaKqq2wH5Q/h4DNQxEfoMNUep
IVBlvAFcGsvpdSbrlc+nx6sEo0K2b22AQVHdyPECiQYFZJikstAtEfzEv+ZaUr2S
Lcds295BIQylbugQpcVZL73j6iUKQ+P5YU0Wlkd/Vhlnxe9UdMd/N1P3GoRaRlOa
QxYDJCnZJjWkN+cEVRCAZVTat6pd3zaMHvEabI39Lzx4U+nu4vh62TILwk+wdpuA
DzgA+ENFXzv2zLlnF8gB0wKWw3J99No9gfRpuK/vWBQ68UeZsPlM5PKEr93oD4cC
To9D70r71UM+LS+Km8ciFlqeT4N+hIMb/x8rpIf5Tcfn5spXjNEhR4U6/d/D2ZYy
cQiu82th9kSOMGBhlrfkJ0gAT20UfAktDHU1M4JhwI5Y/DLusb6mfg0CRMj8ucOV
0xrKkgHxhX162oRTKzy5OTMWQRGTvIQZg1QE3xxtrT2qCq4ypu0EHQbh3GdfcIVQ
8n+s/Dde6etmbSwDDdEuRi///zM+hvaiXi5KOV28LYgRDbU78cAX8uRgX9sq2pg+
WxK9ulprkW6Ci1yTts9Q6FY+ZBekg7NBKXXDCJdPwXxTLRrdci68pPZip12AaWxP
2HYxWhE8LvmKAw==
=jaX5
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes/updates for the interrupt subsystem:
- Remove setup_irq() and remove_irq(). All users have been converted
so remove them before new users surface.
- A set of bugfixes for various interrupt chip drivers
- Add a few missing static attributes to address sparse warnings"
* tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-bcm7038-l1: Make bcm7038_l1_of_init() static
irqchip/irq-mvebu-icu: Make legacy_bindings static
irqchip/meson-gpio: Fix HARDIRQ-safe -> HARDIRQ-unsafe lock order
irqchip/sifive-plic: Fix maximum priority threshold value
irqchip/ti-sci-inta: Fix processing of masked irqs
irqchip/mbigen: Free msi_desc on device teardown
irqchip/gic-v4.1: Update effective affinity of virtual SGIs
irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid signaling
genirq: Remove setup_irq() and remove_irq()
- Work around an uninitializaed variable warning where GCC can't figure it
out.
- Allow 'isolcpus=' to skip unknown subparameters so that older kernels
work with the commandline of a newer kernel. Improve the error output
while at it.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cVFwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZAaD/9i9QgLuj1Ka59kNPAs68i5Kjar72VS
us1dM2n0Tx6lIUEYsdJsu4GTRi5NEBqLbmwSgsXROnhI6Jd17hHp5JViezk1GZWc
Zg2uARAj9Jsqh2q5IjriNOwzq47PDC4dmSUzaecJff8PqGkk9Lpry6qvx3A02uSn
tVVQAXqwCbPTaQzuhEf/q6mbfRaO90tVqGdseD+1wE0FBFfPLwddegLEGhL1vYsA
55UhpKCGsS9lrfmgkxk1Xb3e0pJBObiV0SXdn2qHqJTpVUaDTZzsWgJHXg+0Fe1V
0ZsuGfmaaisYPBZmqRo4HALbkgnvVECSbp7FAnhvqiQMyNaciiwkkFv9Ap5+aayf
c8wXz/emAmuEMNzipovyFUITCIOs6IL1CkESsbO8Bgx9sTHO+pcgNEYrsX1953UC
45GjhXR3ymnclqsVqfMWIcNRukk0g9W38yp1DgA5IIhVz1rHogEquD9F10qsCGb1
FgSOnyGlU0I0JR5tEfqR0TeCuqLGKB2NvnEgLU4OVpsdEC5ac87uvzWEZuOmR5Z4
vQCkps1z1ABW5fB/kFO83OiA5BZfDGnq5Vvh6XDOv6EeWjhIXrolu6VeTYpBSInq
w0oNpsaA9wsy7WIy1RJ8jtSNsgS8fULCE5lUBtFeSUY/T7IcWd0lwnTlL97A4qzg
GdYVT/UAHLCzCA==
=AKgh
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"Two fixes for the scheduler:
- Work around an uninitialized variable warning where GCC can't
figure it out.
- Allow 'isolcpus=' to skip unknown subparameters so that older
kernels work with the commandline of a newer kernel. Improve the
error output while at it"
* tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/vtime: Work around an unitialized variable warning
sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cUf8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofFtD/92Ufh1t+1uSe/txUJ/lTdY98cpcD5i
UZJYdF102VLkhwA3Ors0Udtrnvuyxb5FFSfJZ3/N7tsNaXgcz1QOQsqoZz4SSgLJ
pPj+2v+LNI6rIWDXuwszbLM/nT1mJGAK9NQ6+AhvIyBMPbht4/Lajsv7vkFMTzXw
8o+a5NqLKWDca7/eLcgbcMiSsulRZxRld0D7MSP7RBeEWeylt31q3JIBp7ldzJ77
0KCdQEd3TAkP+hOZW98CNgcLgGtCxiOJ5EgjUOFJyD/+5mj219czKF53HXnn4amk
5lmdzSB0RKV2JTNFKNZQOobgMPp8VIIf6R6QlDp5MdrGYWTIbV0p5Hak2u41Cyma
BfxxkVZJipjC7mgAvZLgy0/Md7n2Eu5uAW0e72AYEmv7IwOGyHh9YL7IYiZQld67
5q8xEgrIIpaCwscVjZqP3+GHc8KGWYuv8puMCeOk1v7UeWsRlc8j/eGWIXnY4E8v
wvqWAB91dHlBh+CHaFtdy97buYinVzW/Tv2NTxLFKgxyGJTg82H2hdTpjRVYi5Z4
DM2NQRLcD1ozvh8tsFzXWP+/uemlE+EUPBZofCjJ0WtzH1GWarf3YNqviFqldRLr
GyEFoyIc3Ra/hTEzD9yCC0UWwJubhAVLWOuu9pJKuaaei8s1aiusABQGbz5sNG9l
AcpB9KFMtsLAXA==
=QjbA
-----END PGP SIGNATURE-----
Merge tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fix from Thomas Gleixner:
"A single bugfix for RCU to prevent taking a lock in NMI context"
* tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()
Use smp_call_func_t instead of the open coded function pointer argument.
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20200417162451.91969-1-pilgrimtao@gmail.com
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXprWIAAKCRCRxhvAZXjc
omUyAQCQcvJQhilLv0b7FtBAbN7+TkzV8vAQTzEITuHPa6m/HwEA2Gp9ZDTJfQbV
T6utOrTm/LT0mfBkiDLSnLPtVzh7mgE=
=Jz3d
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread fixes from Christian Brauner:
"A few fixes and minor improvements:
- Correctly validate the cgroup file descriptor when clone3() is used
with CLONE_INTO_CGROUP.
- Check that a new enough version of struct clone_args is passed
which supports the cgroup file descriptor argument when
CLONE_INTO_CGROUP is set in the flags argument.
- Catch nonsensical struct clone_args layouts at build time.
- Catch extensions of struct clone_args without updating the uapi
visible size definitions at build time.
- Check whether the signal is valid early in kill_pid_usb_asyncio()
before doing further work.
- Replace open-coded rcu_read_lock()+kill_pid_info()+rcu_read_unlock()
sequence in kill_something_info() with kill_proc_info() which is a
dedicated helper to do just that"
* tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks
clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set
clone3: fix cgroup argument sanity check
signal: use kill_proc_info instead of kill_pid_info in kill_something_info
signal: check sig before setting info in kill_pid_usb_asyncio
Pull networking fixes from David Miller:
1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.
2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.
3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.
4) Adjust icmp6 message source address selection when routes have a
preferred source address set, from Tim Stallard.
5) Be sure to validate HSR protocol version when creating new links,
from Taehee Yoo.
6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
non-initial namespaces, from Michael Weiß.
7) Missing release firmware call in mlx5, from Eran Ben Elisha.
8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
Taehee Yoo.
9) Fix pause frame negotiation in marvell phy driver, from Clemens
Gruber.
10) Record RX queue early enough in tun packet paths such that XDP
programs will see the correct RX queue index, from Gilberto Bertin.
11) Fix double unlock in mptcp, from Florian Westphal.
12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.
13) marvell10g needs to soft reset PHY when coming out of low power
mode, from Russell King.
14) Fix MTU setting regression in stmmac for some chip types, from
Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
amd-xgbe: Use __napi_schedule() in BH context
mISDN: make dmril and dmrim static
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
tipc: fix incorrect increasing of link window
Documentation: Fix tcp_challenge_ack_limit default value
net: tulip: make early_486_chipsets static
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
ipv6: remove redundant assignment to variable err
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
selftests/bpf: Check for correct program attach/detach in xdp_attach test
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
libbpf: Always specify expected_attach_type on program load if supported
xsk: Add missing check on user supplied headroom size
mac80211: fix channel switch trigger from unknown mesh peer
mac80211: fix race in ieee80211_register_hw()
net: marvell10g: soft-reset the PHY when coming out of low power
net: marvell10g: report firmware version
net/cxgb4: Check the return from t4_query_params properly
...
Open access to bpf_trace monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged with respect to
CAP_PERFMON capability.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/c0a0ae47-8b6e-ff3e-416b-3cd1faaf71c0@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Open access to monitoring via kprobes and uprobes and eBPF tracing for
CAP_PERFMON privileged process. Providing the access under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials,
excludes chances to misuse the credentials and makes operation more
secure.
perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via perf_event_open interface
and then the probes are used in eBPF tracing.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Cc: linux-man@vger.kernel.org
Link: http://lore.kernel.org/lkml/3c129d9a-ba8a-3483-ecc5-ad6c8e7c203f@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Open access to monitoring of kernel code, CPUs, tracepoints and
namespaces data for a CAP_PERFMON privileged process. Providing the
access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons the access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: linux-man@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/471acaef-bb8a-5ce2-923f-90606b78eef9@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Michael Kerrisk suggested to replace numeric clock IDs with symbolic names.
Now the content of these files looks like this:
$ cat /proc/774/timens_offsets
monotonic 864000 0
boottime 1728000 0
For setting offsets, both representations of clocks (numeric and symbolic)
can be used.
As for compatibility, it is acceptable to change things as long as
userspace doesn't care. The format of timens_offsets files is very new and
there are no userspace tools yet which rely on this format.
But three projects crun, util-linux and criu rely on the interface of
setting time offsets and this is why it's required to continue supporting
the numeric clock IDs on write.
Fixes: 04a8682a71 ("fs/proc: Introduce /proc/pid/timens_offsets")
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
saved_max_pfn was originally introduced in commit
92aa63a5a1 ("[PATCH] kdump: Retrieve saved max pfn")
It used to make sure that the user does not try to read the physical memory
beyond saved_max_pfn. But since commit
921d58c0e6 ("vmcore: remove saved_max_pfn check")
it's no longer used for the check. This variable doesn't have any users
anymore so just remove it.
[ bp: Drop the Calgary IOMMU reference from the commit message. ]
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/20200330181544.1595733-1-kasong@redhat.com
Work around this warning:
kernel/sched/cputime.c: In function ‘kcpustat_field’:
kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized]
because GCC can't see that val is used only when err is 0.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic
The "isolcpus=" parameter allows sub-parameters before the cpulist is
specified, and if the parser detects an unknown sub-parameters the whole
parameter will be ignored.
This design is incompatible with itself when new sub-parameters are added.
An older kernel will not recognize the new sub-parameter and will
invalidate the whole parameter so the CPU isolation will not take
effect. It emits a warning:
isolcpus: Error, unknown flag
The better and compatible way is to allow "isolcpus=" to skip unknown
sub-parameters, so that even if new sub-parameters are added an older
kernel will still be able to behave as usual even if with the new
sub-parameter specified on the command line.
Ideally this should have been there when the first sub-parameter for
"isolcpus=" was introduced.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com
CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via
the offsets of the relevant struct clone_args fields, which makes
it rather error-prone, so it probably makes sense to add some
compile-time checks for them (including the one that breaks
on struct clone_args extension as a reminder to add a relevant
size macro and a similar check). Function copy_clone_args_from_user
seems to be a good place for such checks.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Passing CLONE_INTO_CGROUP with an under-sized structure (that doesn't
properly contain cgroup field) seems like garbage input, especially
considering the fact that fd 0 is a valid descriptor.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412203123.GA5869@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Checking that cgroup field value of struct clone_args is less than 0
is useless, as it is defined as unsigned 64-bit integer. Moreover,
it doesn't catch the situations where its higher bits are lost during
the assignment to the cgroup field of the cgroup field of the internal
struct kernel_clone_args (where it is declared as signed 32-bit
integer), so it is still possible to pass garbage there. A check
against INT_MAX solves both these issues.
Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202533.GA29554@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This issue was detected by using the Coccinelle software:
kernel/bpf/verifier.c:1259:16-21: WARNING: conversion to bool not needed here
The conversion to bool is unneeded, remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1586779076-101346-1-git-send-email-zou_wei@huawei.com
VM_MAYWRITE flag during initial memory mapping determines if already mmap()'ed
pages can be later remapped as writable ones through mprotect() call. To
prevent user application to rewrite contents of memory-mapped as read-only and
subsequently frozen BPF map, remove VM_MAYWRITE flag completely on initially
read-only mapping.
Alternatively, we could treat any memory-mapping on unfrozen map as writable
and bump writecnt instead. But there is little legitimate reason to map
BPF map as read-only and then re-mmap() it as writable through mprotect(),
instead of just mmap()'ing it as read/write from the very beginning.
Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap
operations. We can just rely on VMA holding reference to BPF map's file
properly.
Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com
signal.c provides kill_proc_info, we can use it instead of kill_pid_info
in kill_something_info func gracefully.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/80236965-f0b5-c888-95ff-855bdec75bb3@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
In kill_pid_usb_asyncio, if signal is not valid, we do not need to
set info struct.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
- Fix the time_for_children symlink in /proc/$PID/ so it properly reflects
that it part of the 'time' namespace
- Add the missing userns limit for the allowed number of time namespaces,
which was half defined but the actual array member was not added. This
went unnoticed as the array has an exessive empty member at the end but
introduced a user visible regression as the output was corrupted.
- Prevent further silent ucount corruption by adding a BUILD_BUG_ON() to
catch half updated data.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TFe4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYob4PD/47Qwz2z2mEeO037VbbI2gY4yl/raFo
5KPWmnwonKrtVaYAldLutA3iaG7bBbUX5fRvbSRNTS6CJIHwwfLSx7/CeWMmIXEX
0zsBsn5QXjG89lJZXM+ot74yzjvkeoad2g0jEHv92v0WDSXFiAWhkBUwknfNFbpa
csEjkdpyn2zTVBGBzKVHWHXddkY0o0Q0JOy0EiH09rHGpQktPoLJdYp73VCygoJd
NRAXhTmBQq85RMcSB3eVTbSPpIuBUzZke9zoio7YZwEjl6bkvSqetPmTdIr57u4s
ex3PX++64EXD7r8ZW36fPGDqu6v0CH2ILK7QVhwyHAYJo2LQKVd+v25muaFrzfpn
dSG1SqabWqdIHUoW/76ORyecAFLTzGwDu07UH+6VJbXeLfmuhe/LI3hdDQFph9NQ
BOBKhaHm8aXmAmvrkxbbAikSkJYVHrAIp5abI4PSYoPaqK1DWnSPaT1cqtaIUgYL
Mk15z19V9np4lMCH2cucAlap8U9EvQEIfCRRdl+crDu17ZzGID1pwhY2DA8adqcT
SUfwzzUaykd5TZtDeIe+6G9fsgf/wbSTSSbrNGKlLXDbxx+iNVXErkmx0JXLEHV4
47cmBwQZ255DzjMfuS4HzCck2MaaP8mDWgcbszgkP+GFnkf9EAP5XNp9st937mbG
rzP+NkjNCldN9w==
=wOiC
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time(keeping) updates from Thomas Gleixner:
- Fix the time_for_children symlink in /proc/$PID/ so it properly
reflects that it part of the 'time' namespace
- Add the missing userns limit for the allowed number of time
namespaces, which was half defined but the actual array member was
not added. This went unnoticed as the array has an exessive empty
member at the end but introduced a user visible regression as the
output was corrupted.
- Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
to catch half updated data.
* tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ucount: Make sure ucounts in /proc/sys/user don't regress again
time/namespace: Add max_time_namespaces ucount
time/namespace: Fix time_for_children symlink
- Deduplicate the average computations in the scheduler core and the fair
class code.
- Fix a raise between runtime distribution and assignement which can cause
exceeding the quota by up to 70%.
- Prevent negative results in the imbalanace calculation
- Remove a stale warning in the workqueue code which can be triggered
since the call site was moved out of preempt disabled code. It's a false
positive.
- Deduplicate the print macros for procfs
- Add the ucmap values to the SCHED_DEBUG procfs output for completness
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TFEoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRoPD/9cTERNZb/aT/SVD3TntBs5bMlnRR2U
KsyoN7w2bmYGY6XbKdjWX+KorPyJ9YRcUpOfmH+6JmMioM5RrgCtAQ4cXP4q89k3
6+hdHu+WAhp2AyKr5LAYZK+NYdq0fy/9xx7fdXNri8/QakQCy1vu/u6iiKRMmrEf
R7zB7v4ddTOTKamUuGBNnmM4GwfrD7td9NksfjDqV4yJ2ZkaQ+apWAPzJK8ixfEa
TGjVnnUZA82tZRZ+O4RDN2AVgG0CuXYRYPWRfqi3XgQ3d0+ju5mM7q+6TQeEUuAq
19IoscLhmYGb9yYCY5p0j1AUkyr6FcSFr1SJt/8Jxpyw4JsKw+ANFBA3kkzEBZ+h
0VcWjUw4S6A6+0HdrqUQYjr5tl7RCY+hOE+QQ/Xvrm4PpWi0L+1tB+kgR7VKNPO8
Xqfp/CK7Rcgh2/nUHT4uxdwh4ZNsLo39QGTuahyPXzeRVJTEGUjbzktnEUVc9wmd
OfjpC0Q2DBdXx+vnzH82flv/rtUUYor/Owhpq92E6d50Z7CjTa6xRNbmQUiVKls8
jqehpRBUg5cB3TI0BKeb5+kzgeGDGpm7MfZbvSwvoL0stdJdzNwwc4nhOLjPkJFK
R5m7cg82Kx+r/00Vb7k6WgYZ10WTS2Il1OKbgwHl5uTlaKuY7TPEIRVuJQV/XUSG
88n4jIwpRbLMNw==
=BVqY
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes/updates from Thomas Gleixner:
- Deduplicate the average computations in the scheduler core and the
fair class code.
- Fix a raise between runtime distribution and assignement which can
cause exceeding the quota by up to 70%.
- Prevent negative results in the imbalanace calculation
- Remove a stale warning in the workqueue code which can be triggered
since the call site was moved out of preempt disabled code. It's a
false positive.
- Deduplicate the print macros for procfs
- Add the ucmap values to the SCHED_DEBUG procfs output for completness
* tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Add task uclamp values to SCHED_DEBUG procfs
sched/debug: Factor out printing formats into common macros
sched/debug: Remove redundant macro define
sched/core: Remove unused rq::last_load_update_tick
workqueue: Remove the warning in wq_worker_sleeping()
sched/fair: Fix negative imbalance in imbalance calculation
sched/fair: Fix race between runtime distribution and assignment
sched/fair: Align rq->avg_idle and rq->avg_scan_cost
- Fix the perf event cgroup tracking which tries to track the cgroup even
for disabled events.
- Add Ice Lake server support for uncore events
- Disable pagefaults when retrieving the physical address in the sampling
code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TEbMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYvgD/9Ufht3qrbE29oTS+q2x6PZRc01hvMW
c3wh/7Anh5yBlTioUEl1tbNfSD1k9hKqMQJF6iPZQ4tzft6JAunstICngqJrJqSY
si22mgY6QGKrA2+4UuxT7FZD5nrEhy1TrGNIOp/86Y9I59PAlrypWMxq71VUPjgB
Yy9r6eSJqNX05r9tZy1WMloJaBBieaVvEefK9ZrO4s1XM/RU5pAl+/B1XauK8XN9
e4bzu7d6r+w23pFuEqk3r4KWozafxczJAqV+w6/Me1lFgNj6m2GKq429bL/Bgj8d
Re8N4donynPe9yvuDWS4eHD1z0nhOMQhOv85seBwBcEet0PnEPtyw/eUu2zZTE2J
h1R7l8I3RIs9EmXtdoOuzLFbc8a9sw/lGQJYdP70TMEtLKPUcGkfNhwP196CRbZK
TBVub3jP0STmo8u7PrkcMb53ZBZ17P317ous52f0xgkFB6YVqv29cB+NEyabjKMI
fP9ZAaoJ9Rn7T9nji/8KsUcjpCUBhP//YWnf/Vax5PW+hkJhfrLOpesKpX1uAwzd
fc4QzEXYr/a4YI30IvqSJBsJJZfvFl9ikMaXrv5TbR8nj/eszJInho5+olQj54A7
SLzCOpVHc3jJ13d/C5Xjo0LhwSqjLv3JtvlFG34XHl+YlZUw0ua9u4Ld5vDIDYgw
mwHHFp3P/yFAfQ==
=5kqD
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Three fixes/updates for perf:
- Fix the perf event cgroup tracking which tries to track the cgroup
even for disabled events.
- Add Ice Lake server support for uncore events
- Disable pagefaults when retrieving the physical address in the
sampling code"
* tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Disable page faults when getting phys address
perf/x86/intel/uncore: Add Ice Lake server uncore support
perf/cgroup: Correct indirection in perf_less_group_idx()
perf/core: Fix event cgroup tracking
- Plug a task struct reference leak in the percpu rswem implementation.
- Document the refcount interaction with PID_MAX_LIMIT
- Improve the 'invalid wait context' data dump in lockdep so it contains
all information which is required to decode the problem
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TEJoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoeY6D/42AKqIvpAPiPA+Aod9o8o7qDCoPJQN
kwCx939+Mwrwm8YS3WJGJvR/jgoGeq33gerY93D0+jkMyhmvYeWbUIEdCenyy0+C
tkA7pN1j61D7jMqmL8PgrObhQMM8tTdNNsiNxg5qZA8wI2oFBMbAStx5mCficCyJ
8t3hh/2aRVlNN1MBjo25+jEWEXUIchAzmxnmR+5t/pRYWythOhLuWW05D8b1JoN3
9zdHj9ZScONlrK+VmpXZl41SokT385arMXcmUuS+zPx6zl4P6yTgeHMLsm4Kt3+j
1q0hm4fUEEW2IrF1Uc9pxtJh/39UVwvX8Wfj/wJLm3b1akmO+3tQ2cpiee4NHk2H
WogvBNGG4QeGMrRkg0seiqUE6SOaNPXdGRszPc6EkzwpOywDjHVYV5qbII1VOpbK
z/TJrtv8g8ejgAvyG0dEtEDydm2SqPAIAZFXj9JQ0I/gUZKTW5rvGQWdMXW6HPif
wdJJ6xb4DC3f5DrBrXgpkMSdwfea2boIoLDMf9tCcdlRs6vmvY3iRLHy2xBLcRFt
9yVDUxO40eJUxztbFKuZsQvgJyBxZubyK+Yo9CCvWQ95GfIr2JurIJEcD2fxVyRU
57/Avt3zt7iViZ9LHJnL4JnWRxftWOdWusF2dsufmnuveCVhtA5B5bl7nGntpIT2
mfJEDrJ7zzhgyA==
=Agrh
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Three small fixes/updates for the locking core code:
- Plug a task struct reference leak in the percpu rswem
implementation.
- Document the refcount interaction with PID_MAX_LIMIT
- Improve the 'invalid wait context' data dump in lockdep so it
contains all information which is required to decode the problem"
* tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Improve 'invalid wait context' splat
locking/refcount: Document interaction with PID_MAX_LIMIT
locking/percpu-rwsem: Fix a task_struct refcount
- fix an integer truncation in dma_direct_get_required_mask
(Kishon Vijay Abraham)
- fix the display of dma mapping types (Grygorii Strashko)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl6Rfy8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPT3w/+MKNVGmyi6nSsruskr1CkLCah5GkHrDA4v+vwwkFZ
zIeB7df/EOijzfYHlhEkiQWBcPvl9wmVBhFsuaY9XMQ/fJv06CpR5J2xtb+uZN9L
HmfwCJPZwrHDqsonIiOtIWjyIOARLGC0CfBP+DCPJ/GDG9wys7eyfESh85PQEDvO
QA4sPjJHuKeTTH2Q6QLM/aiqgr80Q4xrQLCVcgDAqAvqnCYi/E4HXty4pLNGpvJu
5O2xzmbQT2l3gRXjxt5Ct4pol+nFmiWJ0sBDNVNUqge7vebBiu3e0VjEaeXUPAMq
hQSrof2x+EwHITV10L82/poIKNQVVXW3aYgItMNXfOSBNqaNx4wVLRsrcMZWeMeJ
0OHmCy6Sac2dK4QtY/NtZFLL1zW4ajsj8X975RjqBMcL6QAdCMxNSWGEY/j/fMQy
FV1spy8FBfvXH+0O2POkYNBx/eICnMbGQ2ed5NZ8ONAl8wFLPM+2qIwmgb51a3TG
8jR613Zqyw0aMnky0MUwvj5TlwwUyyXbGKpizl0vkFw7/NYXroNMaYWJSHF/ddFp
vUChNqYhgCFY8FOhanFadyPxqU1InEoOkUS8/ij7NAR6cCrxsLuAohMYSNQJes3R
IvmU36FtjO53rBdHaN4Eizk4EHVaZDxoll05UPU/B5IvzXqR6zy0PphoOxfzvmmq
Qfk=
=Af93
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.7-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix an integer truncation in dma_direct_get_required_mask
(Kishon Vijay Abraham)
- fix the display of dma mapping types (Grygorii Strashko)
* tag 'dma-mapping-5.7-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: fix displaying of dma allocation type
dma-direct: fix data truncation in dma_direct_get_required_mask()
Merge yet more updates from Andrew Morton:
- Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
gup, hugetlb, pagemap, memremap)
- Various other things (hfs, ocfs2, kmod, misc, seqfile)
* akpm: (34 commits)
ipc/util.c: sysvipc_find_ipc() should increase position index
kernel/gcov/fs.c: gcov_seq_next() should increase position index
fs/seq_file.c: seq_read(): add info message about buggy .next functions
drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
change email address for Pali Rohár
selftests: kmod: test disabling module autoloading
selftests: kmod: fix handling test numbers above 9
docs: admin-guide: document the kernel.modprobe sysctl
fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
kmod: make request_module() return an error when autoloading is disabled
mm/memremap: set caching mode for PCI P2PDMA memory to WC
mm/memory_hotplug: add pgprot_t to mhp_params
powerpc/mm: thread pgprot_t through create_section_mapping()
x86/mm: introduce __set_memory_prot()
x86/mm: thread pgprot_t through init_memory_mapping()
mm/memory_hotplug: rename mhp_restrictions to mhp_params
mm/memory_hotplug: drop the flags field from struct mhp_restrictions
mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
mm/vma: introduce VM_ACCESS_FLAGS
mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
...
If seq_file .next function does not change position index, read after
some lseek can generate unexpected output.
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <longman@redhat.com>
Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "module autoloading fixes and cleanups", v5.
This series fixes a bug where request_module() was reporting success to
kernel code when module autoloading had been completely disabled via
'echo > /proc/sys/kernel/modprobe'.
It also addresses the issues raised on the original thread
(https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u)
bydocumenting the modprobe sysctl, adding a self-test for the empty path
case, and downgrading a user-reachable WARN_ONCE().
This patch (of 4):
It's long been possible to disable kernel module autoloading completely
(while still allowing manual module insertion) by setting
/proc/sys/kernel/modprobe to the empty string.
This can be preferable to setting it to a nonexistent file since it
avoids the overhead of an attempted execve(), avoids potential
deadlocks, and avoids the call to security_kernel_module_request() and
thus on SELinux-based systems eliminates the need to write SELinux rules
to dontaudit module_request.
However, when module autoloading is disabled in this way,
request_module() returns 0. This is broken because callers expect 0 to
mean that the module was successfully loaded.
Apparently this was never noticed because this method of disabling
module autoloading isn't used much, and also most callers don't use the
return value of request_module() since it's always necessary to check
whether the module registered its functionality or not anyway.
But improperly returning 0 can indeed confuse a few callers, for example
get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:
if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
fs = __get_fs_type(name, len);
WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
}
This is easily reproduced with:
echo > /proc/sys/kernel/modprobe
mount -t NONEXISTENT none /
It causes:
request_module fs-NONEXISTENT succeeded, but still no fs?
WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
[...]
This should actually use pr_warn_once() rather than WARN_ONCE(), since
it's also user-reachable if userspace immediately unloads the module.
Regardless, request_module() should correctly return an error when it
fails. So let's make it return -ENOENT, which matches the error when
the modprobe binary doesn't exist.
I've also sent patches to document and test this case.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Ben Hutchings <benh@debian.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org
Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk_deferred(), similarly to printk_safe/printk_nmi, does not
immediately attempt to print a new message on the consoles, avoiding
calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
which potentially can deadlock the system.
Those printk() flavors, instead, rely on per-CPU flush irq_work to print
messages from safer contexts. For same reasons (recursive scheduler or
timekeeping calls) printk() uses per-CPU irq_work in order to wake up
user space syslog/kmsg readers.
However, only printk_safe/printk_nmi do make sure that per-CPU areas
have been initialised and that it's safe to modify per-CPU irq_work.
This means that, for instance, should printk_deferred() be invoked "too
early", that is before per-CPU areas are initialised, printk_deferred()
will perform illegal per-CPU access.
Lech Perczak [0] reports that after commit 1b710b1b10 ("char/random:
silence a lockdep splat with printk()") user-space syslog/kmsg readers
are not able to read new kernel messages.
The reason is printk_deferred() being called too early (as was pointed
out by Petr and John).
Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
areas are initialized.
Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
Reported-by: Lech Perczak <l.perczak@camlintechnologies.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Jann Horn <jannh@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull proc fix from Eric Biederman:
"A brown paper bag slipped through my proc changes, and syzcaller
caught it when the code ended up in your tree.
I have opted to fix it the simplest cleanest way I know how, so there
is no reasonable chance for the bug to repeat"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use a dedicated lock in struct pid
Rework compat ioctl handling in the user space hibernation
interface (Christoph Hellwig) and fix a typo in a function
name in the cpuidle haltpoll driver (Yihao Wu).
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6QPXkSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx918P+NU9lM++pMi2Ng4wHmtRtHW9mH4M6L54
/LTjDNZ4IE/v6JlK6hzaFUAhrFViI23gmUat8a7TT/o1NUsPS0QHxatFK+nGbjEk
blB/rBHWpC5vGo8SZbqmvI0hbRn0Q6Ah5I+iV+KAk9Z76mDEMNHZKpP2CfiqSVQE
QYsUIJOSUo/CMT1SZRE/xDrvoU418Y1Ed6a6Kn9Ki5uXvqDTPHoAyETZ9M6tLphP
kGBpbnbkHx3FyYZ3EyjVEd8O6cDsxS2gIWu6YUCB31N4G7v3bJ6rfgperTCN999J
8eQY9rNVlaAWIP9t0ObC3xOpoaNUcZC+V+yaHev3LU/6LP4Sued8cNBxQn26/RWN
vyM5M6K0OphjPll9QfVfZFRUcuoBMyAtgVYw8mwT64GVJt3ukbd2QYDQHORXBQEC
ziTtfLQEuAiUw/DRoTGo2XYDTlnd2nu9mFoTRU6juOawOwgwPbQldOtSJiMtCDoR
cyaQH/t528w10jGCa+mIJZToTIet+gN6ui83M3kQdxMTa5ulgPClU8Kujn1wPgKX
6jFrf+NJ6SNm6A/EhPk9t/soe7hhzyGwqx351aMQpZGX6UH4hQQPXgC0tyZ71Uj5
UJ7Ys5RHcXg4kub/FJsfhv/3wdSPiekwe+U89UCQY5lxXC2x3iVjp50B4auCjW23
tQVHpAvegWg=
=h4ud
-----END PGP SIGNATURE-----
Merge tag 'pm-5.7-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"Rework compat ioctl handling in the user space hibernation interface
(Christoph Hellwig) and fix a typo in a function name in the cpuidle
haltpoll driver (Yihao Wu)"
* tag 'pm-5.7-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle-haltpoll: Fix small typo
PM / sleep: handle the compat case in snapshot_set_swap_area()
PM / sleep: move SNAPSHOT_SET_SWAP_AREA handling into a helper