OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Peter Zijlstra	929659acea	sched/completion: Add wait_for_completion_state() Allows waiting with a custom @state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220822114648.922711674@infradead.org	2022-09-07 21:53:49 +02:00
Peter Zijlstra	f9fc8cad97	sched: Add TASK_ANY for wait_task_inactive() Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net	2022-09-07 21:53:49 +02:00
Peter Zijlstra	9204a97f7a	sched: Change wait_task_inactive()s match_state Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org	2022-09-07 21:53:48 +02:00
Peter Zijlstra	1fbcaa923c	freezer,umh: Clean up freezer/initrd interaction handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure that the UMH, which is going to freeze the system, doesn't indefinitely wait for it's caller. Rework things by adding UMH_FREEZABLE to indicate the completion is freezable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.791019324@infradead.org	2022-09-07 21:53:48 +02:00
Peter Zijlstra	5950e5d574	freezer: Have {,un}lock_system_sleep() save/restore flags Rafael explained that the reason for having both PF_NOFREEZE and PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from kthread context that has previously called set_freezable(). In preparation of merging the flags, have {,un}lock_system_slee() save and restore current->flags. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.725003428@infradead.org	2022-09-07 21:53:48 +02:00
Peter Zijlstra	0b9d46fc5e	sched: Rename task_running() to task_on_cpu() There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net	2022-09-07 21:53:47 +02:00
Abel Wu	96c1c0cfe4	sched/fair: Cleanup for SIS_PROP The sched-domain of this cpu is only used for some heuristics when SIS_PROP is enabled, and it should be irrelevant whether the local sd_llc is valid or not, since all we care about is target sd_llc if !SIS_PROP. Access the local domain only when there is a need. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220907112000.1854-6-wuyun.abel@bytedance.com	2022-09-07 21:53:47 +02:00
Abel Wu	398ba2b0cc	sched/fair: Default to false in test_idle_cores() It's uncertain whether idle cores exist or not if shared sched- domains are not ready, so returning "no idle cores" usually makes sense. While __update_idle_core() is an exception, it checks status of this core and set hint to shared sched-domain if necessary. So the whole logic of this function depends on the existence of shared sched-domain, and can certainly bail out early if it is not available. It's somehow a little tricky, and as Josh suggested that it should be transient while the domain isn't ready. So remove the self-defined default value to make things more clearer. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-5-wuyun.abel@bytedance.com	2022-09-07 21:53:47 +02:00
Abel Wu	8eeeed9c4a	sched/fair: Remove useless check in select_idle_core() The function select_idle_core() only gets called when has_idle_cores is true which can be possible only when sched_smt_present is enabled. This change also aligns select_idle_core() with select_idle_smt() in the way that the caller do the check if necessary. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-4-wuyun.abel@bytedance.com	2022-09-07 21:53:46 +02:00
Abel Wu	b9bae70440	sched/fair: Avoid double search on same cpu The prev cpu is checked at the beginning of SIS, and it's unlikely to be idle before the second check in select_idle_smt(). So we'd better focus on its SMT siblings. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-3-wuyun.abel@bytedance.com	2022-09-07 21:53:46 +02:00
Abel Wu	3e6efe87cd	sched/fair: Remove redundant check in select_idle_smt() If two cpus share LLC cache, then the two cores they belong to are also in the same LLC domain. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-2-wuyun.abel@bytedance.com	2022-09-07 21:53:46 +02:00
Kumar Kartikeya Dwivedi	6df4ea1ff0	bpf: Support kptrs in percpu arraymap Enable support for kptrs in percpu BPF arraymap by wiring up the freeing of these kptrs from percpu map elements. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 11:46:08 -07:00
Jules Irenge	9fad7fe5b2	bpf: Fix resetting logic for unreferenced kptrs Sparse reported a warning at bpf_map_free_kptrs() "warning: Using plain integer as NULL pointer" During the process of fixing this warning, it was discovered that the current code erroneously writes to the pointer variable instead of deferencing and writing to the actual kptr. Hence, Sparse tool accidentally helped to uncover this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0). Note that the effect of this bug is that unreferenced kptrs will not be cleared during check_and_free_fields. It is not a problem if the clearing is not done during map_free stage, as there is nothing to free for them. Fixes: `14a324f6a6` ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 11:15:18 -07:00
Benjamin Tissoires	eb1f7f71c1	bpf/verifier: allow kfunc to return an allocated mem For drivers (outside of network), the incoming data is not statically defined in a struct. Most of the time the data buffer is kzalloc-ed and thus we can not rely on eBPF and BTF to explore the data. This commit allows to return an arbitrary memory, previously allocated by the driver. An interesting extra point is that the kfunc can mark the exported memory region as read only or read/write. So, when a kfunc is not returning a pointer to a struct but to a plain type, we can consider it is a valid allocated memory assuming that: - one of the arguments is either called rdonly_buf_size or rdwr_buf_size - and this argument is a const from the caller point of view We can then use this parameter as the size of the allocated memory. The memory is either read-only or read-write based on the name of the size parameter. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 11:05:17 -07:00
Benjamin Tissoires	f9b348185f	bpf/btf: bump BTF_KFUNC_SET_MAX_CNT net/bpf/test_run.c is already presenting 20 kfuncs. net/netfilter/nf_conntrack_bpf.c is also presenting an extra 10 kfuncs. Given that all the kfuncs are regrouped into one unique set, having only 2 space left prevent us to add more selftests. Bump it to 256. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-6-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 11:05:03 -07:00
Benjamin Tissoires	15baa55ff5	bpf/verifier: allow all functions to read user provided context When a function was trying to access data from context in a syscall eBPF program, the verifier was rejecting the call unless it was accessing the first element. This is because the syscall context is not known at compile time, and so we need to check this when actually accessing it. Check for the valid memory access if there is no convert_ctx callback, and allow such situation to happen. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-4-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 11:03:44 -07:00
Benjamin Tissoires	95f2f26f3c	bpf: split btf_check_subprog_arg_match in two btf_check_subprog_arg_match() was used twice in verifier.c: - when checking for the type mismatches between a (sub)prog declaration and BTF - when checking the call of a subprog to see if the provided arguments are correct and valid This is problematic when we check if the first argument of a program (pointer to ctx) is correctly accessed: To be able to ensure we access a valid memory in the ctx, the verifier assumes the pointer to context is not null. This has the side effect of marking the program accessing the entire context, even if the context is never dereferenced. For example, by checking the context access with the current code, the following eBPF program would fail with -EINVAL if the ctx is set to null from the userspace: ``` SEC("syscall") int prog(struct my_ctx *args) { return 0; } ``` In that particular case, we do not want to actually check that the memory is correct while checking for the BTF validity, but we just want to ensure that the (sub)prog definition matches the BTF we have. So split btf_check_subprog_arg_match() in two so we can actually check for the memory used when in a call, and ignore that part when not. Note that a further patch is in preparation to disentangled btf_check_func_arg_match() from these two purposes, and so right now we just add a new hack around that by adding a boolean to this function. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220906151303.2780789-3-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-07 10:57:28 -07:00
Jiapeng Chong	c478bd8836	cgroup/cpuset: remove unreachable code The function sched_partition_show cannot execute seq_puts, delete the invalid code. kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-07 05:30:29 -10:00
Xiu Jianfeng	934f70d9d4	audit: remove selinux_audit_rule_update() declaration selinux_audit_rule_update() has been renamed to audit_update_lsm_rules() since commit `d7a96f3a1a` ("Audit: internally use the new LSM audit hooks"), so remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-09-07 11:30:15 -04:00
Jim Cromie	66f4006b6a	kernel/module: add __dyndbg_classes section Add __dyndbg_classes section, using __dyndbg as a model. Use it: vmlinux.lds.h: KEEP the new section, which also silences orphan section warning on loadable modules. Add (__start_/__stop_)__dyndbg_classes linker symbols for the c externs (below). kernel/module/main.c: - fill new fields in find_module_sections(), using section_objs() - extend callchain prototypes to pass classes, length load_module(): pass new info to dynamic_debug_setup() dynamic_debug_setup(): new params, pass through to ddebug_add_module() dynamic_debug.c: - add externs to the linker symbols. ddebug_add_module(): - It currently builds a debug_table, and will find and attach classes. dynamic_debug_init(): - add class fields to the _ddebug_info cursor var: di. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-16-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-07 17:04:49 +02:00
Jim Cromie	b7b4eebdba	dyndbg: gather __dyndbg[] state into struct _ddebug_info This new struct composes the linker provided (vector,len) section, and provides a place to add other __dyndbg[] state-data later: descs - the vector of descriptors in __dyndbg section. num_descs - length of the data/section. Use it, in several different ways, as follows: In lib/dynamic_debug.c: ddebug_add_module(): Alter params-list, replacing 2 args (array,index) with a struct _ddebug_info * containing them both, with room for expansion. This helps future-proof the function prototype against the looming addition of class-map info into the dyndbg-state, by providing a place to add more member fields later. NB: later add static struct _ddebug_info builtins_state declaration, not needed yet. ddebug_add_module() is called in 2 contexts: In dynamic_debug_init(), declare, init a struct _ddebug_info di auto-var to use as a cursor. Then iterate over the prdbg blocks of the builtin modules, and update the di cursor before calling _add_module for each. Its called from kernel/module/main.c:load_info() for each loaded module: In internal.h, alter struct load_info, replacing the dyndbg array,len fields with an embedded _ddebug_info containing them both; and populate its members in find_module_sections(). The 2 calling contexts differ in that _init deals with contiguous subranges of __dyndbgs[] section, packed together, while loadable modules are added one at a time. So rename ddebug_add_module() into outer/__inner fns, call __inner from _init, and provide the offset into the builtin __dyndbgs[] where the module's prdbgs reside. The cursor provides start, len of the subrange for each. The offset will be used later to pack the results of builtin __dyndbg_sites[] de-duplication, and is 0 and unneeded for loadable modules, Note: kernel/module/main.c includes <dynamic_debug.h> for struct _ddeubg_info. This might be prone to include loops, since its also included by printk.h. Nothing has broken in robot-land on this. cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-12-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-07 17:04:48 +02:00
Christoph Hellwig	9fc18f6d56	dma-mapping: mark dma_supported static Now that the remaining users in drivers are gone, this function can be marked static. Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-09-07 10:38:28 +02:00
Chao Gao	43b919017f	swiotlb: fix a typo "overwirte" isn't a word. It should be "overwrite". Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-09-07 10:38:16 +02:00
Chao Gao	3f0461613e	swiotlb: avoid potential left shift overflow The second operand passed to slot_addr() is declared as int or unsigned int in all call sites. The left-shift to get the offset of a slot can overflow if swiotlb size is larger than 4G. Convert the macro to an inline function and declare the second argument as phys_addr_t to avoid the potential overflow. Fixes: `26a7e09478` ("swiotlb: refactor swiotlb_tbl_map_single") Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-09-07 10:38:16 +02:00
Robin Murphy	2995b8002c	dma-debug: improve search for partial syncs When bucket_find_contains() tries to find the original entry for a partial sync, it manages to constrain its search in a way that is both too restrictive and not restrictive enough. A driver which only uses single mappings rather than scatterlists might not set max_seg_size, but could still technically perform a partial sync at an offset of more than 64KB into a sufficiently large mapping, so we could stop searching too early before reaching a legitimate entry. Conversely, if no valid entry is present and max_range is large enough, we can pointlessly search buckets that we've already searched, or that represent an impossible wrapping around the bottom of the address space. At worst, the (legitimate) case of max_seg_size == UINT_MAX can make the loop infinite. Replace the fragile and frankly hard-to-follow "range" logic with a simple counted loop for the number of possible hash buckets below the given address. Reported-by: Yunfei Wang <yf.wang@mediatek.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-09-07 10:38:16 +02:00
Yu Zhao	81c12e922b	Revert "swiotlb: panic if nslabs is too small" This reverts commit `0bf28fc40d`. Reasons: 1. new panic()s shouldn't be added [1]. 2. It does no "cleanup" but breaks MIPS [2]. v2: properly solved the conflict [3] with commit `20347fca71` ("swiotlb: split up the global swiotlb lock") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [1] https://lore.kernel.org/r/CAHk-=wit-DmhMfQErY29JSPjFgebx_Ld+pnerc4J2Ag990WwAA@mail.gmail.com/ [2] https://lore.kernel.org/r/20220820012031.1285979-1-yuzhao@google.com/ [3] https://lore.kernel.org/r/202208310701.LKr1WDCh-lkp@intel.com/ Fixes: `0bf28fc40d` ("swiotlb: panic if nslabs is too small") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-09-07 10:38:16 +02:00
Yonghong Song	720e6a4351	bpf: Allow struct argument in trampoline based programs Allow struct argument in trampoline based programs where the struct size should be <= 16 bytes. In such cases, the argument will be put into up to 2 registers for bpf, x86_64 and arm64 architectures. To support arch-specific trampoline manipulation, add arg_flags for additional struct information about arguments in btf_func_model. Such information will be used in arch specific function arch_prepare_bpf_trampoline() to prepare argument access properly in trampoline. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-06 19:51:14 -07:00
Alexei Starovoitov	1e660f7ebe	bpf: Replace __ksize with ksize. __ksize() was made private. Use ksize() instead. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-09-06 19:38:53 -07:00
Xiu Jianfeng	93d71986a6	rv/reactor: add __init/__exit annotations to module init/exit funcs Add missing __init/__exit annotations to module init/exit funcs. Link: https://lkml.kernel.org/r/20220906141210.132607-1-xiujianfeng@huawei.com Fixes: `135b881ea8` ("rv/reactor: Add the printk reactor") Fixes: `e88043c0ac` ("rv/reactor: Add the panic reactor") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-09-06 22:26:00 -04:00
Masami Hiramatsu (Google)	cecf8e128e	tracing: Fix to check event_mutex is held while accessing trigger list Since the check_user_trigger() is called outside of RCU read lock, this list_for_each_entry_rcu() caused a suspicious RCU usage warning. # echo hist:keys=pid > events/sched/sched_stat_runtime/trigger # cat events/sched/sched_stat_runtime/trigger [ 43.167032] [ 43.167418] ============================= [ 43.167992] WARNING: suspicious RCU usage [ 43.168567] 5.19.0-rc5-00029-g19ebe4651abf #59 Not tainted [ 43.169283] ----------------------------- [ 43.169863] kernel/trace/trace_events_trigger.c:145 RCU-list traversed in non-reader section!! ... However, this file->triggers list is safe when it is accessed under event_mutex is held. To fix this warning, adds a lockdep_is_held check to the list_for_each_entry_rcu(). Link: https://lkml.kernel.org/r/166226474977.223837.1992182913048377113.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: `7491e2c442` ("tracing: Add a probe that attaches to trace events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-09-06 22:26:00 -04:00
Yipeng Zou	54c3931957	tracing: hold caller_addr to hardirq_{enable,disable}_ip Currently, The arguments passing to lockdep_hardirqs_{on,off} was fixed in CALLER_ADDR0. The function trace_hardirqs_on_caller should have been intended to use caller_addr to represent the address that caller wants to be traced. For example, lockdep log in riscv showing the last {enabled,disabled} at __trace_hardirqs_{on,off} all the time(if called by): [ 57.853175] hardirqs last enabled at (2519): __trace_hardirqs_on+0xc/0x14 [ 57.853848] hardirqs last disabled at (2520): __trace_hardirqs_off+0xc/0x14 After use trace_hardirqs_xx_caller, we can get more effective information: [ 53.781428] hardirqs last enabled at (2595): restore_all+0xe/0x66 [ 53.782185] hardirqs last disabled at (2596): ret_from_exception+0xa/0x10 Link: https://lkml.kernel.org/r/20220901104515.135162-2-zouyipeng@huawei.com Cc: stable@vger.kernel.org Fixes: `c3bc8fd637` ("tracing: Centralize preemptirq tracepoints and unify their usage") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-09-06 22:26:00 -04:00
Alison Schofield	54be550942	tracepoint: Allow trace events in modules with TAINT_TEST Commit `2852ca7fba` ("panic: Taint kernel if tests are run") introduced a new taint type, TAINT_TEST, to signal that an in-kernel test module has been loaded. TAINT_TEST taint type defaults into a 'bad_taint' list for kernel tracing and blocks the creation of trace events. This causes a problem for CXL testing where loading the cxl_test module makes all CXL modules out-of-tree, blocking any trace events. Trace events are in development for CXL at the moment and this issue was found in test with v6.0-rc1. Link: https://lkml.kernel.org/r/20220829171048.263065-1-alison.schofield@intel.com Fixes: `2852ca7fba` ("panic: Taint kernel if tests are run") Reported-by: Ira Weiny <ira.weiny@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-09-06 22:26:00 -04:00
Daniel Bristot de Oliveira	baf2c00240	rv/monitors: Make monitor's automata definition static Monitor's automata definition is only used locally, so make them static for all existing monitors. Link: https://lore.kernel.org/all/202208210332.gtHXje45-lkp@intel.com Link: https://lore.kernel.org/all/202208210358.6HH3OrVs-lkp@intel.com Link: https://lkml.kernel.org/r/a50e27c3738d6ef809f4201857229fed64799234.1661266564.git.bristot@kernel.org Fixes: `ccc319dcb4` ("rv/monitor: Add the wwnr monitor") Fixes: `8812d21219` ("rv/monitor: Add the wip monitor skeleton created by dot2k") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-09-06 22:13:25 -04:00
Paolo Abeni	2786bcff28	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2022-09-05 The following pull-request contains BPF updates for your net-next tree. We've added 106 non-merge commits during the last 18 day(s) which contain a total of 159 files changed, 5225 insertions(+), 1358 deletions(-). There are two small merge conflicts, resolve them as follows: 1) tools/testing/selftests/bpf/DENYLIST.s390x Commit `27e23836ce` ("selftests/bpf: Add lru_bug to s390x deny list") in bpf tree was needed to get BPF CI green on s390x, but it conflicted with newly added tests on bpf-next. Resolve by adding both hunks, result: [...] lru_bug # prog 'printk': failed to auto-attach: -524 setget_sockopt # attach unexpected error: -524 (trampoline) cb_refs # expected error message unexpected error: -524 (trampoline) cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc) htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline) [...] 2) net/core/filter.c Commit `1227c1771d` ("net: Fix data-races around sysctl_[rw]mem_(max\|default).") from net tree conflicts with commit `29003875bd` ("bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from bpf-next tree, result: [...] if (getopt) { if (optname == SO_BINDTODEVICE) return -EINVAL; return sk_getsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), KERNEL_SOCKPTR(optlen)); } return sk_setsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), optlen); [...] The main changes are: 1) Add any-context BPF specific memory allocator which is useful in particular for BPF tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov. 2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort to reuse the existing core socket code as much as possible, from Martin KaFai Lau. 3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani. 4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo. 5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF integration with the rstat framework, from Yosry Ahmed. 6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev. 7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao. 8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF backend, from James Hilliard. 9) Fix verifier helper permissions and reference state management for synchronous callbacks, from Kumar Kartikeya Dwivedi. 10) Add support for BPF selftest's xskxceiver to also be used against real devices that support MAC loopback, from Maciej Fijalkowski. 11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet. 12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu. 13) Various minor misc improvements all over the place. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits) bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. bpf: Remove usage of kmem_cache from bpf_mem_cache. bpf: Remove prealloc-only restriction for sleepable bpf programs. bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs. bpf: Remove tracing program restriction on map types bpf: Convert percpu hash map to per-cpu bpf_mem_alloc. bpf: Add percpu allocation support to bpf_mem_alloc. bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU. bpf: Adjust low/high watermarks in bpf_mem_cache bpf: Optimize call_rcu in non-preallocated hash map. bpf: Optimize element count in non-preallocated hash map. bpf: Relax the requirement to use preallocated hash maps in tracing progs. samples/bpf: Reduce syscall overhead in map_perf_test. selftests/bpf: Improve test coverage of test_maps bpf: Convert hash map to bpf_mem_alloc. bpf: Introduce any context BPF specific memory allocator. selftest/bpf: Add test for bpf_getsockopt() bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt() ... ==================== Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-06 23:21:18 +02:00
Tejun Heo	8a693f7766	cgroup: Remove CFTYPE_PRESSURE CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into its own cftypes array and add/rm them conditionally. This is a bit more code but cleaner. No userland visible changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org>	2022-09-06 09:38:55 -10:00
Tejun Heo	0083d27b21	cgroup: Improve cftype add/rm error handling Let's track whether a cftype is currently added or not using a new flag __CFTYPE_ADDED so that duplicate operations can be failed safely and consistently allow using empty cftypes. Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-06 09:38:42 -10:00
Kan Liang	ee9db0e14b	perf: Use sample_flags for txn Use the new sample_flags to indicate whether the txn field is filled by the PMU driver. Remove the txn field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com	2022-09-06 11:33:03 +02:00
Kan Liang	e16fd7f2cb	perf: Use sample_flags for data_src Use the new sample_flags to indicate whether the data_src field is filled by the PMU driver. Remove the data_src field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com	2022-09-06 11:33:03 +02:00
Kan Liang	2abe681da0	perf: Use sample_flags for weight Use the new sample_flags to indicate whether the weight field is filled by the PMU driver. Remove the weight field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com	2022-09-06 11:33:02 +02:00
Kan Liang	a9a931e266	perf: Use sample_flags for branch stack Use the new sample_flags to indicate whether the branch stack is filled by the PMU driver. Remove the br_stack from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com	2022-09-06 11:33:02 +02:00
Kan Liang	3aac580d5c	perf: Add sample_flags to indicate the PMU-filled sample data On some platforms, some data e.g., timestamps, can be retrieved from the PMU driver. Usually, the data from the PMU driver is more accurate. The current perf kernel should output the PMU-filled sample data if it's available. To check the availability of the PMU-filled sample data, the current perf kernel initializes the related fields in the perf_sample_data_init(). When outputting a sample, the perf checks whether the field is updated by the PMU driver. If yes, the updated value will be output. If not, the perf uses an SW way to calculate the value or just outputs the initialized value if an SW way is unavailable either. With more and more data being provided by the PMU driver, more fields has to be initialized in the perf_sample_data_init(). That will increase the number of cache lines touched in perf_sample_data_init() and be harmful to the performance. Add new "sample_flags" to indicate the PMU-filled sample data. The PMU driver should set the corresponding PERF_SAMPLE_ flag when the field is updated. The initialization of the corresponding field is not required anymore. The following patches will make use of it and remove the corresponding fields from the perf_sample_data_init(), which will further minimize the number of cache lines touched. Only clear the sample flags that have already been done by the PMU driver in the perf_prepare_sample() for the PERF_RECORD_SAMPLE. For the other PERF_RECORD_ event type, the sample data is not available. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-2-kan.liang@linux.intel.com	2022-09-06 11:33:01 +02:00
Yang Jihong	6b959ba22d	perf/core: Fix reentry problem in perf_output_read_group() perf_output_read_group may respond to IPI request of other cores and invoke __perf_install_in_context function. As a result, hwc configuration is modified. causing inconsistency and unexpected consequences. Interrupts are not disabled when perf_output_read_group reads PMU counter. In this case, IPI request may be received from other cores. As a result, PMU configuration is modified and an error occurs when reading PMU counter: CPU0 CPU1 __se_sys_perf_event_open perf_install_in_context perf_output_read_group smp_call_function_single for_each_sibling_event(sub, leader) { generic_exec_single if ((sub != event) && remote_function (sub->state == PERF_EVENT_STATE_ACTIVE)) \| <enter IPI handler: __perf_install_in_context> <----RAISE IPI-----+ __perf_install_in_context ctx_resched event_sched_out armpmu_del ... hwc->idx = -1; // event->hwc.idx is set to -1 ... <exit IPI> sub->pmu->read(sub); armpmu_read armv8pmu_read_counter armv8pmu_read_hw_counter int idx = event->hw.idx; // idx = -1 u64 val = armv8pmu_read_evcntr(idx); u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30 read_pmevcntrn(counter) // undefined instruction Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com	2022-09-06 11:33:00 +02:00
Alexei Starovoitov	9f2c6e96c6	bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. User space might be creating and destroying a lot of hash maps. Synchronous rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets and other map memory and may cause artificial OOM situation under stress. Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc: - remove rcu_barrier from hash map, since htab doesn't use call_rcu directly and there are no callback to wait for. - bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks. Use it to avoid barriers in fast path. - When barriers are needed copy bpf_mem_alloc into temp structure and wait for rcu barrier-s in the worker to let the rest of hash map freeing to proceed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com	2022-09-05 15:33:07 +02:00
Alexei Starovoitov	bfc03c15be	bpf: Remove usage of kmem_cache from bpf_mem_cache. For bpf_mem_cache based hash maps the following stress test: for (i = 1; i <= 512; i <<= 1) for (j = 1; j <= 1 << 18; j <<= 1) fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, i, j, 2, 0); creates many kmem_cache-s that are not mergeable in debug kernels and consume unnecessary amount of memory. Turned out bpf_mem_cache's free_list logic does batching well, so usage of kmem_cache for fixes size allocations doesn't bring any performance benefits vs normal kmalloc. Hence get rid of kmem_cache in bpf_mem_cache. That saves memory, speeds up map create/destroy operations, while maintains hash map update/delete performance. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-16-alexei.starovoitov@gmail.com	2022-09-05 15:33:07 +02:00
Alexei Starovoitov	02cc5aa29e	bpf: Remove prealloc-only restriction for sleepable bpf programs. Since hash map is now converted to bpf_mem_alloc and it's waiting for rcu and rcu_tasks_trace GPs before freeing elements into global memory slabs it's safe to use dynamically allocated hash maps in sleepable bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-15-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	dccb4a9013	bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs. Use call_rcu_tasks_trace() to wait for sleepable progs to finish. Then use call_rcu() to wait for normal progs to finish and finally do free_one() on each element when freeing objects into global memory pool. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-14-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	96da3f7d48	bpf: Remove tracing program restriction on map types The hash map is now fully converted to bpf_mem_alloc. Its implementation is not allocating synchronously and not calling call_rcu() directly. It's now safe to use non-preallocated hash maps in all types of tracing programs including BPF_PROG_TYPE_PERF_EVENT that runs out of NMI context. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-13-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	ee4ed53c5e	bpf: Convert percpu hash map to per-cpu bpf_mem_alloc. Convert dynamic allocations in percpu hash map from alloc_percpu() to bpf_mem_cache_alloc() from per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees objects after RCU gp the call_rcu() is removed. pcpu_init_value() now needs to zero-fill per-cpu allocations, since dynamically allocated map elements are now similar to full prealloc, since alloc_percpu() is not called inline and the elements are reused in the freelist. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-12-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	4ab67149f3	bpf: Add percpu allocation support to bpf_mem_alloc. Extend bpf_mem_alloc to cache free list of fixed size per-cpu allocations. Once such cache is created bpf_mem_cache_alloc() will return per-cpu objects. bpf_mem_cache_free() will free them back into global per-cpu pool after observing RCU grace period. per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps. The free list cache consists of tuples { llist_node, per-cpu pointer } Unlike alloc_percpu() that returns per-cpu pointer the bpf_mem_cache_alloc() returns a pointer to per-cpu pointer and bpf_mem_cache_free() expects to receive it back. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	8d5a8011b3	bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU. SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change solves the memory consumption issue, avoids kmem_cache_destroy latency and keeps bpf hash map performance the same. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-10-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	7c266178aa	bpf: Adjust low/high watermarks in bpf_mem_cache The same low/high watermarks for every bucket in bpf_mem_cache consume significant amount of memory. Preallocating 64 elements of 4096 bytes each in the free list is not efficient. Make low/high watermarks and batching value dependent on element size. This change brings significant memory savings. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-9-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	0fd7c5d433	bpf: Optimize call_rcu in non-preallocated hash map. Doing call_rcu() million times a second becomes a bottle neck. Convert non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU. The rcu critical section is no longer observed for one htab element which makes non-preallocated hash map behave just like preallocated hash map. The map elements are released back to kernel memory after observing rcu critical section. This improves 'map_perf_test 4' performance from 100k events per second to 250k events per second. bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide 10x performance boost to non-preallocated hash map and make it within few % of preallocated map while consuming fraction of memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-8-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	86fe28f769	bpf: Optimize element count in non-preallocated hash map. The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus access the same bpf map. Based on specified max_entries for the hash map calculate when percpu_counter becomes faster than atomic_t and use it for such maps. For example samples/bpf/map_perf_test is using hash map with max_entries 1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per second using atomic_t. On a system with 15 cpus it shows 100k events per second using percpu. map_perf_test is an extreme case where all cpus colliding on atomic_t which causes extreme cache bouncing. Note that the slow path of percpu_counter is 5k events per secound vs 14k for atomic, so the heuristic is necessary. See comment in the code why the heuristic is based on num_online_cpus(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-7-alexei.starovoitov@gmail.com	2022-09-05 15:33:06 +02:00
Alexei Starovoitov	34dd3bad1a	bpf: Relax the requirement to use preallocated hash maps in tracing progs. Since bpf hash map was converted to use bpf_mem_alloc it is safe to use from tracing programs and in RT kernels. But per-cpu hash map is still using dynamic allocation for per-cpu map values, hence keep the warning for this map type. In the future alloc_percpu_gfp can be front-end-ed with bpf_mem_cache and this restriction will be completely lifted. perf_event (NMI) bpf programs have to use preallocated hash maps, because free_htab_elem() is using call_rcu which might crash if re-entered. Sleepable bpf programs have to use preallocated hash maps, because life time of the map elements is not protected by rcu_read_lock/unlock. This restriction can be lifted in the future as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-6-alexei.starovoitov@gmail.com	2022-09-05 15:33:05 +02:00
Alexei Starovoitov	fba1a1c6c9	bpf: Convert hash map to bpf_mem_alloc. Convert bpf hash map to use bpf memory allocator. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-3-alexei.starovoitov@gmail.com	2022-09-05 15:33:05 +02:00
Alexei Starovoitov	7c8199e24f	bpf: Introduce any context BPF specific memory allocator. Tracing BPF programs can attach to kprobe and fentry. Hence they run in unknown context where calling plain kmalloc() might not be safe. Front-end kmalloc() with minimal per-cpu cache of free elements. Refill this cache asynchronously from irq_work. BPF programs always run with migration disabled. It's safe to allocate from cache of the current cpu with irqs disabled. Free-ing is always done into bucket of the current cpu as well. irq_work trims extra free elements from buckets with kfree and refills them with kmalloc, so global kmalloc logic takes care of freeing objects allocated by one cpu and freed on another. struct bpf_mem_alloc supports two modes: - When size != 0 create kmem_cache and bpf_mem_cache for each cpu. This is typical bpf hash map use case when all elements have equal size. - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on kmalloc/kfree. Max allocation size is 4096 in this case. This is bpf_dynptr and bpf_kptr use case. bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree. bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free. The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com	2022-09-05 15:33:05 +02:00
Yishai Hadas	85eaeb5058	IB/core: Fix a nested dead lock as part of ODP flow Fix a nested dead lock as part of ODP flow by using mmput_async(). From the below call trace [1] can see that calling mmput() once we have the umem_odp->umem_mutex locked as required by ib_umem_odp_map_dma_and_lock() might trigger in the same task the exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() which may dead lock when trying to lock the same mutex. Moving to use mmput_async() will solve the problem as the above exit_mmap() flow will be called in other task and will be executed once the lock will be available. [1] [64843.077665] task:kworker/u133:2 state:D stack: 0 pid:80906 ppid: 2 flags:0x00004000 [64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib] [64843.077719] Call Trace: [64843.077722] <TASK> [64843.077724] __schedule+0x23d/0x590 [64843.077729] schedule+0x4e/0xb0 [64843.077735] schedule_preempt_disabled+0xe/0x10 [64843.077740] __mutex_lock.constprop.0+0x263/0x490 [64843.077747] __mutex_lock_slowpath+0x13/0x20 [64843.077752] mutex_lock+0x34/0x40 [64843.077758] mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib] [64843.077808] __mmu_notifier_release+0x1a4/0x200 [64843.077816] exit_mmap+0x1bc/0x200 [64843.077822] ? walk_page_range+0x9c/0x120 [64843.077828] ? __cond_resched+0x1a/0x50 [64843.077833] ? mutex_lock+0x13/0x40 [64843.077839] ? uprobe_clear_state+0xac/0x120 [64843.077860] mmput+0x5f/0x140 [64843.077867] ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core] [64843.077931] pagefault_real_mr+0x9a/0x140 [mlx5_ib] [64843.077962] pagefault_mr+0xb4/0x550 [mlx5_ib] [64843.077992] pagefault_single_data_segment.constprop.0+0x2ac/0x560 [mlx5_ib] [64843.078022] mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib] [64843.078051] process_one_work+0x22b/0x3d0 [64843.078059] worker_thread+0x53/0x410 [64843.078065] ? process_one_work+0x3d0/0x3d0 [64843.078073] kthread+0x12a/0x150 [64843.078079] ? set_kthread_struct+0x50/0x50 [64843.078085] ret_from_fork+0x22/0x30 [64843.078093] </TASK> Fixes: `36f30e486d` ("IB/core: Improve ODP to use hmm_range_fault()") Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2022-09-05 14:47:40 +03:00
Greg Kroah-Hartman	c2e4065965	sched/debug: fix dentry leak in update_sched_domain_debugfs Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup()) leaks a dentry and with a hotplug stress test, the machine eventually runs out of memory. Fix this up by using the newly created debugfs_lookup_and_remove() call instead which properly handles the dentry reference counting logic. Cc: Major Chen <major.chen@samsung.com> Cc: stable <stable@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-05 13:02:38 +02:00
Greg Kroah-Hartman	35f2e3c267	Merge 6.0-rc4 into tty-next We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-05 07:59:28 +02:00
Waiman Long	d7c8142d5a	cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule Currently, changes in "cpust.cpus" of a partition root is not allowed if it violates the sibling cpu exclusivity rule when the check is done in the validate_change() function. That is inconsistent with the other cpuset changes that are always allowed but may make a partition invalid. Update the cpuset code to allow cpumask change even if it violates the sibling cpu exclusivity rule, but invalidate the partition instead just like the other changes. However, other sibling partitions with conflicting cpumask will also be invalidated in order to not violating the exclusivity rule. This behavior is specific to this partition rule violation. Note that a previous commit has made sibling cpu exclusivity rule check the last check of validate_change(). So if -EINVAL is returned, we can be sure that sibling cpu exclusivity rule violation is the only rule that is broken. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:28 -10:00
Waiman Long	74027a6535	cgroup/cpuset: Relocate a code block in validate_change() This patch moves down the exclusive cpu and memory check in validate_change(). There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:28 -10:00
Waiman Long	7476a636d3	cgroup/cpuset: Show invalid partition reason string There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help user to get a better understanding of the underlying reason, The cpuset.cpus.partition control file, when read, will now report the reason why a partition become invalid. When a partition does become invalid, reading the control file will show "root invalid (<reason>)" where <reason> is a string that describes why the partition is invalid. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Waiman Long	f28e22441f	cgroup/cpuset: Add a new isolated cpus.partition type Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Waiman Long	f0af1bfc27	cgroup/cpuset: Relax constraints to partition & cpus changes Currently, enabling a partition root is only allowed if all the constraints of a valid partition are satisfied. Even changes to "cpuset.cpus" may not be allowed in some cases. Moreover, there are limits to changes made to a parent cpuset if it is a valid partition root. This is contrary to the general cgroup v2 philosophy. This patch relaxes the constraints of changing the state of "cpuset.cpus" and "cpuset.cpus.partition". Now all valid changes ("member" or "root") to "cpuset.cpus.partition" are allowed even if there are child cpusets underneath it. Trying to make a cpuset a partition root, however, will cause its state to become invalid if the following constraints of a valid partition root are not satisfied. 1) The "cpuset.cpus" is non-empty and exclusive. 2) The parent cpuset is a valid partition root. 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus". Similarly, almost all changes to "cpuset.cpus" are allowed with the exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the exclusivity rule will still apply. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Waiman Long	e2d59900d9	cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Currently, a partition root cannot have empty "cpuset.cpus.effective". As a result, a parent partition root cannot distribute out all its CPUs to child partitions with no CPUs left. However in most cases, there shouldn't be any tasks associated with intermediate nodes of the default hierarchy. So the current rule is too restrictive and can waste valuable CPU resource. To address this issue, we are now allowing a partition to have empty "cpuset.cpus.effective" as long as it has no task. Since cpuset is threaded, no-internal-process rule does not apply. So it is possible to have tasks in a partition root with child sub-partitions even though that should be uncommon. A parent partition with no task can now have all its CPUs distributed out to its child partitions. The top cpuset always have some house-keeping tasks running and so its list of effective cpu can't be empty. Once a partition with empty "cpuset.cpus.effective" is formed, no new task can be moved into it until "cpuset.cpus.effective" becomes non-empty. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Waiman Long	18065ebe9b	cgroup/cpuset: Miscellaneous cleanups & add helper functions The partition root state (PRS) macro names do not currently match the external names. Change them to match the external names and add helper functions to read or change the state. Shorten the cpuset argument of update_parent_subparts_cpumask() to cs to match other cpuset functions. Remove the new_prs argument from notify_partition_change() as the cs->partition_root_state has already been set to new_prs before it is called. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Waiman Long	ec5fbdfb99	cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset Previously, update_tasks_cpumask() is not supposed to be called with top cpuset. With cpuset partition that takes CPUs away from the top cpuset, adjusting the cpus_mask of the tasks in the top cpuset is necessary. Percpu kthreads, however, are ignored. Fixes: `ee8dde0cd2` ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 10:47:27 -10:00
Josh Don	5251c6c436	cgroup: add pids.peak interface for pids controller pids.peak tracks the high watermark of usage for number of pids. This helps give a better baseline on which to set pids.max. Polling pids.current isn't really feasible, since it would potentially miss short-lived spikes. This interface is analogous to memory.peak. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-04 09:26:51 -10:00
Tejun Heo	dc79ec1b23	cgroup: Remove data-race around cgrp_dfl_visible There's a seemingly harmless data-race around cgrp_dfl_visible detected by kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/	2022-09-04 09:16:19 -10:00
Hou Tao	ef331a8d4c	bpf: Only add BTF IDs for socket security hooks when CONFIG_SECURITY_NETWORK is on When CONFIG_SECURITY_NETWORK is disabled, there will be build warnings from resolve_btfids: WARN: resolve_btfids: unresolved symbol bpf_lsm_socket_socketpair ...... WARN: resolve_btfids: unresolved symbol bpf_lsm_inet_conn_established Fixing it by wrapping these BTF ID definitions by CONFIG_SECURITY_NETWORK. Fixes: `69fd337a97` ("bpf: per-cgroup lsm flavor") Fixes: `9113d7e48e` ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220901065126.3856297-1-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-09-01 16:21:14 -07:00
Al Viro	bf2e1ae417	audit_init_parent(): constify path Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-09-01 17:39:30 -04:00
Jiapeng Chong	ccf365eac0	bpf: Remove useless else if The assignment of the else and else if branches is the same, so the else if here is redundant, so we remove it and add a comment to make the code here readable. ./kernel/bpf/cgroup_iter.c:81:6-8: WARNING: possible condition with no effect (if == else). Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2016 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20220831021618.86770-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-09-01 14:04:07 -07:00
Jakub Kicinski	60ad1100d5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net tools/testing/selftests/net/.gitignore sort the net-next version and use it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-01 12:58:02 -07:00
Hou Tao	c89e843a11	bpf: Use this_cpu_{inc_return\|dec} for prog->active Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption safe and now migrate_disable() doesn't disable preemption, so the update of prog-active is not atomic and in theory under fully preemptible kernel recurisve prevention may do not work. Fixing by using the preemption-safe and IRQ-safe variants. Fixes: `ca06f55b90` ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-09-01 12:16:18 -07:00
Hou Tao	197827a05e	bpf: Use this_cpu_{inc\|dec\|inc_return} for bpf_task_storage_busy Now migrate_disable() does not disable preemption and under some architectures (e.g. arm64) __this_cpu_{inc\|dec\|inc_return} are neither preemption-safe nor IRQ-safe, so for fully preemptible kernel concurrent lookups or updates on the same task local storage and on the same CPU may make bpf_task_storage_busy be imbalanced, and bpf_task_storage_trylock() on the specific cpu will always fail. Fixing it by using this_cpu_{inc\|dec\|inc_return} when manipulating bpf_task_storage_busy. Fixes: `bc235cdb42` ("bpf: Prevent deadlock from recursive bpf_task_storage_[get\|delete]") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-09-01 12:16:12 -07:00
Paul E. McKenney	5c0ec49004	Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', 'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD doc.2022.08.31b: Documentation updates fixes.2022.08.31b: Miscellaneous fixes kvfree.2022.08.31b: kvfree_rcu() updates nocb.2022.09.01a: NOCB CPU updates poll.2022.08.31b: Full-oldstate RCU polling grace-period API poll-srcu.2022.08.31b: Polled SRCU grace-period updates tasks.2022.08.31b: Tasks RCU updates	2022-09-01 10:55:57 -07:00
Zqiang	48297a22a3	rcutorture: Use the barrier operation specified by cur_ops The rcutorture_oom_notify() function unconditionally invokes rcu_barrier(), which is OK when the rcutorture.torture_type value is "rcu", but unhelpful otherwise. The purpose of these barrier calls is to wait for all outstanding callback-flooding callbacks to be invoked before cleaning up their data. Using the wrong barrier function therefore risks arbitrary memory corruption. Thus, this commit changes these rcu_barrier() calls into cur_ops->cb_barrier() to make things work when torturing non-vanilla flavors of RCU. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-09-01 10:50:04 -07:00
Linus Torvalds	42e66b1cc3	Networking fixes for 6.0-rc4, including fixes from bluetooth, bpf and wireless. Current release - regressions: - bpf: - fix wrong last sg check in sk_msg_recvmsg() - fix kernel BUG in purge_effective_progs() - mac80211: - fix possible leak in ieee80211_tx_control_port() - potential NULL dereference in ieee80211_tx_control_port() Current release - new code bugs: - nfp: fix the access to management firmware hanging Previous releases - regressions: - ip: fix triggering of 'icmp redirect' - sched: tbf: don't call qdisc_put() while holding tree lock - bpf: fix corrupted packets for XDP_SHARED_UMEM - bluetooth: hci_sync: fix suspend performance regression - micrel: fix probe failure Previous releases - always broken: - tcp: make global challenge ack rate limitation per net-ns and default disabled - tg3: fix potential hang-up on system reboot - mac802154: fix reception for no-daddr packets Misc: - r8152: add PID for the lenovo onelink+ dock Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmMQda0SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOk2eAQAJHZNo2CiN8dmVrT/e3Fc3GMMPhVIAHO lOjIUHIrV5BtsedhSrzAVTviMxVxC4CXAE8pJcE+5Y8MMygQYxZ3QF/93SSLFDKn zvhA1KizjmS7k2m7DNlS61aTwwPFBwc7dv388LrSUFdH0ZZfot+UXfzq4O8RSBUe mlYYLsiSRW5lUvu6j9hMSWn8D/A2k+BboA6Q1Q+PgK1tIpuEuv1gGg8IeV23xkfa hKLpZjtbrYPdGMKLMzmI5Ww4bqctZtCbPedSqBqydpmCyRsO/07G4fJLRffYsbSy nSREYF1QNSry/caR9KYHj602IwNywneIHV3cAO3B/ETFzThPkOmJbu2Em621G7+Z 1HpWmser7eiHDz0rDYLQlFr/ZYcSF4TwoNH4ha9hiKRpnHTZgD0USudLG+vvTNs5 DgGCAzJpdxI8Erks8Em9pYGEtKczZRp5MT+pZR+AAYkkryYANV6043+Xxbadal73 CsVXODmHmmCSG346juOubujDLADUyS+RWf2eMIFy289CRUHpGbZQ8Ai2UM3dqaX1 mgFpEAhJ78rmNBv8pVrKSJjE4Bx2s3hzgEe8tk9DHWCrODAAL490wzpMsVGvW+lz jTs2XNJ7MRDqV3KqMnZKlw0ESc0nSHz7BCztCbRQXfg6PxsIOTGD6ZB5kPQOHjU5 XP3Y5g3775az =doxx -----END PGP SIGNATURE----- Merge tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth, bpf and wireless. Current release - regressions: - bpf: - fix wrong last sg check in sk_msg_recvmsg() - fix kernel BUG in purge_effective_progs() - mac80211: - fix possible leak in ieee80211_tx_control_port() - potential NULL dereference in ieee80211_tx_control_port() Current release - new code bugs: - nfp: fix the access to management firmware hanging Previous releases - regressions: - ip: fix triggering of 'icmp redirect' - sched: tbf: don't call qdisc_put() while holding tree lock - bpf: fix corrupted packets for XDP_SHARED_UMEM - bluetooth: hci_sync: fix suspend performance regression - micrel: fix probe failure Previous releases - always broken: - tcp: make global challenge ack rate limitation per net-ns and default disabled - tg3: fix potential hang-up on system reboot - mac802154: fix reception for no-daddr packets Misc: - r8152: add PID for the lenovo onelink+ dock" * tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits) net/smc: Remove redundant refcount increase Revert "sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb" tcp: make global challenge ack rate limitation per net-ns and default disabled tcp: annotate data-race around challenge_timestamp net: dsa: hellcreek: Print warning only once ip: fix triggering of 'icmp redirect' sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb selftests: net: sort .gitignore file Documentation: networking: correct possessive "its" kcm: fix strp_init() order and cleanup mlxbf_gige: compute MDIO period based on i1clk ethernet: rocker: fix sleep in atomic context bug in neigh_timer_handler net: lan966x: improve error handle in lan966x_fdma_rx_get_frame() nfp: fix the access to management firmware hanging net: phy: micrel: Make the GPIO to be non-exclusive net: virtio_net: fix notification coalescing comments net/sched: fix netdevice reference leaks in attach_default_qdiscs() net: sched: tbf: don't call qdisc_put() while holding tree lock net: Use u64_stats_fetch_begin_irq() for stats fetch. net: dsa: xrs700x: Use irqsave variant for u64 stats update ...	2022-09-01 09:20:42 -07:00
Tejun Heo	e2691f6b44	cgroup: Implement cgroup_file_show() Add cgroup_file_show() which allows toggling visibility of a cgroup file using the new kernfs_show(). This will be used to hide psi interface files on cgroups where it's disabled. Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220828050440.734579-10-tj@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-01 18:08:44 +02:00
Daniel Vetter	0a64ce6e54	kernel/panic: Drop unblank_screen call console_unblank() does this too (called in both places right after), and with a lot more confidence inspiring approach to locking. Reconstructing this story is very strange: In `b61312d353` ("oops handling: ensure that any oops is flushed to the mtdoops console") it is claimed that a printk(" "); flushed out the console buffer, which was removed in `e3e8a75d2a` ("[PATCH] Extract and use wake_up_klogd()"). In todays kernels this is done way earlier in console_flush_on_panic with some really nasty tricks. I didn't bother to fully reconstruct this all, least because the call to bust_spinlock(0); gets moved every few years, depending upon how the wind blows (or well, who screamed loudest about the various issue each call site caused). Before that commit the only calls to console_unblank() where in s390 arch code. The other side here is the console->unblank callback, which was introduced in 2.1.31 for the vt driver. Which predates the console_unblank() function by a lot, which was added (without users) in 2.4.14.3. So pretty much impossible to guess at any motivation here. Also afaict the vt driver is the only (and always was the only) console driver implementing the unblank callback, so no idea why a call to console_unblank() was added for the mtdooops driver - the action actually flushing out the console buffers is done from console_unlock() only. Note that as prep for the s390 users the locking was adjusted in 2.5.22 (I couldn't figure out how to properly reference the BK commit from the historical git trees) from a normal semaphore to a trylock. Note that a copy of the direct unblank_screen() call was added to panic() in `c7c3f05e34` ("panic: avoid deadlocks in re-entrant console drivers"), which partially inlined the bust_spinlocks(0); call. Long story short, I have no idea why the direct call to unblank_screen survived for so long (the infrastructure to do it properly existed for years), nor why it wasn't removed when the console_unblank() call was finally added. But it makes a ton more sense to finally do that than not - it's just better encapsulation to go through the console functions instead of doing a direct call, so let's dare. Plus it really does not make much sense to call the only unblank implementation there is twice, once without, and once with appropriate locking. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Xuezhi Zhang <zhangxuezhi1@coolpad.com> Cc: Yangxi Xiang <xyangxi5@gmail.com> Cc: nick black <dankamongmen@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Marco Elver <elver@google.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20220830145004.430545-1-daniel.vetter@ffwll.ch Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-01 16:55:35 +02:00
Zhen Lei	66d8529d0f	livepatch: Add a missing newline character in klp_module_coming() The error message is not printed immediately because it does not end with a newline character. Before: root@localhost:~# insmod vmlinux.ko insmod: ERROR: could not insert module vmlinux.ko: Invalid parameters After: root@localhost:~# insmod vmlinux.ko [ 43.982558] livepatch: vmlinux.ko: invalid module name insmod: ERROR: could not insert module vmlinux.ko: Invalid parameters Fixes: `dcf550e52f` ("livepatch: Disallow vmlinux.ko") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220830112855.749-1-thunder.leizhen@huawei.com	2022-09-01 16:27:59 +02:00
Rik van Riel	747f7a2901	livepatch: fix race between fork and KLP transition The KLP transition code depends on the TIF_PATCH_PENDING and the task->patch_state to stay in sync. On a normal (forward) transition, TIF_PATCH_PENDING will be set on every task in the system, while on a reverse transition (after a failed forward one) first TIF_PATCH_PENDING will be cleared from every task, followed by it being set on tasks that need to be transitioned back to the original code. However, the fork code copies over the TIF_PATCH_PENDING flag from the parent to the child early on, in dup_task_struct and setup_thread_stack. Much later, klp_copy_process will set child->patch_state to match that of the parent. However, the parent's patch_state may have been changed by KLP loading or unloading since it was initially copied over into the child. This results in the KLP code occasionally hitting this warning in klp_complete_transition: for_each_process_thread(g, task) { WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING)); task->patch_state = KLP_UNDEFINED; } Set, or clear, the TIF_PATCH_PENDING flag in the child task depending on whether or not it is needed at the time klp_copy_process is called, at a point in copy_process where the tasklist_lock is held exclusively, preventing races with the KLP code. The KLP code does have a few places where the state is changed without the tasklist_lock held, but those should not cause problems because klp_update_patch_state(current) cannot be called while the current task is in the middle of fork, klp_check_and_switch_task() which is called under the pi_lock, which prevents rescheduling, and manipulation of the patch state of idle tasks, which do not fork. This should prevent this warning from triggering again in the future, and close the race for both normal and reverse transitions. Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Fixes: `d83a7cb375` ("livepatch: change to a per-task consistency model") Cc: stable@kernel.org Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220808150019.03d6a67b@imladris.surriel.com	2022-09-01 14:53:18 +02:00
Shang XiaoJing	33f9352579	sched/deadline: Move __dl_clear_params out of dl_bw lock As members in sched_dl_entity are independent with dl_bw, move __dl_clear_params out of dl_bw lock. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220827020911.30641-1-shangxiaojing@huawei.com	2022-09-01 11:19:55 +02:00
Shang XiaoJing	96458e7f7d	sched/deadline: Add replenish_dl_new_period helper Wrap repeated code in helper function replenish_dl_new_period, which set the deadline and runtime of input dl_se based on pi_of(dl_se). Note that setup_new_dl_entity originally set the deadline and runtime base on dl_se, which should equals to pi_of(dl_se) for non-boosted task. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826100037.12146-1-shangxiaojing@huawei.com	2022-09-01 11:19:54 +02:00
Shang XiaoJing	973bee493a	sched/deadline: Add dl_task_is_earliest_deadline helper Wrap repeated code in helper function dl_task_is_earliest_deadline, which return true if there is no deadline task on the rq at all, or task's deadline earlier than the whole rq. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826083453.698-1-shangxiaojing@huawei.com	2022-09-01 11:19:54 +02:00
Hou Tao	66a7a92e4d	bpf: Propagate error from htab_lock_bucket() to userspace In __htab_map_lookup_and_delete_batch() if htab_lock_bucket() returns -EBUSY, it will go to next bucket. Going to next bucket may not only skip the elements in current bucket silently, but also incur out-of-bound memory access or expose kernel memory to userspace if current bucket_cnt is greater than bucket_size or zero. Fixing it by stopping batch operation and returning -EBUSY when htab_lock_bucket() fails, and the application can retry or skip the busy batch as needed. Fixes: `20b6cc34ea` ("bpf: Avoid hashtab deadlock with map_locked") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
Hou Tao	2775da2162	bpf: Disable preemption when increasing per-cpu map_locked Per-cpu htab->map_locked is used to prohibit the concurrent accesses from both NMI and non-NMI contexts. But since commit `74d862b682` ("sched: Make migrate_disable/enable() independent of RT"), migrate_disable() is also preemptible under CONFIG_PREEMPT case, so now map_locked also disallows concurrent updates from normal contexts (e.g. userspace processes) unexpectedly as shown below: process A process B htab_map_update_elem() htab_lock_bucket() migrate_disable() /* return 1 / __this_cpu_inc_return() / preempted by B / htab_map_update_elem() / the same bucket as A / htab_lock_bucket() migrate_disable() / return 2, so lock fails */ __this_cpu_inc_return() return -EBUSY A fix that seems feasible is using in_nmi() in htab_lock_bucket() and only checking the value of map_locked for nmi context. But it will re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered through non-tracing program (e.g. fentry program). One cannot use preempt_disable() to fix this issue as htab_use_raw_lock being false causes the bucket lock to be a spin lock which can sleep and does not work with preempt_disable(). Therefore, use migrate_disable() when using the spinlock instead of preempt_disable() and defer fixing concurrent updates to when the kernel has its own BPF memory allocator. Fixes: `74d862b682` ("sched: Make migrate_disable/enable() independent of RT") Reviewed-by: Hao Luo <haoluo@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
Zqiang	528262f502	rcu-tasks: Make RCU Tasks Trace check for userspace execution Userspace execution is a valid quiescent state for RCU Tasks Trace, but the scheduling-clock interrupt does not currently report such quiescent states. Of course, the scheduling-clock interrupt is not strictly speaking userspace execution. However, the only way that this code is not in a quiescent state is if something invoked rcu_read_lock_trace(), and that would be reflected in the ->trc_reader_nesting field in the task_struct structure. Furthermore, this field is checked by rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in turn invoked by rcu_note_voluntary_context_switch() in kernels building at least one of the RCU Tasks flavors. It is therefore safe to invoke rcu_tasks_trace_qs() from the rcu_sched_clock_irq(). But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU Tasks, which lacks the read-side markers provided by RCU Tasks Trace. This raises the possibility that an RCU Tasks grace period could start after the interrupt from userspace execution, but before the call to rcu_sched_clock_irq(). However, it turns out that this is safe because the RCU Tasks grace period waits for an RCU grace period, which will wait for the entire scheduling-clock interrupt handler, including any RCU Tasks read-side critical section that this handler might contain. This commit therefore updates the rcu_sched_clock_irq() function's check for usermode execution and its call to rcu_tasks_classic_qs() to instead check for both usermode execution and interrupt from idle, and to instead call rcu_note_voluntary_context_switch(). This consolidates code and provides more faster RCU Tasks Trace reporting of quiescent states in kernels that do scheduling-clock interrupts for userspace execution. [ paulmck: Consolidate checks into rcu_sched_clock_irq(). ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:55 -07:00
Paul E. McKenney	d6ad60635c	rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states The RCU Tasks Trace grace-period kthread loops across all CPUs, and there can be quite a few CPUs, with some commercially available systems sporting well over a thousand of them. Some of these loops can feature IPIs, which can take some time. This commit therefore places a call to cond_resched_tasks_rcu_qs() in each such loop. Link: https://docs.google.com/document/d/1V0YnG1HTWMt9WHJjroiJL9lf-hMrud4v8Fn3fhyY0cI/edit?usp=sharing Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:55 -07:00
Zqiang	fcd53c8a4d	rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() Kernels built with CONFIG_PROVE_RCU=y and CONFIG_DEBUG_LOCK_ALLOC=y attempt to emit a warning when the synchronize_rcu_tasks_generic() function is called during early boot while the rcu_scheduler_active variable is RCU_SCHEDULER_INACTIVE. However the warnings is not actually be printed because the debug_lockdep_rcu_enabled() returns false, exactly because the rcu_scheduler_active variable is still equal to RCU_SCHEDULER_INACTIVE. This commit therefore replaces RCU_LOCKDEP_WARN() with WARN_ONCE() to force these warnings to actually be printed. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:54 -07:00
Paul E. McKenney	5fe89191e4	srcu: Make Tiny SRCU use full-sized grace-period counters This commit makes Tiny SRCU use full-sized grace-period counters to further avoid counter-wrap issues when using polled grace-period APIs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:15 -07:00
Paul E. McKenney	de3f2671ae	srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise This commit applies the more-precise grace-period-state check used by rcu_seq_done_exact() to poll_state_synchronize_srcu(). This is important because Tiny SRCU uses a 16-bit counter, which can wrap quite quickly. If counter wrap continues to be a problem, then expanding ->srcu_idx and ->srcu_idx_max to 32 bits might be warranted. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:15 -07:00
Paul E. McKenney	599d97e3f2	rcutorture: Make "srcud" option also test polled grace-period API This commit brings the "srcud" (dynamically allocated) SRCU test in line with the "srcu" (statically allocated) test, so that both test the full SRCU polled grace-period API. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:10:15 -07:00
Paul E. McKenney	967c298d65	rcutorture: Limit read-side polling-API testing RCU's polled grace-period API is reasonably lightweight, but still contains heavyweight memory barriers. This commit therefore limits testing of this API from rcutorture's readers in order to avoid the false negatives that these heavyweight operations could provoke. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:22 -07:00
Paul E. McKenney	5d7801f201	rcutorture: Expand rcu_torture_write_types() first "if" statement This commit expands the rcu_torture_write_types() function's first "if" condition and body, placing one element per line, in order to make the compiler's error messages more helpful. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:22 -07:00
Paul E. McKenney	cc8faf5b65	rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check This commit changes the use of gp_poll_exp to gp_poll_exp1 in the first check in rcu_torture_write_types(). No functional effect, but consistency is a good thing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:22 -07:00
Paul E. McKenney	d761de8a7d	rcu: Make synchronize_rcu() fastpath update only boot-CPU counters Large systems can have hundreds of rcu_node structures, and updating counters in each of them might slow down booting. This commit therefore updates only the counters in those rcu_node structures corresponding to the boot CPU, up to and including the root rcu_node structure. The counters for the remaining rcu_node structures are updated by the rcu_scheduler_starting() function, which executes just before the first non-boot kthread is spawned. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:22 -07:00
Paul E. McKenney	b3cdd0a79c	rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal Now that rcu_gp_oldstate can accurately track both normal and expedited grace periods regardless of system state, rcutorture's rcu_poll_need_2gp() function need only call for a second grace period for the old single-unsigned-long grace-period polling APIs This commit therefore adjusts rcu_poll_need_2gp() accordingly. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:21 -07:00
Paul E. McKenney	7ecef0871d	rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure Because both normal and expedited grace periods increment their respective counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate structure no longer needs its ->rgos_polled field. This commit therefore removes this field, shrinking this structure so that it is the same size as an rcu_head structure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:21 -07:00
Paul E. McKenney	43ff97cc99	rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence This commit causes the early boot single-CPU synchronize_rcu_expedited() fastpath to update the rcu_state structure's ->expedited_sequence counter. This will allow the full-state polled grace-period APIs to detect all expedited grace periods without the need to track the special combined polling-only counter, which is another step towards removing the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size by one third. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:21 -07:00
Paul E. McKenney	e8755d2bde	rcu: Remove expedited grace-period fast-path forward-progress helper Now that the expedited grace-period fast path can only happen during the pre-scheduler portion of early boot, this fast path can no longer block run-time RCU Trace grace periods. This commit therefore removes the conditional cond_resched() invocation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:21 -07:00
Paul E. McKenney	910e12092e	rcu: Make synchronize_rcu() fast path update ->gp_seq counters This commit causes the early boot single-CPU synchronize_rcu() fastpath to update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed counters. This will allow the full-state polled grace-period APIs to detect all normal grace periods without the need to track the special combined polling-only counter, which is a step towards removing the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size by one third. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:09:21 -07:00
Paul E. McKenney	5f11bad6b7	rcu-tasks: Remove grace-period fast-path rcu-tasks helper Now that the grace-period fast path can only happen during the pre-scheduler portion of early boot, this fast path can no longer block run-time RCU Tasks and RCU Tasks Trace grace periods. This commit therefore removes the conditional cond_resched_tasks_rcu_qs() invocation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	a5d1b0b68a	rcu: Set rcu_data structures' initial ->gpwrap value to true It would be good do reduce the size of the rcu_gp_oldstate structure from three unsigned long instances to two, but this requires that the boot-time optimized grace periods update the various ->gp_seq fields. Updating these fields in the rcu_state structure and in all of the rcu_node structures is at least semi-reasonable, but updating them in all of the rcu_data structures is a bridge too far. This means that if there are too many early boot-time grace periods, the ->gp_seq field in the rcu_data structure cannot be trusted. This commit therefore sets each rcu_data structure's ->gpwrap field to provide the necessary impetus for a suitable level of distrust. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	258f887aba	rcu: Disable run-time single-CPU grace-period optimization The run-time single-CPU grace-period optimization applies only to kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running on a single-CPU system. But a kernel intended for a single-CPU system should instead be built with CONFIG_SMP=n, and in any case, single-CPU systems running Linux no longer appear to be the common case. Plus this optimization results in the rcu_gp_oldstate structure being half again larger than it needs to be. This commit therefore disables the run-time single-CPU grace-period optimization, so that this optimization applies only during the pre-scheduler portion of the boot sequence. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	8df13f0160	rcu: Add full-sized polling for cond_sync_exp_full() The cond_synchronize_rcu_expedited() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the cond_synchronize_rcu_exp_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	b6fe4917ae	rcu: Add full-sized polling for cond_sync_full() The cond_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the cond_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	f21e014345	rcu: Remove blank line from poll_state_synchronize_rcu() docbook header This commit removes the blank line preceding the oldstate parameter to the docbook header for the poll_state_synchronize_rcu() function and marks uses of this parameter later in that header. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	6c502b14ba	rcu: Add full-sized polling for start_poll_expedited() The start_poll_synchronize_rcu_expedited() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the start_poll_synchronize_rcu_expedited_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	76ea364161	rcu: Add full-sized polling for start_poll() The start_poll_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the start_poll_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:08 -07:00
Paul E. McKenney	f4754ad292	rcutorture: Verify long-running reader prevents full polling from completing This commit adds full-state polling checks to accompany the old-style polling checks in the rcu_torture_one_read() function. If a polling cycle within an RCU reader completes, a WARN_ONCE() is triggered. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	37d6ade31c	rcutorture: Remove redundant RTWS_DEF_FREE check This check does nothing because the state at this point in the code because the rcu_torture_writer_state value is guaranteed to instead be RTWS_REPLACE. This commit therefore removes this check. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	d594231aa5	rcutorture: Verify RCU reader prevents full polling from completing This commit adds a test to rcu_torture_writer() that verifies that a ->get_gp_state_full() and ->poll_gp_state_full() polled grace-period sequence does not claim that a grace period elapsed within the confines of the corresponding read-side critical section. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	ed7d2f1abe	rcutorture: Allow per-RCU-flavor polled double-GP check Only vanilla RCU needs a double grace period for its compressed polled grace-period old-state cookie. This commit therefore adds an rcu_torture_ops per-flavor function ->poll_need_2gp to allow this check to be adapted to the RCU flavor under test. A NULL pointer for this function says that doubled grace periods are never needed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	ccb42229fb	rcutorture: Abstract synchronous and polled API testing This commit abstracts a do_rtws_sync() function that does synchronous grace-period testing, but also testing the polled API 25% of the time each for the normal and full-state variants of the polled API. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	3fdefca9b4	rcu: Add full-sized polling for get_state() The get_state_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the get_state_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	91a967fd69	rcu: Add full-sized polling for get_completed() and poll_state() The get_completed_synchronize_rcu() and poll_state_synchronize_rcu() APIs compress the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the first members of the full-state RCU grace-period polling API, namely the get_completed_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions. These use up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but which are guaranteed not to miss grace periods, at least in situations where the single-CPU grace-period optimization does not apply. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:08:07 -07:00
Paul E. McKenney	638dce227a	rcu/nocb: Add CPU number to CPU-{,de}offload failure messages Offline CPUs cannot be offloaded or deoffloaded. Any attempt to offload or deoffload an offline CPU causes a message to be printed on the console, which is good, but this message does not contain the CPU number, which is bad. Such a CPU number can be helpful when debugging, as it gives a clear indication that the CPU in question is in fact offline. This commit therefore adds the CPU number to the CPU-{,de}offload failure messages. Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:07:19 -07:00
Zqiang	5334da2af2	rcu/nocb: Choose the right rcuog/rcuop kthreads to output The show_rcu_nocb_gp_state() function is supposed to dump out the rcuog kthread and the show_rcu_nocb_state() function is supposed to dump out the rcuo[ps] kthread. Currently, both do a mixture, which is not optimal for debugging, even though it does not affect functionality. This commit therefore adjusts these two functions to focus on their respective kthreads. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:07:19 -07:00
Uladzislau Rezki (Sony)	51824b780b	rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval Currently the monitor work is scheduled with a fixed interval of HZ/20, which is roughly 50 milliseconds. The drawback of this approach is low utilization of the 512 page slots in scenarios with infrequence kvfree_rcu() calls. For example on an Android system: <snip> kworker/3:3-507 [003] .... 470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6 kworker/6:1-76 [006] .... 470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1 kworker/6:1-76 [006] .... 470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9 kworker/3:3-507 [003] .... 471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48 kworker/1:1-73 [001] .... 471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3 kworker/1:1-73 [001] .... 471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76 kworker/0:4-1411 [000] .... 472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1 kworker/0:4-1411 [000] .... 472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5 kworker/6:1-76 [006] .... 472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102 <snip> In many cases, out of 512 slots, fewer than 10 were actually used. In order to improve batching and make utilization more efficient this commit sets a drain interval to a fixed 5-seconds interval. Floods are detected when a page fills quickly, and in that case, the reclaim work is re-scheduled for the next scheduling-clock tick (jiffy). After this change: <snip> kworker/7:1-371 [007] .... 5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121 kworker/7:1-371 [007] .... 5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47 kworker/7:1-371 [007] .... 5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510 kworker/7:1-371 [007] .... 5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169 kworker/7:1-371 [007] .... 5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510 kworker/5:6-9428 [005] .... 5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123 kworker/4:7-9434 [004] .... 5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322 kworker/4:7-9434 [004] .... 5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185 kworker/7:1-371 [007] .... 5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510 <snip> Here, all but one of the cases, more than one hundreds slots were used, representing an order-of-magnitude improvement. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:06:50 -07:00
Joel Fernandes (Google)	3826909635	rcu/kfree: Fix kfree_rcu_shrink_count() return value As per the comments in include/linux/shrinker.h, .count_objects callback should return the number of freeable items, but if there are no objects to free, SHRINK_EMPTY should be returned. The only time 0 is returned should be when we are unable to determine the number of objects, or the cache should be skipped for another reason. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:06:50 -07:00
Michal Hocko	093590c16b	rcu: Back off upon fill_page_cache_func() allocation failure The fill_page_cache_func() function allocates couple of pages to store kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY) allocation which can fail under memory pressure. The function will, however keep retrying even when the previous attempt has failed. This retrying is in theory correct, but in practice the allocation is invoked from workqueue context, which means that if the memory reclaim gets stuck, these retries can hog the worker for quite some time. Although the workqueues subsystem automatically adjusts concurrency, such adjustment is not guaranteed to happen until the worker context sleeps. And the fill_page_cache_func() function's retry loop is not guaranteed to sleep (see the should_reclaim_retry() function). And we have seen this function cause workqueue lockups: kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s! [...] kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146 kernel: pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5 kernel: in-flight: 1917:fill_page_cache_func kernel: pending: dbs_work_handler, free_work, kfree_rcu_monitor Originally, we thought that the root cause of this lockup was several retries with direct reclaim, but this is not yet confirmed. Furthermore, we have seen similar lockups without any heavy memory pressure. This suggests that there are other factors contributing to these lockups. However, it is not really clear that endless retries are desireable. So let's make the fill_page_cache_func() function back off after allocation failure. Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:06:50 -07:00
Paul E. McKenney	7634b1eaa0	rcu: Exclude outgoing CPU when it is the last to leave The rcu_boost_kthread_setaffinity() function removes the outgoing CPU from the set_cpus_allowed() mask for the corresponding leaf rcu_node structure's rcub priority-boosting kthread. Except that if the outgoing CPU will leave that structure without any online CPUs, the mask is set to the housekeeping CPU mask from housekeeping_cpumask(). Which is fine unless the outgoing CPU happens to be a housekeeping CPU. This commit therefore removes the outgoing CPU from the housekeeping mask. This would of course be problematic if the outgoing CPU was the last online housekeeping CPU, but in that case you are in a world of hurt anyway. If someone comes up with a valid use case for a system needing all the housekeeping CPUs to be offline, further adjustments can be made. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:06:03 -07:00
Zqiang	621189a1fe	rcu: Avoid triggering strict-GP irq-work when RCU is idle Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger irq-work from rcu_read_unlock(), and the resulting irq-work handler invokes rcu_preempt_deferred_qs_handle(). The point of this triggering is to force grace periods to end quickly in order to give tools like KASAN a better chance of detecting RCU usage bugs such as leaking RCU-protected pointers out of an RCU read-side critical section. However, this irq-work triggering is unconditional. This works, but there is no point in doing this irq-work unless the current grace period is waiting on the running CPU or task, which is not the common case. After all, in the common case there are many rcu_read_unlock() calls per CPU per grace period. This commit therefore triggers the irq-work only when the current grace period is waiting on the running CPU or task. This change was tested as follows on a four-CPU system: echo rcu_preempt_deferred_qs_handler > /sys/kernel/debug/tracing/set_ftrace_filter echo 1 > /sys/kernel/debug/tracing/function_profile_enabled insmod rcutorture.ko sleep 20 rmmod rcutorture.ko echo 0 > /sys/kernel/debug/tracing/function_profile_enabled echo > /sys/kernel/debug/tracing/set_ftrace_filter This procedure produces results in this per-CPU set of files: /sys/kernel/debug/tracing/trace_stat/function* Sample output from one of these files is as follows: Function Hit Time Avg s^2 -------- --- ---- --- --- rcu_preempt_deferred_qs_handle 838746 182650.3 us 0.217 us 0.004 us The baseline sum of the "Hit" values (the number of calls to this function) was 3,319,015. With this commit, that sum was 1,140,359, for a 2.9x reduction. The worst-case variance across the CPUs was less than 25%, so this large effect size is statistically significant. The raw data is available in the Link: URL. Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/ Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:06:02 -07:00
Zhen Lei	bc1cca97e6	sched/debug: Show the registers of 'current' in dump_cpu_task() The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com>	2022-08-31 05:05:49 -07:00
Zhen Lei	e73dfe3093	sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task() The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com>	2022-08-31 05:03:14 -07:00
Paul E. McKenney	089254fd38	rcu: Document reason for rcu_all_qs() call to preempt_disable() Given that rcu_all_qs() is in non-preemptible kernels, why on earth should it invoke preempt_disable()? This commit adds the reason, which is to work nicely with debugging enabled in CONFIG_PREEMPT_COUNT=y kernels. Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:03:14 -07:00
Zqiang	6ca0292ccf	rcu: Make tiny RCU support leak callbacks for debug-object errors Currently, only Tree RCU leaks callbacks setting when it detects a duplicate call_rcu(). This commit causes Tiny RCU to also leak callbacks in this situation. Because this is Tiny RCU, kernel size is important: 1. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=n (Production kernel) Original: text data bss dec hex filename 26290663 20159823 15212544 61663030 3ace736 vmlinux With this commit: text data bss dec hex filename 26290663 20159823 15212544 61663030 3ace736 vmlinux 2. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=y (Debugging kernel) Original: text data bss dec hex filename 26291319 20160143 15212544 61664006 3aceb06 vmlinux With this commit: text data bss dec hex filename 26291319 20160431 15212544 61664294 3acec26 vmlinux These results show that the kernel size is unchanged for production kernels, as desired. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:03:14 -07:00
Zqiang	fcb42c9a77	rcu: Add QS check in rcu_exp_handler() for non-preemptible kernels Kernels built with CONFIG_PREEMPTION=n and CONFIG_PREEMPT_COUNT=y maintain preempt_count() state. Because such kernels map __rcu_read_lock() and __rcu_read_unlock() to preempt_disable() and preempt_enable(), respectively, this allows the expedited grace period's !CONFIG_PREEMPT_RCU version of the rcu_exp_handler() IPI handler function to use preempt_count() to detect quiescent states. This preempt_count() usage might seem to risk failures due to use of implicit RCU readers in portions of the kernel under #ifndef CONFIG_PREEMPTION, except that rcu_core() already disallows such implicit RCU readers. The moral of this story is that you must use explicit read-side markings such as rcu_read_lock() or preempt_disable() even if the code knows that this kernel does not support preemption. This commit therefore adds a preempt_count()-based check for a quiescent state in the !CONFIG_PREEMPT_RCU version of the rcu_exp_handler() function for kernels built with CONFIG_PREEMPT_COUNT=y, reporting an immediate quiescent state when the interrupted code had both preemption and softirqs enabled. This change results in about a 2% reduction in expedited grace-period latency in kernels built with both CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20220622103549.2840087-1-qiang1.zhang@intel.com/	2022-08-31 05:03:14 -07:00
Zqiang	bca4fa8cb0	rcu: Update rcu_preempt_deferred_qs() comments for !PREEMPT kernels In non-premptible kernels, tasks never do context switches within RCU read-side critical sections. Therefore, in such kernels, each leaf rcu_node structure's ->blkd_tasks list will always be empty. The comment on the non-preemptible version of rcu_preempt_deferred_qs() confuses this point, so this commit therefore fixes it. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:03:14 -07:00
Zqiang	6d60ea03ac	rcu: Fix rcu_read_unlock_strict() strict QS reporting Kernels built with CONFIG_PREEMPT=n and CONFIG_RCU_STRICT_GRACE_PERIOD=y report the quiescent state directly from the outermost rcu_read_unlock(). However, the current CPU's rcu_data structure's ->cpu_no_qs.b.norm might still be set, in which case rcu_report_qs_rdp() will exit early, thus failing to report quiescent state. This commit therefore causes rcu_read_unlock_strict() to clear CPU's rcu_data structure's ->cpu_no_qs.b.norm field before invoking rcu_report_qs_rdp(). Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-08-31 05:03:14 -07:00
Marco Elver	ecdfb8896f	perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets We can still see that a majority of the time is spent hashing task pointers: ... 16.98% [kernel] [k] rhashtable_jhash2 ... Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus), calling task_bp_pinned() for each CPU, even if task_bp_pinned() is CPU-independent. The reason for this is to update the per-CPU 'tsk_pinned' histogram. To optimize the CPU-independent case to O(1), keep a separate CPU-independent 'tsk_pinned_all' histogram. The major source of complexity are transitions between "all CPU-independent task breakpoints" and "mixed CPU-independent and CPU-dependent task breakpoints". The code comments list all cases that require handling. After this optimization: \| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 100 threads with 4 breakpoints and 128 parallelism \| Total time: 1.758 [sec] \| \| 34.336621 usecs/op \| 4395.087500 usecs/op/cpu 38.08% [kernel] [k] queued_spin_lock_slowpath 10.81% [kernel] [k] smp_cfm_core_cond 3.01% [kernel] [k] update_sg_lb_stats 2.58% [kernel] [k] osq_lock 2.57% [kernel] [k] llist_reverse_order 1.45% [kernel] [k] find_next_bit 1.21% [kernel] [k] flush_tlb_func_common 1.01% [kernel] [k] arch_install_hw_breakpoint Showing that the time spent hashing keys has become insignificant. With the given benchmark parameters, that's an improvement of 12% compared with the old O(#cpus) version. And finally, using the less aggressive parameters from the preceding changes, we now observe: \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.067 [sec] \| \| 35.292187 usecs/op \| 2258.700000 usecs/op/cpu Which is an improvement of 12% compared to without the histogram optimizations (baseline is 40 usecs/op). This is now on par with the theoretical ideal (constraints disabled), and only 12% slower than no breakpoints at all. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com	2022-08-30 10:56:24 +02:00
Marco Elver	9b1933b864	perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets Running the perf benchmark with (note: more aggressive parameters vs. preceding changes, but same 256 CPUs host): \| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 100 threads with 4 breakpoints and 128 parallelism \| Total time: 1.989 [sec] \| \| 38.854160 usecs/op \| 4973.332500 usecs/op/cpu 20.43% [kernel] [k] queued_spin_lock_slowpath 18.75% [kernel] [k] osq_lock 16.98% [kernel] [k] rhashtable_jhash2 8.34% [kernel] [k] task_bp_pinned 4.23% [kernel] [k] smp_cfm_core_cond 3.65% [kernel] [k] bcmp 2.83% [kernel] [k] toggle_bp_slot 1.87% [kernel] [k] find_next_bit 1.49% [kernel] [k] __reserve_bp_slot We can see that a majority of the time is now spent hashing task pointers to index into task_bps_ht in task_bp_pinned(). Obtaining the max_bp_pinned_slots() for CPU-independent task targets currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if the result of task_bp_pinned() is CPU-independent. The loop in max_bp_pinned_slots() wants to compute the maximum slots across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by obtaining the max slots across all CPUs and adding task_bp_pinned(). To do so in O(1), use a bp_slots_histogram for CPU-pinned slots. After this optimization: \| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 100 threads with 4 breakpoints and 128 parallelism \| Total time: 1.930 [sec] \| \| 37.697832 usecs/op \| 4825.322500 usecs/op/cpu 19.13% [kernel] [k] queued_spin_lock_slowpath 18.21% [kernel] [k] rhashtable_jhash2 15.46% [kernel] [k] osq_lock 6.27% [kernel] [k] toggle_bp_slot 5.91% [kernel] [k] task_bp_pinned 5.05% [kernel] [k] smp_cfm_core_cond 1.78% [kernel] [k] update_sg_lb_stats 1.36% [kernel] [k] llist_reverse_order 1.34% [kernel] [k] find_next_bit 1.19% [kernel] [k] bcmp Suggesting that time spent in task_bp_pinned() has been reduced. However, we're still hashing too much, which will be addressed in the subsequent change. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com	2022-08-30 10:56:24 +02:00
Marco Elver	16db2839a5	perf/hw_breakpoint: Introduce bp_slots_histogram Factor out the existing `atomic_t count[N]` into its own struct called 'bp_slots_histogram', to generalize and make its intent clearer in preparation of reusing elsewhere. The basic idea of bucketing "total uses of N slots" resembles a histogram, so calling it such seems most intuitive. No functional change. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com	2022-08-30 10:56:24 +02:00
Marco Elver	0912037fec	perf/hw_breakpoint: Reduce contention with large number of tasks While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue: 95.93% [kernel] [k] osq_lock 0.70% [kernel] [k] mutex_spin_on_owner 0.22% [kernel] [k] smp_cfm_core_cond 0.18% [kernel] [k] task_bp_pinned 0.18% [kernel] [k] rhashtable_jhash2 0.15% [kernel] [k] queued_spin_lock_slowpath when running the breakpoint benchmark with (system with 256 CPUs): \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.207 [sec] \| \| 108.267188 usecs/op \| 6929.100000 usecs/op/cpu The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed. The access pattern is as follows: 1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused. 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked. Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows: 1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t. 2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot). We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex): 1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints. 2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer. With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile: \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.077 [sec] \| \| 40.201563 usecs/op \| 2572.900000 usecs/op/cpu 21.54% [kernel] [k] task_bp_pinned 20.18% [kernel] [k] rhashtable_jhash2 6.81% [kernel] [k] toggle_bp_slot 5.47% [kernel] [k] queued_spin_lock_slowpath 3.75% [kernel] [k] smp_cfm_core_cond 3.48% [kernel] [k] bcmp On this particular setup that's a speedup of 2.7x. We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled: \| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.067 [sec] \| \| 35.286458 usecs/op \| 2258.333333 usecs/op/cpu Which means the current implementation is ~12% slower than the theoretical ideal. For reference, performance without any breakpoints: \| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 0 breakpoints and 64 parallelism \| Total time: 0.060 [sec] \| \| 31.365625 usecs/op \| 2007.400000 usecs/op/cpu On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com	2022-08-30 10:56:24 +02:00
Marco Elver	01fe8a3f81	locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked() Implement simple accessors to probe percpu-rwsem's locked state: percpu_is_write_locked(), percpu_is_read_locked(). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-11-elver@google.com	2022-08-30 10:56:23 +02:00
Marco Elver	24198ad373	perf/hw_breakpoint: Remove useless code related to flexible breakpoints Flexible breakpoints have never been implemented, with bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4 bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max flexible count in fetch_bp_busy_slots(). This again causes suboptimal code generation, when we always know that `!!slots.flexible` will be 0. Just get rid of the flexible "placeholder" and remove all real code related to it. Make a note in the comment related to the constraints algorithm but don't remove them from the algorithm, so that if in future flexible breakpoints need supporting, it should be trivial to revive them (along with reverting this change). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com	2022-08-30 10:56:22 +02:00
Marco Elver	9caf87be11	perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Due to being a __weak function, hw_breakpoint_weight() will cause the compiler to always emit a call to it. This generates unnecessarily bad code (register spills etc.) for no good reason; in fact it appears in profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`: ... 0.70% [kernel] [k] hw_breakpoint_weight ... While a small percentage, no architecture defines its own hw_breakpoint_weight() nor are there users outside hw_breakpoint.c, which makes the fact it is currently __weak a poor choice. Change hw_breakpoint_weight()'s definition to follow a similar protocol to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines hw_breakpoint_weight(), we'll use it instead. The result is that it is inlined and no longer shows up in profiles. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com	2022-08-30 10:56:22 +02:00
Marco Elver	be3f152568	perf/hw_breakpoint: Optimize constant number of breakpoint slots Optimize internal hw_breakpoint state if the architecture's number of breakpoint slots is constant. This avoids several kmalloc() calls and potentially unnecessary failures if the allocations fail, as well as subtly improves code generation and cache locality. The protocol is that if an architecture defines hw_breakpoint_slots via the preprocessor, it must be constant and the same for all types. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com	2022-08-30 10:56:22 +02:00
Marco Elver	db5f6f8531	perf/hw_breakpoint: Mark data __ro_after_init Mark read-only data after initialization as __ro_after_init. While we are here, turn 'constraints_initialized' into a bool. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com	2022-08-30 10:56:21 +02:00
Marco Elver	0370dc314d	perf/hw_breakpoint: Optimize list of per-task breakpoints On a machine with 256 CPUs, running the recently added perf breakpoint benchmark results in: \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 236.418 [sec] \| \| 123134.794271 usecs/op \| 7880626.833333 usecs/op/cpu The benchmark tests inherited breakpoint perf events across many threads. Looking at a perf profile, we can see that the majority of the time is spent in various hw_breakpoint.c functions, which execute within the 'nr_bp_mutex' critical sections which then results in contention on that mutex as well: 37.27% [kernel] [k] osq_lock 34.92% [kernel] [k] mutex_spin_on_owner 12.15% [kernel] [k] toggle_bp_slot 11.90% [kernel] [k] __reserve_bp_slot The culprit here is task_bp_pinned(), which has a runtime complexity of O(#tasks) due to storing all task breakpoints in the same list and iterating through that list looking for a matching task. Clearly, this does not scale to thousands of tasks. Instead, make use of the "rhashtable" variant "rhltable" which stores multiple items with the same key in a list. This results in average runtime complexity of O(1) for task_bp_pinned(). With the optimization, the benchmark shows: \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.208 [sec] \| \| 108.422396 usecs/op \| 6939.033333 usecs/op/cpu On this particular setup that's a speedup of ~1135x. While one option would be to make task_struct a breakpoint list node, this would only further bloat task_struct for infrequently used data. Furthermore, after all optimizations in this series, there's no evidence it would result in better performance: later optimizations make the time spent looking up entries in the hash table negligible (we'll reach the theoretical ideal performance i.e. no constraints). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com	2022-08-30 10:56:21 +02:00
Marco Elver	089cdcb0cd	perf/hw_breakpoint: Clean up headers Clean up headers: - Remove unused <linux/kallsyms.h> - Remove unused <linux/kprobes.h> - Remove unused <linux/module.h> - Remove unused <linux/smp.h> - Add <linux/export.h> for EXPORT_SYMBOL_GPL(). - Add <linux/mutex.h> for mutex. - Sort alphabetically. - Move <linux/hw_breakpoint.h> to top to test it compiles on its own. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com	2022-08-30 10:56:21 +02:00
Marco Elver	c5b81449f9	perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test Provide hw_breakpoint_is_used() to check if breakpoints are in use on the system. Use it in the KUnit test to verify the global state before and after a test case. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-3-elver@google.com	2022-08-30 10:56:20 +02:00
Marco Elver	724c299c6a	perf/hw_breakpoint: Add KUnit test for constraints accounting Add KUnit test for hw_breakpoint constraints accounting, with various interesting mixes of breakpoint targets (some care was taken to catch interesting corner cases via bug-injection). The test cannot be built as a module because it requires access to hw_breakpoint_slots(), which is not inlinable or exported on all architectures. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com	2022-08-30 10:56:20 +02:00
Ingo Molnar	53aa930dc4	Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2022-08-30 10:28:15 +02:00
wuchi	501e4bb102	audit: use time_after to compare time Using time_{*} macro to compare time is better Signed-off-by: wuchi <wuchi.zero@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-29 19:47:03 -04:00
Jakub Kicinski	9c5d03d362	genetlink: start to validate reserved header bytes We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net>	2022-08-29 12:47:15 +01:00
Linus Torvalds	b467192ec7	Seventeen hotfixes. Mostly memory management things. Ten patches are cc:stable, addressing pre-6.0 issues. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwvgrAAKCRDdBJ7gKXxA jlweAQC9dzE08Elxl4F7Uvxe+62JWVeflBRrT7sJ6jU1Gu3QcQEAhhI1Xit3/MGq pRytDBObGADxlA67c9eNq6J5pCT/7gE= =pD67 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more hotfixes from Andrew Morton: "Seventeen hotfixes. Mostly memory management things. Ten patches are cc:stable, addressing pre-6.0 issues" * tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: .mailmap: update Luca Ceresoli's e-mail address mm/mprotect: only reference swap pfn page if type match squashfs: don't call kmalloc in decompressors mm/damon/dbgfs: avoid duplicate context directory creation mailmap: update email address for Colin King asm-generic: sections: refactor memory_intersects bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown Revert "memcg: cleanup racy sum avoidance code" mm/zsmalloc: do not attempt to free IS_ERR handle binder_alloc: add missing mmap_lock calls when using the VMA mm: re-allow pinning of zero pfns (again) vmcoreinfo: add kallsyms_num_syms symbol mailmap: update Guilherme G. Piccoli's email addresses writeback: avoid use-after-free after removing device shmem: update folio if shmem_replace_page() updates the page mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte	2022-08-28 14:49:59 -07:00
Stephen Brennan	f09bddbd86	vmcoreinfo: add kallsyms_num_syms symbol The rest of the kallsyms symbols are useless without knowing the number of symbols in the table. In an earlier patch, I somehow dropped the kallsyms_num_syms symbol, so add it back in. Link: https://lkml.kernel.org/r/20220808205410.18590-1-stephen.s.brennan@oracle.com Fixes: `5fd8fea935` ("vmcoreinfo: include kallsyms symbols") Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-08-28 14:02:44 -07:00
Linus Torvalds	17b28d4267	audit/stable-6.0 PR 20220826 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmMJG04UHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOBtBAAkfUY6U8EtMvrPPu6kMyREPdU/9Zh wCBrKjY59fMWOl1RT8zYqyZaCZRvSc/Wd73XLvU2r0pf83N3i6sH7CozVhQyhM8H icNSzFRcZetaaOu2VKvfp5sSHR0ulLlYy26+zud6Syl/F7AJVwID0wsyHLVMuLs0 PVb+oOoOoHzLdAxY6GlwHFHww3NgDPuYTo2v/19AAQ9f9HHHbr8iMwso4kBPA3TX x6tS/0YNKdAKAEtzwBmLQ7d8rFsjuBVActzoIOHjSluH5hg7UrrY4OwSOK1tp0bY r+tnpa4M1bBBqxgNlHY9CHlpveNNzDtiDNjxOA/EsGHyNPrjkna017MEc9kGO7Bn uwu0ytGoLt/IWeWdn3edmlDJtg782JmGI5YS3ihCE6vrqjd1sDh6QUVGMMy29Cm2 dSPp1WY+I7IW9zTD1RzsdqDWdtnuN2XL591VxPW8WyvcU4QS5bBXQmUT+T8Ribkr jsZHiG4GqozF7bzuN38iw+MO2dV7TFvrzTQmqbji/8cDC68QANagdBaqUx8dGZ1w itW6UDZiUeSN8XUNJgDNX2b7jxnVPpEBQ1a0Ncbo6ykfZ4NKKujGE2kv7GMJ2d7x vYP/MxQdw15hQsSlT3vhmCQq6OpchpLUIywIsT3uTYATb5dMHDaWW7RtUg55/yNv xxiKWBMeALHGE9w= =j67g -----END PGP SIGNATURE----- Merge tag 'audit-pr-20220826' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "Another small audit patch, this time to fix a bug where the return codes were not properly set before the audit filters were run, potentially resulting in missed audit records" * tag 'audit-pr-20220826' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: move audit_return_fixup before the filters	2022-08-27 15:31:12 -07:00
Shang XiaoJing	5531ecffa4	sched: Add update_current_exec_runtime helper Wrap repeated code in helper function update_current_exec_runtime for update the exec time of the current. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220824082856.15674-1-shangxiaojing@huawei.com	2022-08-27 00:05:35 +02:00
Richard Guy Briggs	c3f3ea8af4	audit: free audit_proctitle only on task exit Since audit_proctitle is generated at syscall exit time, its value is used immediately and cached for the next syscall. Since this is the case, then only clear it at task exit time. Otherwise, there is no point in caching the value OR bearing the overhead of regenerating it. Fixes: `12c5e81d3f` ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-26 17:18:54 -04:00
Richard Guy Briggs	3ed66951f9	audit: explicitly check audit_context->context enum value Be explicit in checking the struct audit_context "context" member enum value rather than assuming the order of context enum values. Fixes: `12c5e81d3f` ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-26 17:17:11 -04:00
Tetsuo Handa	075b593f54	cgroup: Use cgroup_attach_{lock,unlock}() from cgroup_attach_task_all() No behavior changes; preparing for potential locking changes in future. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by:Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-26 11:14:34 -10:00
Tejun Heo	265efc941f	Merge branch 'for-6.0-fixes' into for-6.1 Pulling to receive `43626dade3` ("group: Add missing cpus_read_lock() to cgroup_attach_task_all()") for a follow-up patch.	2022-08-26 11:13:39 -10:00
Richard Guy Briggs	e84d9f5214	audit: audit_context pid unused, context enum comment fix The pid member of struct audit_context is never used. Remove it. The audit_reset_context() comment about unconditionally resetting "ctx->state" should read "ctx->context". Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-26 17:06:00 -04:00
Michal Koutný	fa7e439cf9	cgroup: Homogenize cgroup_get_from_id() return value Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()). This change also fixes commit `d4ccaf58a8` ("bpf: Introduce cgroup iter") that would use NULL instead of proper error handling in `d4ccaf58a8` ("bpf: Introduce cgroup iter"). Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive CONFIG_MEMCG respectively) transitive, so drop the singular definition not needed with !CONFIG_CGROUPS. Fixes: `d4ccaf58a8` ("bpf: Introduce cgroup iter") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-26 10:57:41 -10:00
Michal Koutný	4534dee941	cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id Cgroup ids are resolved in the global scope. That may be needed sometime (in future) but currently it violates virtual view provided through cgroup namespaces. There are currently following users of the resolution: - fc_appid_store - bpf_iter_attach_cgroup - mem_cgroup_get_from_ino None of the is a called on behalf of kernel but the resolution is made with proper userspace context, hence the default to current->nsproxy makes sens. (This doesn't rule out cgroup_get_from_id with cgroup NS parameter in the future.) Since cgroup ids are defined on v2 hierarchy only, we simply check existence in the cgroup namespace by looking at ancestry on the default hierarchy. Fixes: `6b658c4863` ("scsi: cgroup: Add cgroup_get_from_id()") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-26 10:57:10 -10:00
Michal Koutný	74e4b956eb	cgroup: Honor caller's cgroup NS when resolving path cgroup_get_from_path() is not widely used function. Its callers presume the path is resolved under cgroup namespace. (There is one caller currently and resolving in init NS won't make harm (netfilter). However, future users may be subject to different effects when resolving globally.) Since, there's currently no use for the global resolution, modify the existing function to take cgroup NS into account. Fixes: `a79a908fd2` ("cgroup: introduce cgroup namespaces") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-26 10:56:26 -10:00
Mikulas Patocka	8238b45798	wait_on_bit: add an acquire memory barrier There are several places in the kernel where wait_on_bit is not followed by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On architectures with weak memory ordering, it may happen that memory accesses that follow wait_on_bit are reordered before wait_on_bit and they may return invalid data. Fix this class of bugs by introducing a new function "test_bit_acquire" that works like test_bit, but has acquire memory ordering semantics. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-08-26 09:30:25 -07:00
David S. Miller	2e085ec0e2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel borkmann says: ==================== The following pull-request contains BPF updates for your net tree. We've added 11 non-merge commits during the last 14 day(s) which contain a total of 13 files changed, 61 insertions(+), 24 deletions(-). The main changes are: 1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi. 2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger. 3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu. 4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui. 5) Fix range tracking for array poke descriptors, from Daniel Borkmann. 6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson. 7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian. 8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima. 9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-08-26 12:19:09 +01:00
Benjamin Tissoires	b88df69796	bpf: prepare for more bpf syscall to be used from kernel and user space. Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG. Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able to access the bpf pointer either from the userspace or the kernel. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 18:52:05 -07:00
Hao Luo	d4ffb6f39f	bpf: Add CGROUP prefix to cgroup_iter_order bpf_cgroup_iter_order is globally visible but the entries do not have CGROUP prefix. As requested by Andrii, put a CGROUP in the names in bpf_cgroup_iter_order. This patch fixes two previous commits: one introduced the API and the other uses the API in bpf selftest (that is, the selftest cgroup_hierarchical_stats). I tested this patch via the following command: test_progs -t cgroup,iter,btf_dump Fixes: `d4ccaf58a8` ("bpf: Introduce cgroup iter") Fixes: `88886309d2` ("selftests/bpf: add a selftest for cgroup hierarchical stats collection") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>	2022-08-25 16:26:37 -07:00
Jakub Kicinski	880b0dd94f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/mellanox/mlx5/core/en_fs.c `21234e3a84` ("net/mlx5e: Fix use after free in mlx5e_fs_init()") `c7eafc5ed0` ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer") https://lore.kernel.org/all/20220825104410.67d4709c@canb.auug.org.au/ https://lore.kernel.org/all/20220823055533.334471-1-saeed@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-25 16:07:42 -07:00
Daniel Borkmann	a657182a5c	bpf: Don't use tnum_range on array range checking for poke descriptors Hsin-Wei reported a KASAN splat triggered by their BPF runtime fuzzer which is based on a customized syzkaller: BUG: KASAN: slab-out-of-bounds in bpf_int_jit_compile+0x1257/0x13f0 Read of size 8 at addr ffff888004e90b58 by task syz-executor.0/1489 CPU: 1 PID: 1489 Comm: syz-executor.0 Not tainted 5.19.0 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x9c/0xc9 print_address_description.constprop.0+0x1f/0x1f0 ? bpf_int_jit_compile+0x1257/0x13f0 kasan_report.cold+0xeb/0x197 ? kvmalloc_node+0x170/0x200 ? bpf_int_jit_compile+0x1257/0x13f0 bpf_int_jit_compile+0x1257/0x13f0 ? arch_prepare_bpf_dispatcher+0xd0/0xd0 ? rcu_read_lock_sched_held+0x43/0x70 bpf_prog_select_runtime+0x3e8/0x640 ? bpf_obj_name_cpy+0x149/0x1b0 bpf_prog_load+0x102f/0x2220 ? __bpf_prog_put.constprop.0+0x220/0x220 ? find_held_lock+0x2c/0x110 ? __might_fault+0xd6/0x180 ? lock_downgrade+0x6e0/0x6e0 ? lock_is_held_type+0xa6/0x120 ? __might_fault+0x147/0x180 __sys_bpf+0x137b/0x6070 ? bpf_perf_link_attach+0x530/0x530 ? new_sync_read+0x600/0x600 ? __fget_files+0x255/0x450 ? lock_downgrade+0x6e0/0x6e0 ? fput+0x30/0x1a0 ? ksys_write+0x1a8/0x260 __x64_sys_bpf+0x7a/0xc0 ? syscall_enter_from_user_mode+0x21/0x70 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f917c4e2c2d The problem here is that a range of tnum_range(0, map->max_entries - 1) has limited ability to represent the concrete tight range with the tnum as the set of resulting states from value + mask can result in a superset of the actual intended range, and as such a tnum_in(range, reg->var_off) check may yield true when it shouldn't, for example tnum_range(0, 2) would result in 00XX -> v = 0000, m = 0011 such that the intended set of {0, 1, 2} is here represented by a less precise superset of {0, 1, 2, 3}. As the register is known const scalar, really just use the concrete reg->var_off.value for the upper index check. Fixes: `d2e4c1e6c2` ("bpf: Constant map key tracking for prog array pokes") Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 14:58:30 -07:00
Richard Guy Briggs	d4fefa4801	audit: move audit_return_fixup before the filters The success and return_code are needed by the filters. Move audit_return_fixup() before the filters. This was causing syscall auditing events to be missed. Link: https://github.com/linux-audit/audit-kernel/issues/138 Cc: stable@vger.kernel.org Fixes: `12c5e81d3f` ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: manual merge required] Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-25 17:25:08 -04:00
Kumar Kartikeya Dwivedi	2fc31465c5	bpf: Do mark_chain_precision for ARG_CONST_ALLOC_SIZE_OR_ZERO Precision markers need to be propagated whenever we have an ARG_CONST_* style argument, as the verifier cannot consider imprecise scalars to be equivalent for the purposes of states_equal check when such arguments refine the return value (in this case, set mem_size for PTR_TO_MEM). The resultant mem_size for the R0 is derived from the constant value, and if the verifier incorrectly prunes states considering them equivalent where such arguments exist (by seeing that both registers have reg->precise as false in regsafe), we can end up with invalid programs passing the verifier which can do access beyond what should have been the correct mem_size in that explored state. To show a concrete example of the problem: 0000000000000000 <prog>: 0: r2 = (u32 )(r1 + 80) 1: r1 = (u32 )(r1 + 76) 2: r3 = r1 3: r3 += 4 4: if r3 > r2 goto +18 <LBB5_5> 5: w2 = 0 6: (u32 )(r1 + 0) = r2 7: r1 = (u32 )(r1 + 0) 8: r2 = 1 9: if w1 == 0 goto +1 <LBB5_3> 10: r2 = -1 0000000000000058 <LBB5_3>: 11: r1 = 0 ll 13: r3 = 0 14: call bpf_ringbuf_reserve 15: if r0 == 0 goto +7 <LBB5_5> 16: r1 = r0 17: r1 += 16777215 18: w2 = 0 19: (u8 )(r1 + 0) = r2 20: r1 = r0 21: r2 = 0 22: call bpf_ringbuf_submit 00000000000000b8 <LBB5_5>: 23: w0 = 0 24: exit For the first case, the single line execution's exploration will prune the search at insn 14 for the branch insn 9's second leg as it will be verified first using r2 = -1 (UINT_MAX), while as w1 at insn 9 will always be 0 so at runtime we don't get error for being greater than UINT_MAX/4 from bpf_ringbuf_reserve. The verifier during regsafe just sees reg->precise as false for both r2 registers in both states, hence considers them equal for purposes of states_equal. If we propagated precise markers using the backtracking support, we would use the precise marking to then ensure that old r2 (UINT_MAX) was within the new r2 (1) and this would never be true, so the verification would rightfully fail. The end result is that the out of bounds access at instruction 19 would be permitted without this fix. Note that reg->precise is always set to true when user does not have CAP_BPF (or when subprog count is greater than 1 (i.e. use of any static or global functions)), hence this is only a problem when precision marks need to be explicitly propagated (i.e. privileged users with CAP_BPF). A simplified test case has been included in the next patch to prevent future regressions. Fixes: `457f44363a` ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823185300.406-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 12:07:45 -07:00
Yosry Ahmed	a319185be9	cgroup: bpf: enable bpf programs to integrate with rstat Enable bpf programs to make use of rstat to collect cgroup hierarchical stats efficiently: - Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats. - Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats. - Add an empty bpf_rstat_flush() hook that is called during rstat flushing, for bpf progs that flush stats to attach to. Attaching a bpf prog to this hook effectively registers it as a flush callback. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Hao Luo	d4ccaf58a8	bpf: Introduce cgroup iter Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes: - walking a cgroup's descendants in pre-order. - walking a cgroup's descendants in post-order. - walking a cgroup's ancestors. - process only the given cgroup. When attaching cgroup_iter, one can set a cgroup to the iter_link created from attaching. This cgroup is passed as a file descriptor or cgroup id and serves as the starting point of the walk. If no cgroup is specified, the starting point will be the root cgroup v2. For walking descendants, one can specify the order: either pre-order or post-order. For walking ancestors, the walk starts at the specified cgroup and ends at the root. One can also terminate the walk early by returning 1 from the iter program. Note that because walking cgroup hierarchy holds cgroup_mutex, the iter program is called with cgroup_mutex held. Currently only one session is supported, which means, depending on the volume of data bpf program intends to send to user space, the number of cgroups that can be walked is limited. For example, given the current buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can be walked is 512. This is a limitation of cgroup_iter. If the output data is larger than the kernel buffer size, after all data in the kernel buffer is consumed by user space, the subsequent read() syscall will signal EOPNOTSUPP. In order to work around, the user may have to update their program to reduce the volume of data sent to output. For example, skip some uninteresting cgroups. In future, we may extend bpf_iter flags to allow customizing buffer size. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Linus Torvalds	3f5c20055a	`4f7e723643` ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") in the previous fix pull required cgroup core to grab cpus_read_lock() before invoking ->attach(). Unfortunately, it missed adding cpus_read_lock() in cgroup_attach_task_all(). Fix it. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYwe0GA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGee0AP9jrsUgnmis/PzqyyPlkD95rRSDyyUNjMjfHnJe HW+YbgD/XcEo1eJvijqP1g/ZJhRKQl6vA1JSMgnL9obc3wNpGg8= =7LzT -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull another cgroup fix from Tejun Heo: "Commit `4f7e723643` ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") required the cgroup core to grab cpus_read_lock() before invoking ->attach(). Unfortunately, it missed adding cpus_read_lock() in cgroup_attach_task_all(). Fix it" * tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()	2022-08-25 10:52:16 -07:00
Tetsuo Handa	43626dade3	cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all() syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at cpuset_attach() [1], for commit `4f7e723643` ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that cpuset_attach() is also called from cgroup_attach_task_all(). Add cpus_read_lock() like what cgroup_procs_write_start() does. Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1] Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: `4f7e723643` ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-25 07:36:30 -10:00
Kumar Kartikeya Dwivedi	9d9d00ac29	bpf: Fix reference state management for synchronous callbacks Currently, verifier verifies callback functions (sync and async) as if they will be executed once, (i.e. it explores execution state as if the function was being called once). The next insn to explore is set to start of subprog and the exit from nested frame is handled using curframe > 0 and prepare_func_exit. In case of async callback it uses a customized variant of push_stack simulating a kind of branch to set up custom state and execution context for the async callback. While this approach is simple and works when callback really will be executed only once, it is unsafe for all of our current helpers which are for_each style, i.e. they execute the callback multiple times. A callback releasing acquired references of the caller may do so multiple times, but currently verifier sees it as one call inside the frame, which then returns to caller. Hence, it thinks it released some reference that the cb e.g. got access through callback_ctx (register filled inside cb from spilled typed register on stack). Similarly, it may see that an acquire call is unpaired inside the callback, so the caller will copy the reference state of callback and then will have to release the register with new ref_obj_ids. But again, the callback may execute multiple times, but the verifier will only account for acquired references for a single symbolic execution of the callback, which will cause leaks. Note that for async callback case, things are different. While currently we have bpf_timer_set_callback which only executes it once, even for multiple executions it would be safe, as reference state is NULL and check_reference_leak would force program to release state before BPF_EXIT. The state is also unaffected by analysis for the caller frame. Hence async callback is safe. Since we want the reference state to be accessible, e.g. for pointers loaded from stack through callback_ctx's PTR_TO_STACK, we still have to copy caller's reference_state to callback's bpf_func_state, but we enforce that whatever references it adds to that reference_state has been released before it hits BPF_EXIT. This requires introducing a new callback_ref member in the reference state to distinguish between caller vs callee references. Hence, check_reference_leak now errors out if it sees we are in callback_fn and we have not released callback_ref refs. Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2 etc. we need to also distinguish between whether this particular ref belongs to this callback frame or parent, and only error for our own, so we store state->frameno (which is always non-zero for callbacks). In short, callbacks can read parent reference_state, but cannot mutate it, to be able to use pointers acquired by the caller. They must only undo their changes (by releasing their own acquired_refs before BPF_EXIT) on top of caller reference_state before returning (at which point the caller and callback state will match anyway, so no need to copy it back to caller). Fixes: `69c087ba62` ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 17:54:08 -07:00
Linus Torvalds	a86766c49e	Tracing: Fix for 6.0-rc2 - Fix build warning for when MODULES and FTRACE_WITH_DIRECT_CALLS are not set. A warning happens with ops_references_rec() defined but not used. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwTkGhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qtUTAP4tOmf0I0c+GsWTzpecvv7fa+9rxmZa SfBuoPqzC/TBqAEArqaf91+57aehCrJC3X5HaE7OJisW9nd2Epnvrpxk4QY= =0yZV -----END PGP SIGNATURE----- Merge tag 'trace-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: - Fix build warning for when MODULES and FTRACE_WITH_DIRECT_CALLS are not set. A warning happens with ops_references_rec() defined but not used. * tag 'trace-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix build warning for ops_references_rec() not used	2022-08-24 10:43:34 -07:00
Linus Torvalds	c40e8341e3	cgroup fixes for v6.0-rc2 Contains fixes for the following issues: * psi data structure was changed to be allocated dynamically but it wasn't being cleared leading to reporting garbage values and triggering spurious oom kills. * A deadlock involving cpuset and cpu hotplug. * When a controller is moved across cgroup hierarchies, css->rstat_css_node didn't get RCU drained properly from the previous list. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYwVmRg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGb/eAP44dr9/OQtapKm63H/qmLF39LWE6nC99RYHECl5 ncuZvwD/XIkZt212nr/qC1C0ggB5qCGG7tIZG6tIgkS+J5huqg4= =CC/Y -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - The psi data structure was changed to be allocated dynamically but it wasn't being cleared leading to it reporting garbage values and triggering spurious oom kills. - A deadlock involving cpuset and cpu hotplug. - When a controller is moved across cgroup hierarchies, css->rstat_css_node didn't get RCU drained properly from the previous list. * tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Fix race condition at rebind_subsystems() cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS sched/psi: Remove unused parameter nbytes of psi_trigger_create() sched/psi: Zero the memory of struct psi_group	2022-08-23 19:33:28 -07:00
Linus Torvalds	072c92b1b1	audit/stable-6.0 PR 20220823 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmMFJkQUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXONqQ/+L06aobl3pPbFlatTW0YXgmYKxXBb Vf329u1P1pcmHYuUf/c4pCGxxbMbbEHwsmdtp6YKLnh97gP0GYUNTNI9WMbU7he9 ZNrBe1gUhUHNi0ZL1OPVxokfeV2UK+hsyGQuR1wXHwjTTbONsghGCvdy1LEw4DMe dGRPNkxzoKJ5K7SnScplUhSBAoVtLLBQB1+HKd5mILV22TTWWzTwcde0RSIkAX1s /VM4P77DSEw5DX4fYgIt85yHZ/c8MUUyECFkALph/VUkkLvEWrISTXIzoOdJXfJO Ock88Gz2HAj3L+4b0CL0zW67sERks1H5udmhtI+ymGObklMjfJh88QL44K+n8f9w 2ap9Hlgl4b2TjE2KK313ixX6Om7xxvH62IAMr0/x8y5tk+qZTNvbQsT8TpkRXxtt vHxp9x4qqeRL2Si/5A6rvyKvEaZI26hOmNTQIKzJvUIs5tyV0UySABgOheVp42PS VeF9/lUc7XOXI9CffhSw636I0WZYmp0bSIcDvRpeTuxobCG6SpOup+ODPoRdA+0A 8jdwQlJWO36H0qQnqrN/dfDKqcfUu2epKFrP46TxEtC60WztLnS1Nw1ZLFhcEIUw MiKEirj1PjVYmES4/aW8l3OPQqMcYHXp5Lj3qBrcOEpi3tPIzOzYwnmdfASxfIvE QXE2QHEBjtx8lHE= =j0DY -----END PGP SIGNATURE----- Merge tag 'audit-pr-20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "A single fix for a potential double-free on a fsnotify error path" * tag 'audit-pr-20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix potential double free on error path from fsnotify_add_inode_mark	2022-08-23 19:26:48 -07:00
Kumar Kartikeya Dwivedi	5679ff2f13	bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF They would require func_info which needs prog BTF anyway. Loading BTF and setting the prog btf_fd while loading the prog indirectly requires CAP_BPF, so just to reduce confusion, move both these helpers taking callback under bpf_capable() protection as well, since they cannot be used without CAP_BPF. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013117.24916-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:21:59 -07:00
Stanislav Fomichev	8a67f2de9b	bpf: expose bpf_strtol and bpf_strtoul to all program types bpf_strncmp is already exposed everywhere. The motivation is to keep those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move them under kernel/bpf/cgroup.c because they are currently only used by sysctl prog types. Suggested-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Stanislav Fomichev	bed89185af	bpf: Use cgroup_{common,current}_func_proto in more hooks The following hooks are per-cgroup hooks but they are not using cgroup_{common,current}_func_proto, fix it: * BPF_PROG_TYPE_CGROUP_SKB (cg_skb) * BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr) * BPF_PROG_TYPE_CGROUP_SOCK (cg_sock) * BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP Also: * move common func_proto's into cgroup func_proto handlers * make sure bpf_{g,s}et_retval are not accessible from recvmsg, getpeername and getsockname (return/errno is ignored in these places) * as a side effect, expose get_current_pid_tgid, get_current_comm_proto, get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup hooks Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Stanislav Fomichev	dea6a4e170	bpf: Introduce cgroup_{common,current}_func_proto Split cgroup_base_func_proto into the following: * cgroup_common_func_proto - common helpers for all cgroup hooks * cgroup_current_func_proto - common helpers for all cgroup hooks running in the process context (== have meaningful 'current'). Move bpf_{g,s}et_retval and other cgroup-related helpers into kernel/bpf/cgroup.c so they closer to where they are being used. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Kuniyuki Iwashima	0947ae1121	bpf: Fix a data-race around bpf_jit_limit. While reading bpf_jit_limit, it can be changed concurrently via sysctl, WRITE_ONCE() in __do_proc_doulongvec_minmax(). The size of bpf_jit_limit is long, so we need to add a paired READ_ONCE() to avoid load-tearing. Fixes: `ede95a63b5` ("bpf: add bpf_jit_limit knob to restrict unpriv allocations") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220823215804.2177-1-kuniyu@amazon.com	2022-08-24 00:27:14 +02:00
Linus Torvalds	95607ad99b	Thirteen fixes, almost all for MM. Seven of these are cc:stable and the remainder fix up the changes which went into this -rc cycle. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwQZcgAKCRDdBJ7gKXxA jnCxAQCk8L6PPm0L2KvKr5Vu3M/T0o9SvfxfM5yho80zM68fHQD/eLxz+nd3m+N5 K7Mdbcb2u6F46qQaS+S5RialEWKpsw8= =WtBo -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen fixes, almost all for MM. Seven of these are cc:stable and the remainder fix up the changes which went into this -rc cycle" * tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kprobes: don't call disarm_kprobe() for disabled kprobes mm/shmem: shmem_replace_page() remember NR_SHMEM mm/shmem: tmpfs fallocate use file_modified() mm/shmem: fix chattr fsflags support in tmpfs mm/hugetlb: support write-faults in shared mappings mm/hugetlb: fix hugetlb not supporting softdirty tracking mm/uffd: reset write protection when unregister with wp-mode mm/smaps: don't access young/dirty bit if pte unpresent mm: add DEVICE_ZONE to FOR_ALL_ZONES kernel/sys_ni: add compat entry for fadvise64_64 mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Revert "zram: remove double compression logic" get_maintainer: add Alan to .get_maintainer.ignore	2022-08-23 13:33:08 -07:00
Linus Torvalds	6234806f8c	linux-kselftest-kunit-fixes-6.0-rc3 This KUnit fixes update for Linux 6.0-rc3 consists of fixes to mmc test and fix to load .kunit_test_suites section when CONFIG_KUNIT=m, and not just when KUnit is built-in. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmMD/D8ACgkQCwJExA0N QxzK8A/7B5VP2K2BYVeXZqK8E7Y+GuFAQc0oD+b9Sgfvv2QkeRO8xAsd19DmvIsi MscFjIhScqoU50nrVwxOEiU5Wzeg+BuG8ghPWIM8mo/heeuD27UNE1uEOv7jlOZp hlKD9SqCf2HV3YLre4I83fwrK6hilQT4R55yQeXsl/EWvRCwBb1axBp2NCt5Jh5Q PUK066pNy3KozumKTL65RLzCKwkoxqNZ5+XTa105heh9lhPqrRZ9+kR+eVfskPI2 q8F3NcCpGYV+YcHa899MMS4R98nDpB9GYK/sbJxVAIeWxcUT+9fZIJnr4oYi6z0N sZDoOsg864R29JUt/rhZkisOuOMjJ94vLVJY62dTVskEVGK6YiARaIfwI39sMWoH 4ATYCcQW50WLZDC/zc0X9Cm2Bp4Dv/WRL2xTWYpH2P5caxRcnoVm89ggMXsTHE8U QApjQW8e9STn+8vzio4KdityIOZ9EsQNpsSilq/Zq2iL2B0ZOIXWSf+JwplAjJdA Or9N7EzIZU9PSxdxE9Xwjq0f/bvDC+DH8h9/X3Sy0WeLuA9KUlgVSbFLwSp8jz5Q CwoWPase+BAWrAXMEZSqnjrgwIfa1FOXy5jcWNaJWsPVNKirmW0j0n9K1Zj4QFCC lYj/H+FgkYXIE6GQRNuHEbnNmExWEDdEdolr+vi+p8Xs3PVyVls= =HQ19 -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fixes from Shuah Khan: "Fix for a mmc test and to load .kunit_test_suites section when CONFIG_KUNIT=m, and not just when KUnit is built-in" * tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m mmc: sdhci-of-aspeed: test: Fix dependencies when KUNIT=m	2022-08-23 13:23:07 -07:00
Jing-Ting Wu	763f4fb76e	cgroup: Fix race condition at rebind_subsystems() Root cause: The rebind_subsystems() is no lock held when move css object from A list to B list,then let B's head be treated as css node at list_for_each_entry_rcu(). Solution: Add grace period before invalidating the removed rstat_css_node. Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/ Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Fixes: `a7df69b81a` ("cgroup: rstat: support cgroup1") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-23 08:11:06 -10:00
Lukasz Luba	6d5afdc97e	cpufreq: schedutil: Move max CPU capacity to sugov_policy There is no need to keep the max CPU capacity in the per_cpu instance. Furthermore, there is no need to check and update that variable (sg_cpu->max) every time in the frequency change request, which is part of hot path. Instead use struct sugov_policy to store that information. Initialize the max CPU capacity during the setup and start callback. We can do that since all CPUs in the same frequency domain have the same max capacity (capacity setup and thermal pressure are based on that). Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2022-08-23 20:03:33 +02:00
Chengming Zhou	e4fe074d6c	sched/fair: Don't init util/runnable_avg for !fair task post_init_entity_util_avg() init task util_avg according to the cpu util_avg at the time of fork, which will decay when switched_to_fair() some time later, we'd better to not set them at all in the case of !fair task. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-10-zhouchengming@bytedance.com	2022-08-23 11:01:20 +02:00
Chengming Zhou	d6531ab6e5	sched/fair: Move task sched_avg attach to enqueue_task_fair() When wake_up_new_task(), we use post_init_entity_util_avg() to init util_avg/runnable_avg based on cpu's util_avg at that time, and attach task sched_avg to cfs_rq. Since enqueue_task_fair() -> enqueue_entity() -> update_load_avg() loop will do attach, we can move this work to update_load_avg(). wake_up_new_task(p) post_init_entity_util_avg(p) attach_entity_cfs_rq() --> (1) activate_task(rq, p) enqueue_task() := enqueue_task_fair() enqueue_entity() loop update_load_avg(cfs_rq, se, UPDATE_TG \| DO_ATTACH) if (!se->avg.last_update_time && (flags & DO_ATTACH)) attach_entity_load_avg() --> (2) This patch move attach from (1) to (2), update related comments too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-9-zhouchengming@bytedance.com	2022-08-23 11:01:19 +02:00
Chengming Zhou	df16b71c68	sched/fair: Allow changing cgroup of new forked task commit `7dc603c902` ("sched/fair: Fix PELT integrity for new tasks") introduce a TASK_NEW state and an unnessary limitation that would fail when changing cgroup of new forked task. Because at that time, we can't handle task_change_group_fair() for new forked fair task which hasn't been woken up by wake_up_new_task(), which will cause detach on an unattached task sched_avg problem. This patch delete this unnessary limitation by adding check before do detach or attach in task_change_group_fair(). So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks, only define it in #ifdef CONFIG_RT_GROUP_SCHED. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com	2022-08-23 11:01:19 +02:00
Chengming Zhou	7e2edaf618	sched/fair: Fix another detach on unattached task corner case commit `7dc603c902` ("sched/fair: Fix PELT integrity for new tasks") fixed two load tracking problems for new task, including detach on unattached new task problem. There still left another detach on unattached task problem for the task which has been woken up by try_to_wake_up() and waiting for actually being woken up by sched_ttwu_pending(). try_to_wake_up(p) cpu = select_task_rq(p) if (task_cpu(p) != cpu) set_task_cpu(p, cpu) migrate_task_rq_fair() remove_entity_load_avg() --> unattached se->avg.last_update_time = 0; __set_task_cpu() ttwu_queue(p, cpu) ttwu_queue_wakelist() __ttwu_queue_wakelist() task_change_group_fair() detach_task_cfs_rq() detach_entity_cfs_rq() detach_entity_load_avg() --> detach on unattached task set_task_rq() attach_task_cfs_rq() attach_entity_cfs_rq() attach_entity_load_avg() The reason of this problem is similar, we should check in detach_entity_cfs_rq() that se->avg.last_update_time != 0, before do detach_entity_load_avg(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com	2022-08-23 11:01:19 +02:00
Chengming Zhou	e1f078f504	sched/fair: Combine detach into dequeue when migrating task When we are migrating task out of the CPU, we can combine detach and propagation into dequeue_entity() to save the detach_entity_cfs_rq() in migrate_task_rq_fair(). This optimization is like combining DO_ATTACH in the enqueue_entity() when migrating task to the CPU. So we don't have to traverse the CFS tree extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(), which wouldn't be called anymore with this patch's change. detach_task() deactivate_task() dequeue_task_fair() for_each_sched_entity(se) dequeue_entity() update_load_avg() /* (1) / detach_entity_load_avg() set_task_cpu() migrate_task_rq_fair() detach_entity_cfs_rq() / (2) */ update_load_avg(); detach_entity_load_avg(); propagate_entity_cfs_rq(); for_each_sched_entity() update_load_avg() This patch save the detach_entity_cfs_rq() called in (2) by doing the detach_entity_load_avg() for a CPU migrating task inside (1) (the task being the first se in the loop) Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com	2022-08-23 11:01:18 +02:00
Chengming Zhou	859f206290	sched/fair: Update comments in enqueue/dequeue_entity() When reading the sched_avg related code, I found the comments in enqueue/dequeue_entity() are not updated with the current code. We don't add/subtract entity's runnable_avg from cfs_rq->runnable_avg during enqueue/dequeue_entity(), those are done only for attach/detach. This patch updates the comments to reflect the current code working. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-5-zhouchengming@bytedance.com	2022-08-23 11:01:18 +02:00
Chengming Zhou	5d6da83c44	sched/fair: Reset sched_avg last_update_time before set_task_rq() set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked task's sched_avg when migrate, which is not needed for already detached task. task_change_group_fair() will detached the task sched_avg from prev cfs_rq first, so reset sched_avg last_update_time before set_task_rq() to avoid that. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-4-zhouchengming@bytedance.com	2022-08-23 11:01:18 +02:00
Chengming Zhou	39c4261191	sched/fair: Remove redundant cpu_cgrp_subsys->fork() We use cpu_cgrp_subsys->fork() to set task group for the new fair task in cgroup_post_fork(). Since commit `b1e8206582` ("sched: Fix yet more sched_fork() races") has already set_task_rq() for the new fair task in sched_cgroup_fork(), so cpu_cgrp_subsys->fork() can be removed. cgroup_can_fork() --> pin parent's sched_task_group sched_cgroup_fork() __set_task_cpu() set_task_rq() cgroup_post_fork() ss->fork() := cpu_cgroup_fork() sched_change_group(..., TASK_SET_GROUP) task_set_group_fair() set_task_rq() --> can be removed After this patch's change, task_change_group_fair() only need to care about task cgroup migration, make the code much simplier. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com	2022-08-23 11:01:17 +02:00
Chengming Zhou	78b6b15770	sched/fair: Maintain task se depth in set_task_rq() Previously we only maintain task se depth in task_move_group_fair(), if a !fair task change task group, its se depth will not be updated, so commit `eb7a59b2c8` ("sched/fair: Reset se-depth when task switched to FAIR") fix the problem by updating se depth in switched_to_fair() too. Then commit `daa59407b5` ("sched/fair: Unify switched_{from,to}_fair() and task_move_group_fair()") unified these two functions, moved se.depth setting to attach_task_cfs_rq(), which further into attach_entity_cfs_rq() with commit `df217913e7` ("sched/fair: Factorize attach/detach entity"). This patch move task se depth maintenance from attach_entity_cfs_rq() to set_task_rq(), which will be called when CPU/cgroup change, so its depth will always be correct. This patch is preparation for the next patch. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-2-zhouchengming@bytedance.com	2022-08-23 11:01:17 +02:00
Gaosheng Cui	ad982c3be4	audit: fix potential double free on error path from fsnotify_add_inode_mark Audit_alloc_mark() assign pathname to audit_mark->path, on error path from fsnotify_add_inode_mark(), fsnotify_put_mark will free memory of audit_mark->path, but the caller of audit_alloc_mark will free the pathname again, so there will be double free problem. Fix this by resetting audit_mark->path to NULL pointer on error path from fsnotify_add_inode_mark(). Cc: stable@vger.kernel.org Fixes: `7b12932340` ("fsnotify: Add group pointer in fsnotify_init_mark()") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-22 18:50:06 -04:00
Wang Jingjin	123d645577	ftrace: Fix build warning for ops_references_rec() not used The change that made IPMODIFY and DIRECT ops work together needed access to the ops_references_ip() function, which it pulled out of the module only code. But now if both CONFIG_MODULES and CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not set, we get the below warning: ‘ops_references_rec’ defined but not used. Since ops_references_rec() only calls ops_references_ip() replace the usage of ops_references_rec() with ops_references_ip() and encompass the function with an #ifdef of DIRECT_CALLS \|\| MODULES being defined. Link: https://lkml.kernel.org/r/20220801084745.1187987-1-wangjingjin1@huawei.com Fixes: `53cd885bc5` ("ftrace: Allow IPMODIFY and DIRECT ops on the same function") Signed-off-by: Wang Jingjin <wangjingjin1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-22 09:41:12 -04:00
Linus Torvalds	7fb312d225	Various fixes for tracing: - Fix a return value of traceprobe_parse_event_name() - Fix NULL pointer dereference from failed ftrace enabling - Fix NULL pointer dereference when asking for registers from eprobes - Make eprobes consistent with kprobes/uprobes, filters and histograms -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwKRrhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qosDAP9WySmPxjoMfR0hbjmnepLy2zJtBbIq ABWR3LDrjvLlYwD9H/wrD+6ctOZtXh5XJc0Vn5z6XEyNtqrVGSse7Lm+sg4= =qb/R -----END PGP SIGNATURE----- Merge tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various fixes for tracing: - Fix a return value of traceprobe_parse_event_name() - Fix NULL pointer dereference from failed ftrace enabling - Fix NULL pointer dereference when asking for registers from eprobes - Make eprobes consistent with kprobes/uprobes, filters and histograms" * tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have filter accept "common_cpu" to be consistent tracing/probes: Have kprobes and uprobes use $COMM too tracing/eprobes: Have event probes be consistent with kprobes and uprobes tracing/eprobes: Fix reading of string fields tracing/eprobes: Do not hardcode $comm as a string tracing/eprobes: Do not allow eprobes to use $stack, or % for regs ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead tracing/perf: Fix double put of trace event when init fails tracing: React to error return from traceprobe_parse_event_name()	2022-08-21 14:49:42 -07:00
Steven Rostedt (Google)	b2380577d4	tracing: Have filter accept "common_cpu" to be consistent Make filtering consistent with histograms. As "cpu" can be a field of an event, allow for "common_cpu" to keep it from being confused with the "cpu" field of the event. Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/ Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `1e3bac71c5` ("tracing/histogram: Rename "cpu" to "common_cpu"") Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)	ab8384442e	tracing/probes: Have kprobes and uprobes use $COMM too Both $comm and $COMM can be used to get current->comm in eprobes and the filtering and histogram logic. Make kprobes and uprobes consistent in this regard and allow both $comm and $COMM as well. Currently kprobes and uprobes only handle $comm, which is inconsistent with the other utilities, and can be confusing to users. Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/ Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `533059281e` ("tracing: probeevent: Introduce new argument fetching code") Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)	6a832ec3d6	tracing/eprobes: Have event probes be consistent with kprobes and uprobes Currently, if a symbol "@" is attempted to be used with an event probe (eprobes), it will cause a NULL pointer dereference crash. Both kprobes and uprobes can reference data other than the main registers. Such as immediate address, symbols and the current task name. Have eprobes do the same thing. For "comm", if "comm" is used and the event being attached to does not have the "comm" field, then make it the "$comm" that kprobes has. This is consistent to the way histograms and filters work. Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `7491e2c442` ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)	f04dec9346	tracing/eprobes: Fix reading of string fields Currently when an event probe (eprobe) hooks to a string field, it does not display it as a string, but instead as a number. This makes the field rather useless. Handle the different kinds of strings, dynamic, static, relational/dynamic etc. Now when a string field is used, the ":string" type can be used to display it: echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `7491e2c442` ("tracing: Add a probe that attaches to trace events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)	02333de90e	tracing/eprobes: Do not hardcode $comm as a string The variable $comm is hard coded as a string, which is true for both kprobes and uprobes, but for event probes (eprobes) it is a field name. In most cases the "comm" field would be a string, but there's no guarantee of that fact. Do not assume that comm is a string. Not to mention, it currently forces comm fields to fault, as string processing for event probes is currently broken. Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `7491e2c442` ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)	2673c60ee6	tracing/eprobes: Do not allow eprobes to use $stack, or % for regs While playing with event probes (eprobes), I tried to see what would happen if I attempted to retrieve the instruction pointer (%rip) knowing that event probes do not use pt_regs. The result was: BUG: kernel NULL pointer dereference, address: 0000000000000024 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:get_event_field.isra.0+0x0/0x50 Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8 50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24 8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74 RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086 RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000 RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8 R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff916c9ea40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0 Call Trace: <TASK> get_eprobe_size+0xb4/0x640 ? __mod_node_page_state+0x72/0xc0 __eprobe_trace_func+0x59/0x1a0 ? __mod_lruvec_page_state+0xaa/0x1b0 ? page_remove_file_rmap+0x14/0x230 ? page_remove_rmap+0xda/0x170 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x18f/0x240 trace_event_raw_event_sched_wakeup_template+0x7a/0xb0 try_to_wake_up+0x260/0x4c0 __wake_up_common+0x80/0x180 __wake_up_common_lock+0x7c/0xc0 do_notify_parent+0x1c9/0x2a0 exit_notify+0x1a9/0x220 do_exit+0x2ba/0x450 do_group_exit+0x2d/0x90 __x64_sys_exit_group+0x14/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Obviously this is not the desired result. Move the testing for TPARG_FL_TPOINT which is only used for event probes to the top of the "$" variable check, as all the other variables are not used for event probes. Also add a check in the register parsing "%" to fail if an event probe is used. Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: `7491e2c442` ("tracing: Add a probe that attaches to trace events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:08 -04:00
Yang Jihong	c3b0f72e80	ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead ftrace_startup does not remove ops from ftrace_ops_list when ftrace_startup_enable fails: register_ftrace_function ftrace_startup __register_ftrace_function ... add_ftrace_ops(&ftrace_ops_list, ops) ... ... ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1 ... return 0 // ops is in the ftrace_ops_list. When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything: unregister_ftrace_function ftrace_shutdown if (unlikely(ftrace_disabled)) return -ENODEV; // return here, __unregister_ftrace_function is not executed, // as a result, ops is still in the ftrace_ops_list __unregister_ftrace_function ... If ops is dynamically allocated, it will be free later, in this case, is_ftrace_trampoline accesses NULL pointer: is_ftrace_trampoline ftrace_ops_trampoline do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL! Syzkaller reports as follows: [ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b [ 1203.508039] #PF: supervisor read access in kernel mode [ 1203.508798] #PF: error_code(0x0000) - not-present page [ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0 [ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI [ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8 [ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0 [ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00 [ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246 [ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866 [ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b [ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07 [ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399 [ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008 [ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000 [ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0 [ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Therefore, when ftrace_startup_enable fails, we need to rollback registration process and remove ops from ftrace_ops_list. Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:07 -04:00
Steven Rostedt (Google)	7249921d94	tracing/perf: Fix double put of trace event when init fails If in perf_trace_event_init(), the perf_trace_event_open() fails, then it will call perf_trace_event_unreg() which will not only unregister the perf trace event, but will also call the put() function of the tp_event. The problem here is that the trace_event_try_get_ref() is called by the caller of perf_trace_event_init() and if perf_trace_event_init() returns a failure, it will then call trace_event_put(). But since the perf_trace_event_unreg() already called the trace_event_put() function, it triggers a WARN_ON(). WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20 If perf_trace_event_reg() does not call the trace_event_try_get_ref() then the perf_trace_event_unreg() should not be calling trace_event_put(). This breaks symmetry and causes bugs like these. Pull out the trace_event_put() from perf_trace_event_unreg() and call it in the locations that perf_trace_event_unreg() is called. This not only fixes this bug, but also brings back the proper symmetry of the reg/unreg vs get/put logic. Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/ Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home Cc: stable@vger.kernel.org Fixes: `1d18538e6a` ("tracing: Have dynamic events have a ref counter") Reported-by: Krister Johansen <kjlx@templeofstupid.com> Reviewed-by: Krister Johansen <kjlx@templeofstupid.com> Tested-by: Krister Johansen <kjlx@templeofstupid.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:07 -04:00
Lukas Bulwahn	d8a64313c1	tracing: React to error return from traceprobe_parse_event_name() The function traceprobe_parse_event_name() may set the first two function arguments to a non-null value and still return -EINVAL to indicate an unsuccessful completion of the function. Hence, it is not sufficient to just check the result of the two function arguments for being not null, but the return value also needs to be checked. Commit `95c104c378` ("tracing: Auto generate event name when creating a group of events") changed the error-return-value checking of the second traceprobe_parse_event_name() invocation in __trace_eprobe_create() and removed checking the return value to jump to the error handling case. Reinstate using the return value in the error-return-value checking. Link: https://lkml.kernel.org/r/20220811071734.20700-1-lukas.bulwahn@gmail.com Fixes: `95c104c378` ("tracing: Auto generate event name when creating a group of events") Acked-by: Linyu Yuan <quic_linyyuan@quicinc.com> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-21 15:56:07 -04:00
Kuniyuki Iwashima	9c80e79906	kprobes: don't call disarm_kprobe() for disabled kprobes The assumption in __disable_kprobe() is wrong, and it could try to disarm an already disarmed kprobe and fire the WARN_ONCE() below. [0] We can easily reproduce this issue. 1. Write 0 to /sys/kernel/debug/kprobes/enabled. # echo 0 > /sys/kernel/debug/kprobes/enabled 2. Run execsnoop. At this time, one kprobe is disabled. # /usr/share/bcc/tools/execsnoop & [1] 2460 PCOMM PID PPID RET ARGS # cat /sys/kernel/debug/kprobes/list ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE] 3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes kprobes_all_disarmed to false but does not arm the disabled kprobe. # echo 1 > /sys/kernel/debug/kprobes/enabled # cat /sys/kernel/debug/kprobes/list ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE] 4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace(). # fg /usr/share/bcc/tools/execsnoop ^C Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses some cleanups and leaves the aggregated kprobe in the hash table. Then, __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an infinite loop like this. aggregated kprobe.list -> kprobe.list -. ^ \| '.__.' In this situation, these commands fall into the infinite loop and result in RCU stall or soft lockup. cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the infinite loop with RCU. /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex, and __get_valid_kprobe() is stuck in the loop. To avoid the issue, make sure we don't call disarm_kprobe() for disabled kprobes. [0] Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2) WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) Modules linked in: ena CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28 Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017 RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94 RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001 RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40 R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000 FS: 00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> __disable_kprobe (kernel/kprobes.c:1716) disable_kprobe (kernel/kprobes.c:2392) __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340) disable_trace_kprobe (kernel/trace/trace_kprobe.c:429) perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168) perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295) _free_event (kernel/events/core.c:4971) perf_event_release_kernel (kernel/events/core.c:5176) perf_release (kernel/events/core.c:5186) __fput (fs/file_table.c:321) task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1)) exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201) syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7fe7ff210654 Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654 RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008 RBP: 0000000000000000 R08: 94ae31d6fda838a4 R0900007fe8001c9d30 R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600 R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560 </TASK> Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com Fixes: `69d54b916d` ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reported-by: Ayushman Dutta <ayudutta@amazon.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: Ayushman Dutta <ayudutta@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-08-20 15:17:46 -07:00
Randy Dunlap	a8faed3a02	kernel/sys_ni: add compat entry for fadvise64_64 When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is set/enabled, the riscv compat_syscall_table references 'compat_sys_fadvise64_64', which is not defined: riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8): undefined reference to `compat_sys_fadvise64_64' Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function available. Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org Fixes: `d3ac21cacc` ("mm: Support compiling out madvise and fadvise") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-08-20 15:17:45 -07:00
Namhyung Kim	501f7f69bc	locking: Add __lockfunc to slow path functions So that we can skip the functions in the perf lock contention and other places like /proc/PID/wchan. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20220810220346.1919485-1-namhyung@kernel.org	2022-08-19 19:47:51 +02:00
Jakub Kicinski	268603d79c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-18 21:17:10 -07:00
Linus Torvalds	4c2d0b039c	Including fixes from netfilter. Current release - regressions: - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF socket maps get data out of the TCP stack) - tls: rx: react to strparser initialization errors - netfilter: nf_tables: fix scheduling-while-atomic splat - net: fix suspicious RCU usage in bpf_sk_reuseport_detach() Current release - new code bugs: - mlxsw: ptp: fix a couple of races, static checker warnings and error handling Previous releases - regressions: - netfilter: - nf_tables: fix possible module reference underflow in error path - make conntrack helpers deal with BIG TCP (skbs > 64kB) - nfnetlink: re-enable conntrack expectation events - net: fix potential refcount leak in ndisc_router_discovery() Previous releases - always broken: - sched: cls_route: disallow handle of 0 - neigh: fix possible local DoS due to net iface start/stop loop - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu - virtio_net: fix endian-ness for RSS - dsa: mv88e6060: prevent crash on an unused port - fec: fix timer capture timing in `fec_ptp_enable_pps()` - ocelot: stats: fix races, integer wrapping and reading incorrect registers (the change of register definitions here accounts for bulk of the changed LoC in this PR) Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL+lGYACgkQMUZtbf5S IrunKw/+OfV68qJ2C+zg/qPgZg5XAD/v+3WuQo9Vsj4Z+dmxelyQkKqok61xLc6t eXr8v3/stDM1/zxHqCc0zJZMGhOug4RLS6kfVVwNbo6XaceTJlKcFTgM1bjQgLyT pMlet2JMhzpmWkMma2oztsG4zQaWSITCCjgLJByUmeO8+zKXDMojc1eew2bH8ueo KzZjIys+lHdEIo2uhGEU3OdhqnFn2zdVGVxcmtgtV3N9rIobnHiJdVwqLlTgnTvQ nU5ZoYUM4h1AG7gKSXsDbM0CPH3s4xavpkA3rMB1x4ahfxNd3y6WmpVt9qjE5wME 8HbzutQ+x7Xf2XAQBBZma/KjmLW0GCHlQhRT+RHBryk21Yizb04HqXNMB1sPFZe6 uDAvSZjZqPX+3aMznLTzz1T+F1TJygoeVNQ2tlxHkMuPrfS9g3T+jiohGnELF8+K /A3g7oCQin/qiMk35JXBuhGk4RqjyPsITOwAZ2OycHZWD/U5xd1OlkKPGUoUAg+m y+7XswZZJ/uBw+U+16AMMzg8vxCmoBHbgYGvnw0+96wpv4yVqTW26Wtzv01gjZPp wZuJkd+sHZLBNP5RkBC0PQj5rfcUj+4PUTXtW+57z+XM0HcmcqsXZHLXpMr4rS0b EnSsuDlfp9SWwfpMld75v/LA19a6opi6novjY4Nds3+t22ffEHY= =ednY -----END PGP SIGNATURE----- Merge tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter. Current release - regressions: - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF socket maps get data out of the TCP stack) - tls: rx: react to strparser initialization errors - netfilter: nf_tables: fix scheduling-while-atomic splat - net: fix suspicious RCU usage in bpf_sk_reuseport_detach() Current release - new code bugs: - mlxsw: ptp: fix a couple of races, static checker warnings and error handling Previous releases - regressions: - netfilter: - nf_tables: fix possible module reference underflow in error path - make conntrack helpers deal with BIG TCP (skbs > 64kB) - nfnetlink: re-enable conntrack expectation events - net: fix potential refcount leak in ndisc_router_discovery() Previous releases - always broken: - sched: cls_route: disallow handle of 0 - neigh: fix possible local DoS due to net iface start/stop loop - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu - virtio_net: fix endian-ness for RSS - dsa: mv88e6060: prevent crash on an unused port - fec: fix timer capture timing in `fec_ptp_enable_pps()` - ocelot: stats: fix races, integer wrapping and reading incorrect registers (the change of register definitions here accounts for bulk of the changed LoC in this PR)" * tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits) net: moxa: MAC address reading, generating, validity checking tcp: handle pure FIN case correctly tcp: refactor tcp_read_skb() a bit tcp: fix tcp_cleanup_rbuf() for tcp_read_skb() tcp: fix sock skb accounting in tcp_read_skb() igb: Add lock to avoid data race dt-bindings: Fix incorrect "the the" corrections net: genl: fix error path memory leak in policy dumping stmmac: intel: Add a missing clk_disable_unprepare() call in intel_eth_pci_remove() net: ethernet: mtk_eth_soc: fix possible NULL pointer dereference in mtk_xdp_run net/mlx5e: Allocate flow steering storage during uplink initialization net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset net: mscc: ocelot: make struct ocelot_stat_layout array indexable net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work net: mscc: ocelot: turn stats_lock into a spinlock net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter net: mscc: ocelot: fix incorrect ndo_get_stats64 packet counters net: dsa: felix: fix ethtool 256-511 and 512-1023 TX packet counters net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it ...	2022-08-18 19:37:15 -07:00
Martin KaFai Lau	2b5a2ecbfd	bpf: Initialize the bpf_run_ctx in bpf_iter_run_prog() The bpf-iter-prog for tcp and unix sk can do bpf_setsockopt() which needs has_current_bpf_ctx() to decide if it is called by a bpf prog. This patch initializes the bpf_run_ctx in bpf_iter_run_prog() for the has_current_bpf_ctx() to use. Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061751.4177657-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:13 -07:00
Pu Lehui	7d6620f107	bpf, cgroup: Fix kernel BUG in purge_effective_progs Syzkaller reported a triggered kernel BUG as follows: ------------[ cut here ]------------ kernel BUG at kernel/bpf/cgroup.c:925! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0 Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84 28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 <0f> 0b 48 8b 0c5 RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578 RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000 R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00 FS: 00007f68213d2b80(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0 Call Trace: <TASK> cgroup_bpf_prog_detach+0xcc/0x100 __sys_bpf+0x2273/0x2a00 __x64_sys_bpf+0x17/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f68214dbcb9 Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff8 RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9 RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009 RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003 R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20 R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- Repetition steps: For the following cgroup tree, root \| cg1 \| cg2 1. attach prog2 to cg2, and then attach prog1 to cg1, both bpf progs attach type is NONE or OVERRIDE. 2. write 1 to /proc/thread-self/fail-nth for failslab. 3. detach prog1 for cg1, and then kernel BUG occur. Failslab injection will cause kmalloc fail and fall back to purge_effective_progs. The problem is that cg2 have attached another prog, so when go through cg2 layer, iteration will add pos to 1, and subsequent operations will be skipped by the following condition, and cg will meet NULL in the end. `if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))` The NULL cg means no link or prog match, this is as expected, and it's not a bug. So here just skip the no match situation. Fixes: `4c46091ee9` ("bpf: Fix KASAN use-after-free Read in compute_effective_progs") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com	2022-08-18 23:27:33 +02:00
Jakub Kicinski	3f5f728a72	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Andrii Nakryiko says: ==================== bpf-next 2022-08-17 We've added 45 non-merge commits during the last 14 day(s) which contain a total of 61 files changed, 986 insertions(+), 372 deletions(-). The main changes are: 1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt Kanzenbach and Jesper Dangaard Brouer. 2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko. 3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov. 4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires. 5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle unsupported names on old kernels, from Hangbin Liu. 6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton, from Hao Luo. 7) Relax libbpf's requirement for shared libs to be marked executable, from Henqgi Chen. 8) Improve bpf_iter internals handling of error returns, from Hao Luo. 9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard. 10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong. 11) bpftool improvements to handle full BPF program names better, from Manu Bretelle. 12) bpftool fixes around libcap use, from Quentin Monnet. 13) BPF map internals clean ups and improvements around memory allocations, from Yafang Shao. 14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup iterator to work on cgroupv1, from Yosry Ahmed. 15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong. 16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) selftests/bpf: Few fixes for selftests/bpf built in release mode libbpf: Clean up deprecated and legacy aliases libbpf: Streamline bpf_attr and perf_event_attr initialization libbpf: Fix potential NULL dereference when parsing ELF selftests/bpf: Tests libbpf autoattach APIs libbpf: Allows disabling auto attach selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm libbpf: Making bpf_prog_load() ignore name if kernel doesn't support selftests/bpf: Update CI kconfig selftests/bpf: Add connmark read test selftests/bpf: Add existing connection bpf_*_ct_lookup() test bpftool: Clear errno after libcap's checks bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation bpftool: Fix a typo in a comment libbpf: Add names for auxiliary maps bpf: Use bpf_map_area_alloc consistently on bpf map creation bpf: Make __GFP_NOWARN consistent in bpf map creation bpf: Use bpf_map_area_free instread of kvfree bpf: Remove unneeded memset in queue_stack_map creation libbpf: preserve errno across pr_warn/pr_info/pr_debug ... ==================== Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-17 20:29:36 -07:00
David Howells	fc4aaf9fb3	net: Fix suspicious RCU usage in bpf_sk_reuseport_detach() bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags() to obtain the value of sk->sk_user_data, but that function is only usable if the RCU read lock is held, and neither that function nor any of its callers hold it. Fix this by adding a new helper, __locked_read_sk_user_data_with_flags() that checks to see if sk->sk_callback_lock() is held and use that here instead. Alternatively, making __rcu_dereference_sk_user_data_with_flags() use rcu_dereference_checked() might suffice. Without this, the following warning can be occasionally observed: ============================= WARNING: suspicious RCU usage 6.0.0-rc1-build2+ #563 Not tainted ----------------------------- include/net/sock.h:592 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by locktest/29873: #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4 stack backtrace: CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x4c/0x5f bpf_sk_reuseport_detach+0x6d/0xa4 reuseport_detach_sock+0x75/0xdd inet_unhash+0xa5/0x1c0 tcp_set_state+0x169/0x20f ? lockdep_sock_is_held+0x3a/0x3a ? __lock_release.isra.0+0x13e/0x220 ? reacquire_held_locks+0x1bb/0x1bb ? hlock_class+0x31/0x96 ? mark_lock+0x9e/0x1af __tcp_close+0x50/0x4b6 tcp_close+0x28/0x70 inet_release+0x8e/0xa7 __sock_release+0x95/0x121 sock_close+0x14/0x17 __fput+0x20f/0x36a task_work_run+0xa3/0xcc exit_to_user_mode_prepare+0x9c/0x14d syscall_exit_to_user_mode+0x18/0x44 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: `cf8c1e9672` ("net: refactor bpf_sk_reuseport_detach()") Signed-off-by: David Howells <dhowells@redhat.com> cc: Hawkins Jiawei <yin31149@gmail.com> Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-17 16:42:59 -07:00
YiFei Zhu	14b20b784f	bpf: Restrict bpf_sys_bpf to CAP_PERFMON The verifier cannot perform sufficient validation of any pointers passed into bpf_attr and treats them as integers rather than pointers. The helper will then read from arbitrary pointers passed into it. Restrict the helper to CAP_PERFMON since the security model in BPF of arbitrary kernel read is CAP_BPF + CAP_PERFMON. Fixes: `af2ac3e13e` ("bpf: Prepare bpf syscall to be used from kernel and user space.") Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220816205517.682470-1-zhuyifei@google.com	2022-08-18 00:27:49 +02:00
Tejun Heo	4f7e723643	cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock Bringing up a CPU may involve creating and destroying tasks which requires read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside cpus_read_lock(). However, cpuset's ->attach(), which may be called with thredagroup_rwsem write-locked, also wants to disable CPU hotplug and acquires cpus_read_lock(), leading to a deadlock. Fix it by guaranteeing that ->attach() is always called with CPU hotplug disabled and removing cpus_read_lock() call from cpuset_attach(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com> Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com> Fixes: `05c7b7a92c` ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+	2022-08-17 07:36:05 -10:00
Frederick Lawler	401e64b3a4	bpf-lsm: Make bpf_lsm_userns_create() sleepable Users may want to audit calls to security_create_user_ns() and access user space memory. Also create_user_ns() runs without pagefault_disabled(). Therefore, make bpf_lsm_userns_create() sleepable for mandatory access control policies. Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Frederick Lawler <fred@cloudflare.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-16 17:38:21 -04:00
Frederick Lawler	7cd4c5c210	security, lsm: Introduce security_create_user_ns() User namespaces are an effective tool to allow programs to run with permission without requiring the need for a program to run as root. User namespaces may also be used as a sandboxing technique. However, attackers sometimes leverage user namespaces as an initial attack vector to perform some exploit. [1,2,3] While it is not the unprivileged user namespace functionality, which causes the kernel to be exploitable, users/administrators might want to more granularly limit or at least monitor how various processes use this functionality, while vulnerable kernel subsystems are being patched. Preventing user namespace already creation comes in a few of forms in order of granularity: 1. /proc/sys/user/max_user_namespaces sysctl 2. Distro specific patch(es) 3. CONFIG_USER_NS To block a task based on its attributes, the LSM hook cred_prepare is a decent candidate for use because it provides more granular control, and it is called before create_user_ns(): cred = prepare_creds() security_prepare_creds() call_int_hook(cred_prepare, ... if (cred) create_user_ns(cred) Since security_prepare_creds() is meant for LSMs to copy and prepare credentials, access control is an unintended use of the hook. [4] Further, security_prepare_creds() will always return a ENOMEM if the hook returns any non-zero error code. This hook also does not handle the clone3 case which requires us to access a user space pointer to know if we're in the CLONE_NEW_USER call path which may be subject to a TOCTTOU attack. Lastly, cred_prepare is called in many call paths, and a targeted hook further limits the frequency of calls which is a beneficial outcome. Therefore introduce a new function security_create_user_ns() with an accompanying userns_create LSM hook. With the new userns_create hook, users will have more control over the observability and access control over user namespace creation. Users should expect that normal operation of user namespaces will behave as usual, and only be impacted when controls are implemented by users or administrators. This hook takes the prepared creds for LSM authors to write policy against. On success, the new namespace is applied to credentials, otherwise an error is returned. Links: 1. https://nvd.nist.gov/vuln/detail/CVE-2022-0492 2. https://nvd.nist.gov/vuln/detail/CVE-2022-25636 3. https://nvd.nist.gov/vuln/detail/CVE-2022-34918 4. https://lore.kernel.org/all/1c4b1c0d-12f6-6e9e-a6a3-cdce7418110c@schaufler-ca.com/ Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Frederick Lawler <fred@cloudflare.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-16 17:32:46 -04:00
Tetsuo Handa	c0feea594e	workqueue: don't skip lockdep work dependency in cancel_work_sync() Like Hillf Danton mentioned syzbot should have been able to catch cancel_work_sync() in work context by checking lockdep_map in __flush_work() for both flush and cancel. in [1], being unable to report an obvious deadlock scenario shown below is broken. From locking dependency perspective, sync version of cancel request should behave as if flush request, for it waits for completion of work if that work has already started execution. ---------- #include <linux/module.h> #include <linux/sched.h> static DEFINE_MUTEX(mutex); static void work_fn(struct work_struct *work) { schedule_timeout_uninterruptible(HZ / 5); mutex_lock(&mutex); mutex_unlock(&mutex); } static DECLARE_WORK(work, work_fn); static int __init test_init(void) { schedule_work(&work); schedule_timeout_uninterruptible(HZ / 10); mutex_lock(&mutex); cancel_work_sync(&work); mutex_unlock(&mutex); return -EINVAL; } module_init(test_init); MODULE_LICENSE("GPL"); ---------- The check this patch restores was added by commit `0976dfc1d0` ("workqueue: Catch more locking problems with flush_work()"). Then, lockdep's crossrelease feature was added by commit `b09be676e0` ("locking/lockdep: Implement the 'crossrelease' feature"). As a result, this check was once removed by commit `fd1a5b04df` ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes"). But lockdep's crossrelease feature was removed by commit `e966eaeeb6` ("locking/lockdep: Remove the cross-release locking checks"). At this point, this check should have been restored. Then, commit `d6e89786be` ("workqueue: skip lockdep wq dependency in cancel_work_sync()") introduced a boolean flag in order to distinguish flush_work() and cancel_work_sync(), for checking "struct workqueue_struct" dependency when called from cancel_work_sync() was causing false positives. Then, commit `87915adc3f` ("workqueue: re-add lockdep dependencies for flushing") tried to restore "struct work_struct" dependency check, but by error checked this boolean flag. Like an example shown above indicates, "struct work_struct" dependency needs to be checked for both flush_work() and cancel_work_sync(). Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1] Reported-by: Hillf Danton <hdanton@sina.com> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Fixes: `87915adc3f` ("workqueue: re-add lockdep dependencies for flushing") Cc: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-16 06:27:35 -10:00
Jilin Yuan	0351dc57b9	audit: fix repeated words in comments Delete the redundant word 'doesn't'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>	2022-08-15 22:46:09 -04:00
Hao Jia	76b079ef4c	sched/psi: Remove unused parameter nbytes of psi_trigger_create() psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-15 12:35:25 -10:00
Hao Jia	2b97cf7628	sched/psi: Zero the memory of struct psi_group After commit `5f69a6577b` ("psi: dont alloc memory for psi by default"), the memory used by struct psi_group is no longer allocated and zeroed in cgroup_create(). Since the memory of struct psi_group is not zeroed, the data in this memory is random, which will lead to inaccurate psi statistics when creating a new cgroup. So we use kzlloc() to allocate and zero the struct psi_group and remove the redundant zeroing in group_init(). Steps to reproduce: 1. Use cgroup v2 and enable CONFIG_PSI 2. Create a new cgroup, and query psi statistics mkdir /sys/fs/cgroup/test cat /sys/fs/cgroup/test/cpu.pressure some avg10=0.00 avg60=0.00 avg300=47927752200.00 total=12884901 full avg10=561815124.00 avg60=125835394188.00 avg300=1077090462000.00 total=10273561772 cat /sys/fs/cgroup/test/io.pressure some avg10=1040093132823.95 avg60=1203770351379.21 avg300=3862252669559.46 total=4294967296 full avg10=921884564601.39 avg60=0.00 avg300=1984507298.35 total=442381631 cat /sys/fs/cgroup/test/memory.pressure some avg10=232476085778.11 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=2585658472280.57 total=12884901 Fixes: commit `5f69a6577b` ("psi: dont alloc memory for psi by default") Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-15 12:35:13 -10:00
Tejun Heo	7f203bc89e	cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[] Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers directly and this makes the array useless for finding an actual ancestor cgroup forcing cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up from cgroup_ancestor(). While at it, improve comments around cgroup_root->cgrp_ancestor_storage. This patch shouldn't cause user-visible behavior differences. v2: Update cgroup_ancestor() to use ->ancestors[]. v3: cgroup_root->cgrp_ancestor_storage's type is updated to match cgroup->ancestors[]. Better comments. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org>	2022-08-15 11:16:47 -10:00
David Gow	41a55567b9	module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m The new KUnit module handling has KUnit test suites listed in a .kunit_test_suites section of each module. This should be loaded when the module is, but at the moment this only happens if KUnit is built-in. Also load this when KUnit is enabled as a module: it'll not be usable unless KUnit is loaded, but such modules are likely to depend on KUnit anyway, so it's unlikely to ever be loaded needlessly. Fixes: `3d6e446238` ("kunit: unify module and builtin suite definitions") Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>	2022-08-15 13:51:07 -06:00
Linus Torvalds	5d6a0f4da9	xen: branch for v6.0-rc1b -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYvi0yQAKCRCAXGG7T9hj vmikAQDWSrcWuxDkGnzut0A1tBQRUCWDMyKPqigWAA5tH2sPgAEAtWfBvT1xyl7T gZ22I7o21WxxDGyvNUcA65pK7c2cpg8= =UMbq -----END PGP SIGNATURE----- Merge tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - fix the handling of the "persistent grants" feature negotiation between Xen blkfront and Xen blkback drivers - a cleanup of xen.config and adding xen.config to Xen section in MAINTAINERS - support HVMOP_set_evtchn_upcall_vector, which is more compliant to "normal" interrupt handling than the global callback used up to now - further small cleanups * tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: MAINTAINERS: add xen config fragments to XEN HYPERVISOR sections xen: remove XEN_SCRUB_PAGES in xen.config xen/pciback: Fix comment typo xen/xenbus: fix return type in xenbus_file_read() xen-blkfront: Apply 'feature_persistent' parameter when connect xen-blkback: Apply 'feature_persistent' parameter when connect xen-blkback: fix persistent grants negotiation x86/xen: Add support for HVMOP_set_evtchn_upcall_vector	2022-08-14 09:28:54 -07:00
Linus Torvalds	f6eb0fed6a	Misc timer fixes: - fix a potential use-after-free bug in posix timers - correct a prototype - address a build warning Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmL3epQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iPZw/+I/9GXcf3SzbG5M6Nf21SJpSjC4hAHHgb eyv5MUNxKvCHU5iT2SrCvgKjESl5I/E70kubeRHJnvarBPUzGnHHzGlYIYOaJPQ7 irJpUj/6R8ps4UsMBJ8vj5f3b7163zhBJVP8egDW6roT1HUrYTFeIjIli/SOCxpY H1/DqHlbEALE5o5xykg3zuqAbywym+hNRleIVls4wqjZNnfqiTElSuW9xqw9xt3n 9xYmOKZaztdv5Lp2JCm7QOu2byGzeHje72ppsDcBZ3EBvHUBLSndhfe5NQUGhtxy UlBqAELA653uPgPnNKLRMqt/kop8emHqvAx8T0RawPwoUS6XGDVxRX+my8+HKklg P8KsM/8W7+3KTHz0bf72DEHTFiXCzlswRzdOSvP5bR4xw1G4ychzvuxAiPDFR3zT v7uPgykxxCrEexVCBCdPmrl4WikwLJtcrSXtJ4bsisxQFlq7WWd2/osZkTffI3pN IIxDXuHFHC78lrUMk2OQ+ITBz01z4nCFSlgMGZ6ZY6ppS1Rndy1HG/B2NgjW1zGP Y/1xq/nWaql0QO7RmyoJXt1ZSMJYCyKFocRDh9nBmtBSlYm3A8aIA8b4i1VRRG1G 8HOkdS8ef2eOWj8wqk0NvoTbiGjV7YM5pf0g1dmRLA+aGCBD1P9/iFcBv5b6Uxaq qZ7ZtuQzsyc= =Plg8 -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Misc timer fixes: - fix a potential use-after-free bug in posix timers - correct a prototype - address a build warning" * tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Cleanup CPU timers before freeing them during exec time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64 posix-timers: Make do_clock_gettime() static	2022-08-13 14:38:22 -07:00
Linus Torvalds	1da8cf961b	io_uring-6.0-2022-08-13 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL3+fQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmXyEACfERdKYdZ/W3IvPoyK8CJ3p7f/6SOj2/p1 DTuaa3l7/kVq2HcRUGgZwvgeWpOCFghdBm5co/4hGqSw7bT8rERGDelo41ohhTfr xKIiwJflK/s280VXLJFA+o7Jeoj1oTFYCmdUmU3wcKFVnQdu1rz9s0L6bwsEqq93 y1uty96dxYZn2mENLbBah0x9yV0h2ZxRkguUm0sdnKl/tMkUVLSD1TPLHf2s6eAL o3Dbmo9jv4HFXoJj8YL50Oxl22zIKBHl9hZqHdLcKesFgyFTChckKUNijWyPL2vE zesbnd57sXgY6ghi4LDGeCOtN41WNjiVeAm/c4XK5oFhTag8Q2x0D1hTPUByHksl IV/116xs6pHTeZRhNlMOBVMZGLSz95zSuRUyTONAmKgc/b3if/w3zTi1W3CnJSlx 7O5GpqQDZTQuin0jldNKImbx1aPAATb+UWDkl7O5aXkjw4FUtxT5GrYcBNswVuKX iybx8NyVn8kFD1hix3U8huBOPSg1JMkR+sFml+NqYRd4i2CwV8KAPPuzsPw6MRBL U4DfkAkpsbKqSK+mri5aUrYxmpYkJ45mgyldiewiOso9+AYg9DDp3D2iGgAiRbKm i3pz1Gh/3iUow0UAI5ZFlDhjHgWPlIH7IBbemivhjhFV4GrXJqTwUzsA1iDKTe14 3lHKkAPVPA== =FfLf -----END PGP SIGNATURE----- Merge tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Regression fix for this merge window, fixing a wrong order of arguments for io_req_set_res() for passthru (Dylan) - Fix for the audit code leaking context memory (Peilin) - Ensure that provided buffers are memcg accounted (Pavel) - Correctly handle short zero-copy sends (Pavel) - Sparse warning fixes for the recvmsg multishot command (Dylan) - Error handling fix for passthru (Anuj) - Remove randomization of struct kiocb fields, to avoid it growing in size if re-arranged in such a fashion that it grows more holes or padding (Keith, Linus) - Small series improving type safety of the sqe fields (Stefan) * tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block: io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields io_uring: make io_kiocb_to_cmd() typesafe fs: don't randomize struct kiocb fields io_uring: consistently make use of io_notif_to_data() io_uring: fix error handling for io_uring_cmd io_uring: fix io_recvmsg_prep_multishot sparse warnings io_uring/net: send retry for zerocopy io_uring: mem-account pbuf buckets audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker() io_uring: pass correct parameters to io_req_set_res	2022-08-13 13:28:54 -07:00
Lukas Bulwahn	aa6d1e5b50	xen: remove XEN_SCRUB_PAGES in xen.config Commit `197ecb3802` ("xen/balloon: add runtime control for scrubbing ballooned out pages") changed config XEN_SCRUB_PAGES to config XEN_SCRUB_PAGES_DEFAULT. As xen.config sets 'XEN_BALLOON=y' and XEN_SCRUB_PAGES_DEFAULT defaults to yes, there is no further need to set this config in the xen.config file. Remove setting XEN_SCRUB_PAGES in xen.config, which is without effect since the commit above anyway. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20220810050712.9539-3-lukas.bulwahn@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>	2022-08-12 12:22:23 +02:00
Ingo Molnar	09348d75a6	sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com	2022-08-12 11:25:10 +02:00
Linus Torvalds	7ebfc85e2c	Including fixes from bluetooth, bpf, can and netfilter. A little longer PR than usual but it's all fixes, no late features. It's long partially because of timing, and partially because of follow ups to stuff that got merged a week or so before the merge window and wasn't as widely tested. Maybe the Bluetooth fixes are a little alarming so we'll address that, but the rest seems okay and not scary. Notably we're including a fix for the netfilter Kconfig [1], your WiFi warning [2] and a bluetooth fix which should unblock syzbot [3]. Current release - regressions: - Bluetooth: - don't try to cancel uninitialized works [3] - L2CAP: fix use-after-free caused by l2cap_chan_put - tls: rx: fix device offload after recent rework - devlink: fix UAF on failed reload and leftover locks in mlxsw Current release - new code bugs: - netfilter: - flowtable: fix incorrect Kconfig dependencies [1] - nf_tables: fix crash when nf_trace is enabled - bpf: - use proper target btf when exporting attach_btf_obj_id - arm64: fixes for bpf trampoline support - Bluetooth: - ISO: unlock on error path in iso_sock_setsockopt() - ISO: fix info leak in iso_sock_getsockopt() - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP - ISO: fix memory corruption on iso_pinfo.base - ISO: fix not using the correct QoS - hci_conn: fix updating ISO QoS PHY - phy: dp83867: fix get nvmem cell fail Previous releases - regressions: - wifi: cfg80211: fix validating BSS pointers in __cfg80211_connect_result [2] - atm: bring back zatm uAPI after ATM had been removed - properly fix old bug making bonding ARP monitor mode not being able to work with software devices with lockless Tx - tap: fix null-deref on skb->dev in dev_parse_header_protocol - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some devices and breaks others - netfilter: - nf_tables: many fixes rejecting cross-object linking which may lead to UAFs - nf_tables: fix null deref due to zeroed list head - nf_tables: validate variable length element extension - bgmac: fix a BUG triggered by wrong bytes_compl - bcmgenet: indicate MAC is in charge of PHY PM Previous releases - always broken: - bpf: - fix bad pointer deref in bpf_sys_bpf() injected via test infra - disallow non-builtin bpf programs calling the prog_run command - don't reinit map value in prealloc_lru_pop - fix UAFs during the read of map iterator fd - fix invalidity check for values in sk local storage map - reject sleepable program for non-resched map iterator - mptcp: - move subflow cleanup in mptcp_destroy_common() - do not queue data on closed subflows - virtio_net: fix memory leak inside XDP_TX with mergeable - vsock: fix memory leak when multiple threads try to connect() - rework sk_user_data sharing to prevent psock leaks - geneve: fix TOS inheriting for ipv4 - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel - phy: c45 baset1: do not skip aneg configuration if clock role is not specified - rose: avoid overflow when /proc displays timer information - x25: fix call timeouts in blocking connects - can: mcp251x: fix race condition on receive interrupt - can: j1939: - replace user-reachable WARN_ON_ONCE() with netdev_warn_once() - fix memory leak of skbs in j1939_session_destroy() Misc: - docs: bpf: clarify that many things are not uAPI - seg6: initialize induction variable to first valid array index (to silence clang vs objtool warning) - can: ems_usb: fix clang 14's -Wunaligned-access warning Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL1TtkACgkQMUZtbf5S Iruz8Q/+O5xFFsjxuyZD0Mw9d3Jeo3ZI9PeeDvcYl5dZXVegpxqorujTFntxv1Ad JC8o5qqms3kO51d+W/yai6iDacEHX2YcJrupZve+vGvpOEVmBRY5O0E1AckJ18+u ItmjSVESkybUP5P08/An7Y0dMmj9Xb2z84dGkLe+n8lg6/fimo6Ki6yZjcOBOALu AYquMXUcnwztRMbTFjscbJjBd4xFMKZEtthljYtPdIReIN976wmMNYYx+jcPK7ha g39Kv6maklp4euerkGIJ/AMnOWHaOGCFjIaz7rr4444NDfrKdt/jeirUXJaz77Jo TJM2UOwgOeg6WZkSa3cmdq6UdjdkJ6LTe2CJFf1wJ1qfhAi+s8yWoszsM2Enp+66 c/mo9jTCMAjmgEJF11idZuz2S697/5j0hvbfM3ZPgNyNBgn8qxz/Z56fNOisx95u TkoKKFnGH+mcm/et+omBcyLBtBVK2+/6B6mpl6btf4DOkPn5KFYWHV67uV3ksHzQ ye+pnzidoIG0yKbRM2EQKXk7ELKROpl52xUHko93ZinMJt0Q7jBm7tZhJozNFEzi hWgUvpmNXgawzLYQcJ9jJmKw3PmYZnRhvYZB/1r91YamM28Hd58k9WfpWtUtjYJN N0X58L6JSnKPqzR70pcFppz6iBlh0tHdcEQGWhhKU5ScS3FDxGc= =C5Ck -----END PGP SIGNATURE----- Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, bpf, can and netfilter. A little larger than usual but it's all fixes, no late features. It's large partially because of timing, and partially because of follow ups to stuff that got merged a week or so before the merge window and wasn't as widely tested. Maybe the Bluetooth fixes are a little alarming so we'll address that, but the rest seems okay and not scary. Notably we're including a fix for the netfilter Kconfig [1], your WiFi warning [2] and a bluetooth fix which should unblock syzbot [3]. Current release - regressions: - Bluetooth: - don't try to cancel uninitialized works [3] - L2CAP: fix use-after-free caused by l2cap_chan_put - tls: rx: fix device offload after recent rework - devlink: fix UAF on failed reload and leftover locks in mlxsw Current release - new code bugs: - netfilter: - flowtable: fix incorrect Kconfig dependencies [1] - nf_tables: fix crash when nf_trace is enabled - bpf: - use proper target btf when exporting attach_btf_obj_id - arm64: fixes for bpf trampoline support - Bluetooth: - ISO: unlock on error path in iso_sock_setsockopt() - ISO: fix info leak in iso_sock_getsockopt() - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP - ISO: fix memory corruption on iso_pinfo.base - ISO: fix not using the correct QoS - hci_conn: fix updating ISO QoS PHY - phy: dp83867: fix get nvmem cell fail Previous releases - regressions: - wifi: cfg80211: fix validating BSS pointers in __cfg80211_connect_result [2] - atm: bring back zatm uAPI after ATM had been removed - properly fix old bug making bonding ARP monitor mode not being able to work with software devices with lockless Tx - tap: fix null-deref on skb->dev in dev_parse_header_protocol - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some devices and breaks others - netfilter: - nf_tables: many fixes rejecting cross-object linking which may lead to UAFs - nf_tables: fix null deref due to zeroed list head - nf_tables: validate variable length element extension - bgmac: fix a BUG triggered by wrong bytes_compl - bcmgenet: indicate MAC is in charge of PHY PM Previous releases - always broken: - bpf: - fix bad pointer deref in bpf_sys_bpf() injected via test infra - disallow non-builtin bpf programs calling the prog_run command - don't reinit map value in prealloc_lru_pop - fix UAFs during the read of map iterator fd - fix invalidity check for values in sk local storage map - reject sleepable program for non-resched map iterator - mptcp: - move subflow cleanup in mptcp_destroy_common() - do not queue data on closed subflows - virtio_net: fix memory leak inside XDP_TX with mergeable - vsock: fix memory leak when multiple threads try to connect() - rework sk_user_data sharing to prevent psock leaks - geneve: fix TOS inheriting for ipv4 - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel - phy: c45 baset1: do not skip aneg configuration if clock role is not specified - rose: avoid overflow when /proc displays timer information - x25: fix call timeouts in blocking connects - can: mcp251x: fix race condition on receive interrupt - can: j1939: - replace user-reachable WARN_ON_ONCE() with netdev_warn_once() - fix memory leak of skbs in j1939_session_destroy() Misc: - docs: bpf: clarify that many things are not uAPI - seg6: initialize induction variable to first valid array index (to silence clang vs objtool warning) - can: ems_usb: fix clang 14's -Wunaligned-access warning" * tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits) net: atm: bring back zatm uAPI dpaa2-eth: trace the allocated address instead of page struct net: add missing kdoc for struct genl_multicast_group::flags nfp: fix use-after-free in area_cache_get() MAINTAINERS: use my korg address for mt7601u mlxsw: minimal: Fix deadlock in ports creation bonding: fix reference count leak in balance-alb mode net: usb: qmi_wwan: Add support for Cinterion MV32 bpf: Shut up kern_sys_bpf warning. net/tls: Use RCU API to access tls_ctx->netdev tls: rx: device: don't try to copy too much on detach tls: rx: device: bound the frag walk net_sched: cls_route: remove from list when handle is 0 selftests: forwarding: Fix failing tests with old libnet net: refactor bpf_sk_reuseport_detach() net: fix refcount bug in sk_psock_get (2) selftests/bpf: Ensure sleepable program is rejected by hash map iter selftests/bpf: Add write tests for sk local storage map iterator selftests/bpf: Add tests for reading a dangling map iter fd bpf: Only allow sleepable program for resched-able iterator ...	2022-08-11 13:45:37 -07:00
Alexei Starovoitov	4e4588f1c4	bpf: Shut up kern_sys_bpf warning. Shut up this warning: kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes] int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size) Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 23:58:13 -07:00
Jakub Kicinski	fbe8870f72	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== bpf 2022-08-10 We've added 23 non-merge commits during the last 7 day(s) which contain a total of 19 files changed, 424 insertions(+), 35 deletions(-). The main changes are: 1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao. 2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia. 3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu. 4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev. 5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney. 6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi. 7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa. 8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai. 9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address offset buffer, from Aijun Sun. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits) selftests/bpf: Ensure sleepable program is rejected by hash map iter selftests/bpf: Add write tests for sk local storage map iterator selftests/bpf: Add tests for reading a dangling map iter fd bpf: Only allow sleepable program for resched-able iterator bpf: Check the validity of max_rdwr_access for sock local storage map iterator bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator bpf: Acquire map uref in .init_seq_private for sock local storage map iterator bpf: Acquire map uref in .init_seq_private for hash map iterator bpf: Acquire map uref in .init_seq_private for array map iterator bpf: Disallow bpf programs call prog_run command. bpf, arm64: Fix bpf trampoline instruction endianness selftests/bpf: Add test for prealloc_lru_pop bug bpf: Don't reinit map value in prealloc_lru_pop bpf: Allow calling bpf_prog_test kfuncs in tracing programs bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf bpf: Use proper target btf when exporting attach_btf_obj_id mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled bpf: Cleanup ftrace hash in bpf_trampoline_put BPF: Fix potential bad pointer dereference in bpf_sys_bpf() ... ==================== Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-10 21:48:15 -07:00
Hawkins Jiawei	cf8c1e9672	net: refactor bpf_sk_reuseport_detach() Refactor sk_user_data dereference using more generic function __rcu_dereference_sk_user_data_with_flags(), which improve its maintainability Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-10 21:48:04 -07:00
Yafang Shao	73cf09a36b	bpf: Use bpf_map_area_alloc consistently on bpf map creation Let's use the generic helper bpf_map_area_alloc() instead of the open-coded kzalloc helpers in bpf maps creation path. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20220810151840.16394-5-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 11:50:43 -07:00
Yafang Shao	992c9e13f5	bpf: Make __GFP_NOWARN consistent in bpf map creation Some of the bpf maps are created with __GFP_NOWARN, i.e. arraymap, bloom_filter, bpf_local_storage, bpf_struct_ops, lpm_trie, queue_stack_maps, reuseport_array, stackmap and xskmap, while others are created without __GFP_NOWARN, i.e. cpumap, devmap, hashtab, local_storage, offload, ringbuf and sock_map. But there are not key differences between the creation of these maps. So let make this allocation flag consistent in all bpf maps creation. Then we can use a generic helper to alloc all bpf maps. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20220810151840.16394-4-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 11:49:25 -07:00
Yafang Shao	8f58ee54c2	bpf: Use bpf_map_area_free instread of kvfree bpf_map_area_alloc() should be paired with bpf_map_area_free(). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20220810151840.16394-3-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 11:48:44 -07:00
Yafang Shao	083818156d	bpf: Remove unneeded memset in queue_stack_map creation __GFP_ZERO will clear the memory, so we don't need to memset it. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20220810151840.16394-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 11:48:22 -07:00
Linus Torvalds	c235698355	cxl for 6.0 - Introduce a 'struct cxl_region' object with support for provisioning and assembling persistent memory regions. - Introduce alloc_free_mem_region() to accompany the existing request_free_mem_region() as a method to allocate physical memory capacity out of an existing resource. - Export insert_resource_expand_to_fit() for the CXL subsystem to late-publish CXL platform windows in iomem_resource. - Add a polled mode PCI DOE (Data Object Exchange) driver service and use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute Table). -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYvLYmAAKCRDfioYZHlFs Z0pbAQC/3j+WriWpU7CdhrnZI1Wqn+x5IIklF0Lc4/f6LwGZtAEAsSbLpItzvwqx M/rcLaeLpwYlgvS1JjdsuQ2VQ7KOtAs= =ehNT -----END PGP SIGNATURE----- Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl updates from Dan Williams: "Compute Express Link (CXL) updates for 6.0: - Introduce a 'struct cxl_region' object with support for provisioning and assembling persistent memory regions. - Introduce alloc_free_mem_region() to accompany the existing request_free_mem_region() as a method to allocate physical memory capacity out of an existing resource. - Export insert_resource_expand_to_fit() for the CXL subsystem to late-publish CXL platform windows in iomem_resource. - Add a polled mode PCI DOE (Data Object Exchange) driver service and use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute Table)" * tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits) cxl/hdm: Fix skip allocations vs multiple pmem allocations cxl/region: Disallow region granularity != window granularity cxl/region: Fix x1 interleave to greater than x1 interleave routing cxl/region: Move HPA setup to cxl_region_attach() cxl/region: Fix decoder interleave programming Documentation: cxl: remove dangling kernel-doc reference cxl/region: describe targets and nr_targets members of cxl_region_params cxl/regions: add padding for cxl_rr_ep_add nested lists cxl/region: Fix IS_ERR() vs NULL check cxl/region: Fix region reference target accounting cxl/region: Fix region commit uninitialized variable warning cxl/region: Fix port setup uninitialized variable warnings cxl/region: Stop initializing interleave granularity cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime cxl/acpi: Minimize granularity for x1 interleaves cxl/region: Delete 'region' attribute from root decoders cxl/acpi: Autoload driver for 'cxl_acpi' test devices cxl/region: decrement ->nr_targets on error in cxl_region_attach() cxl/region: prevent underflow in ways_to_cxl() cxl/region: uninitialized variable in alloc_hpa() ...	2022-08-10 11:07:26 -07:00
Hou Tao	d247049f4f	bpf: Only allow sleepable program for resched-able iterator When a sleepable program is attached to a hash map iterator, might_fault() will report "BUG: sleeping function called from invalid context..." if CONFIG_DEBUG_ATOMIC_SLEEP is enabled. The reason is that rcu_read_lock() is held in bpf_hash_map_seq_next() and won't be released until all elements are traversed or bpf_hash_map_seq_stop() is called. Fixing it by reusing BPF_ITER_RESCHED to indicate that only non-sleepable program is allowed for iterator without BPF_ITER_RESCHED. We can revise bpf_iter_link_attach() later if there are other conditions which may cause rcu_read_lock() or spin_lock() issues. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 10:12:48 -07:00
Hou Tao	ef1e93d2ee	bpf: Acquire map uref in .init_seq_private for hash map iterator bpf_iter_attach_map() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in bpf_iter_detach_map() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). So acquiring an extra map uref in bpf_iter_init_hash_map() and releasing it in bpf_iter_fini_hash_map(). Fixes: `d6c4503cc2` ("bpf: Implement bpf iterator for hash maps") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 10:12:47 -07:00
Hou Tao	f76fa6b338	bpf: Acquire map uref in .init_seq_private for array map iterator bpf_iter_attach_map() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in bpf_iter_detach_map() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). Alternative fix is acquiring an extra bpf_link reference just like a pinned map iterator does, but it introduces unnecessary dependency on bpf_link instead of bpf_map. So choose another fix: acquiring an extra map uref in .init_seq_private for array map iterator. Fixes: `d3cc2ab546` ("bpf: Implement bpf iterator for array maps") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 10:12:47 -07:00
Alexei Starovoitov	86f44fcec2	bpf: Disallow bpf programs call prog_run command. The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN command from within the program. To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf() kernel function that can only be used by the kernel light skeleton directly. Reported-by: YiFei Zhu <zhuyifei@google.com> Fixes: `b1d18a7574` ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 09:43:07 -07:00
Artem Savkov	1337905964	bpf: export crash_kexec() as destructive kfunc Allow properly marked bpf programs to call crash_kexec(). Signed-off-by: Artem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220810065905.475418-3-asavkov@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 09:22:18 -07:00
Artem Savkov	4dd48c6f1f	bpf: add destructive kfunc flag Add KF_DESTRUCTIVE flag for destructive functions. Functions with this flag set will require CAP_SYS_BOOT capabilities. Signed-off-by: Artem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220810065905.475418-2-asavkov@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-10 09:22:18 -07:00
Kumar Kartikeya Dwivedi	275c30bcee	bpf: Don't reinit map value in prealloc_lru_pop The LRU map that is preallocated may have its elements reused while another program holds a pointer to it from bpf_map_lookup_elem. Hence, only check_and_free_fields is appropriate when the element is being deleted, as it ensures proper synchronization against concurrent access of the map value. After that, we cannot call check_and_init_map_value again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while they can be concurrently accessed from a BPF program. This is safe to do as when the map entry is deleted, concurrent access is protected against by check_and_free_fields, i.e. an existing timer would be freed, and any existing kptr will be released by it. The program can create further timers and kptrs after check_and_free_fields, but they will eventually be released once the preallocated items are freed on map destruction, even if the item is never reused again. Hence, the deleted item sitting in the free list can still have resources attached to it, and they would never leak. With spin_lock, we never touch the field at all on delete or update, as we may end up modifying the state of the lock. Since the verifier ensures that a bpf_spin_lock call is always paired with bpf_spin_unlock call, the program will eventually release the lock so that on reuse the new user of the value can take the lock. Essentially, for the preallocated case, we must assume that the map value may always be in use by the program, even when it is sitting in the freelist, and handle things accordingly, i.e. use proper synchronization inside check_and_free_fields, and never reinitialize the special fields when it is reused on update. Fixes: `68134668c1` ("bpf: Add map side support for bpf timers.") Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220809213033.24147-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 18:46:11 -07:00
Joanne Koong	883743422c	bpf: Fix ref_obj_id for dynptr data slices in verifier When a data slice is obtained from a dynptr (through the bpf_dynptr_data API), the ref obj id of the dynptr must be found and then associated with the data slice. The ref obj id of the dynptr must be found before the caller saved regs are reset. Without this fix, the ref obj id tracking is not correct for dynptrs that are at an offset from the frame pointer. Please also note that the data slice's ref obj id must be assigned after the ret types are parsed, since RET_PTR_TO_ALLOC_MEM-type return regs get zero-marked. Fixes: `34d4ef5775` ("bpf: Add dynptr data slices") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20220809214055.4050604-1-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 18:38:12 -07:00
Youngmin Nam	46dae32fe6	time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64 In ns_to_kernel_old_timeval() definition, the function argument is defined with const identifier in kernel/time/time.c, but the prototype in include/linux/time32.h looks different. - The function is defined in kernel/time/time.c as below: struct __kernel_old_timeval ns_to_kernel_old_timeval(const s64 nsec) - The function is decalared in include/linux/time32.h as below: extern struct __kernel_old_timeval ns_to_kernel_old_timeval(s64 nsec); Because the variable of arithmethic types isn't modified in the calling scope, there's no need to mark arguments as const, which was already mentioned during review (Link[1) of the original patch. Likewise remove the "const" keyword in both definition and declaration of ns_to_timespec64() as requested by Arnd (Link[2]). Fixes: `a84d116916` ("y2038: Introduce struct __kernel_old_timeval") Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/all/20220712094715.2918823-1-youngmin.nam@samsung.com Link[1]: https://lore.kernel.org/all/20180310081123.thin6wphgk7tongy@gmail.com/ Link[2]: https://lore.kernel.org/all/CAK8P3a3nknJgEDESGdJH91jMj6R_xydFqWASd8r5BbesdvMBgA@mail.gmail.com/	2022-08-09 20:02:13 +02:00
Yonghong Song	a00ed84301	bpf: Always return corresponding btf_type in __get_type_size() Currently in funciton __get_type_size(), the corresponding btf_type is returned only in invalid cases. Let us always return btf_type regardless of valid or invalid cases. Such a new functionality will be used in subsequent patches. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220807175116.4179242-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 10:30:45 -07:00
Jesper Dangaard Brouer	c8996c98f7	bpf: Add BPF-helper for accessing CLOCK_TAI Commit `3dc6ffae2d` ("timekeeping: Introduce fast accessor to clock tai") introduced a fast and NMI-safe accessor for CLOCK_TAI. Especially in time sensitive networks (TSN), where all nodes are synchronized by Precision Time Protocol (PTP), it's helpful to have the possibility to generate timestamps based on CLOCK_TAI instead of CLOCK_MONOTONIC. With a BPF helper for TAI in place, it becomes very convenient to correlate activity across different machines in the network. Use cases for such a BPF helper include functionalities such as Tx launch time (e.g. ETF and TAPRIO Qdiscs) and timestamping. Note: CLOCK_TAI is nothing new per se, only the NMI-safe variant of it is. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> [Kurt: Wrote changelog and renamed helper] Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Link: https://lore.kernel.org/r/20220809060803.5773-2-kurt@linutronix.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 09:47:13 -07:00
Dave Marchevsky	b2d8ef19c6	bpf: Cleanup check_refcount_ok Discussion around a recently-submitted patch provided historical context for check_refcount_ok [0]. Specifically, the function and its helpers - may_be_acquire_function and arg_type_may_be_refcounted - predate the OBJ_RELEASE type flag and the addition of many more helpers with acquire/release semantics. The purpose of check_refcount_ok is to ensure: 1) Helper doesn't have multiple uses of return reg's ref_obj_id 2) Helper with release semantics only has one arg needing to be released, since that's tracked using meta->ref_obj_id With current verifier, it's safe to remove check_refcount_ok and its helpers. Since addition of OBJ_RELEASE type flag, case 2) has been handled by the arg_type_is_release check in check_func_arg. To ensure case 1) won't result in verifier silently prioritizing one use of ref_obj_id, this patch adds a helper_multiple_ref_obj_use check which fails loudly if a helper passes > 1 test for use of ref_obj_id. [0]: lore.kernel.org/bpf/20220713234529.4154673-1-davemarchevsky@fb.com Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220808171559.3251090-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 09:43:31 -07:00
Hao Luo	be3bb83dab	bpf, iter: Fix the condition on p when calling stop. In bpf_seq_read, seq->op->next() could return an ERR and jump to the label stop. However, the existing code in stop does not handle the case when p (returned from next()) is an ERR. Adds the handling of ERR of p by converting p into an error and jumping to done. Because all the current implementations do not have a case that returns ERR from next(), so this patch doesn't have behavior changes right now. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220805214821.1058337-4-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 09:12:00 -07:00
Yosry Ahmed	f3a2aebdd6	cgroup: enable cgroup_get_from_file() on cgroup1 cgroup_get_from_file() currently fails with -EBADF if called on cgroup v1. However, the current implementation works on cgroup v1 as well, so the restriction is unnecessary. This enabled cgroup_get_from_fd() to work on cgroup v1, which would be the only thing stopping bpf cgroup_iter from supporting cgroup v1. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220805214821.1058337-3-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 09:11:41 -07:00
Benjamin Tissoires	fa96b24204	btf: Add a new kfunc flag which allows to mark a function to be sleepable This allows to declare a kfunc as sleepable and prevents its use in a non sleepable program. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Co-developed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220805214821.1058337-2-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-09 09:11:17 -07:00
Linus Torvalds	5d5d353bed	remoteproc updates for v5.20 This introduces support for the remoteproc on Mediatek MT8188, and enables caches for MT8186 SCP. It adds support for PRU cores found on the TI K3 AM62x SoCs. It moves the recovery work after a firmware crash to an unbound workqueue, to allow recovery to happen in parallel. A new DMA API is introduced to release dma_mem for a device. It adds support a panic handler for the Qualcomm modem remoteproc, with the goal of having caches flushed in memory dumps for post-mortem debugging and it introduces a mechanism to wait for the modem firmware on SM8450 to decrypt part of its memory for post-mortem debugging. Qualcomm sysmon is restricted to only inform remote processors about peers that are actually running, to avoid a race where Linux tries to notify a recovering remote processor about its peers new state. A mechanism for waiting for the sysmon connection to be established is also introduced, to avoid out-of-sync updates for rapidly restarting remote processors. A number of Devicetree binding cleanups and conversions to YAML are introduced, to facilitate Devicetree validation. Lastly it introduces a number of smaller fixes and cleanups in the core and a few different drivers. -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmLxXTUbHGJqb3JuLmFu ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3F6lcQAKEAtkd7dRChx5Y11h8J BdUmqYTGrlZCfZhGePgUgm9KXvf+BwjnYgZGNPVsno0h9/taY6pWggGz1/hMeD97 oTFrzZreOEHmrB7tKCQmzKdHzlVaf1aMifzz1BkICH+TRG2t/V3ycr+KJhyCK6IV CcsQ6D4FRdVDTWHEizWRewO7uFzaA3CWlr7uSY99aDMXikxSSGU7TgkH8ac04TU/ Z1+X2uClOa7IzaQX6dSm5lzZGDACatA0+WLFBf6LlEC2XtywKxPHq60QjWQwuXth /5mljBbIyW+5Qblm1r1gaipOCd6bGUvlY+0TdqbLlK8LpNIpDjFrt1mrmT4N2T+6 OAEyXglFvqHG8qjDafew5SxOEYbmFCMJ/oY+akNmpKS7Hhwx3AHeiZJdtu+bDY3O JeMQVCqrdMbrdBTNPJEjkTnhWCu1fPTn8STGaAEHgxsOPkarEtk37DuEy6KcV4It RTFY4mfnJrTfNeFpm60tOxg/zGYTjXol7uqY7BUTB7bV82W5+UTVGlpO8ayHvxru MwtN0HIDH/liXEsbt8INATXTEiTwJmEiqga53/EEWhMtnor3/xE2e26TZwzfq3sB Ue8TXnuQEN+v/ThHHvjyOZH0MONivYiW6iHkAuzq0RdnHIVDrFD/YQusWpxj7uuM nuk9OY0SbxMvUXIFKucg7zXJ =gbAX -----END PGP SIGNATURE----- Merge tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux Pull remoteproc updates from Bjorn Andersson: "This introduces support for the remoteproc on Mediatek MT8188, and enables caches for MT8186 SCP. It adds support for PRU cores found on the TI K3 AM62x SoCs. It moves the recovery work after a firmware crash to an unbound workqueue, to allow recovery to happen in parallel. A new DMA API is introduced to release dma_mem for a device. It adds support a panic handler for the Qualcomm modem remoteproc, with the goal of having caches flushed in memory dumps for post-mortem debugging and it introduces a mechanism to wait for the modem firmware on SM8450 to decrypt part of its memory for post-mortem debugging. Qualcomm sysmon is restricted to only inform remote processors about peers that are actually running, to avoid a race where Linux tries to notify a recovering remote processor about its peers new state. A mechanism for waiting for the sysmon connection to be established is also introduced, to avoid out-of-sync updates for rapidly restarting remote processors. A number of Devicetree binding cleanups and conversions to YAML are introduced, to facilitate Devicetree validation. Lastly it introduces a number of smaller fixes and cleanups in the core and a few different drivers" * tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: (42 commits) remoteproc: qcom_q6v5_pas: Do not fail if regulators are not found drivers/remoteproc: fix repeated words in comments remoteproc: Directly use ida_alloc()/free() remoteproc: Use unbounded workqueue for recovery work remoteproc: using pm_runtime_resume_and_get instead of pm_runtime_get_sync remoteproc: qcom_q6v5_pas: Deal silently with optional px and cx regulators remoteproc: sysmon: Send sysmon state only for running rprocs remoteproc: sysmon: Wait for SSCTL service to come up remoteproc: qcom: q6v5: Set q6 state to offline on receiving wdog irq remoteproc: qcom: pas: Check if coredump is enabled remoteproc: qcom: pas: Mark devices as wakeup capable remoteproc: qcom: pas: Mark va as io memory remoteproc: qcom: pas: Add decrypt shutdown support for modem remoteproc: qcom: q6v5-mss: add powerdomains to MSM8996 config remoteproc: qcom_q6v5: Introduce panic handler for MSS remoteproc: qcom_q6v5_mss: Update MBA log info remoteproc: qcom: correct kerneldoc remoteproc: qcom_q6v5_mss: map/unmap metadata region before/after use remoteproc: qcom: using pm_runtime_resume_and_get to simplify the code remoteproc: mediatek: Support MT8188 SCP ...	2022-08-08 15:16:29 -07:00
Linus Torvalds	d5af75f77c	sysctl updates for 6.0 There isn't much for 6.0 for sysctl stuff, most of the stuff went through the networking subsystem (Kuniyuki Iwashima's trove of fixes using READ_ONCE/WRITE_ONCE helpe) as most of the issues there have been identified on networking side. So it is good we don't have much updates as we would have ended up with tons of conflicts. I rebased my delta just now to your tree so to avoid conflicts with that stuff. This merge request is just minor fluff cleanups then. Perhaps for 6.1 kernel/sysctl.c will get more love than this release. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmLxPncSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinvvEP/jn5mnSp4QJzwHboahHmdFUToi90d+CW ah7Mvl//QlYuK9wLFXeYKI6D9Y9kBV9VzA9tB/HSElqafqX4l57wCNf+44fnJyrs FlYRPWRFXbbklslHv6hltv/X7FNe2iwcNQM2JV6V25HKULzYuOZ1bbKRAL6fRF77 xnG9v70gU/5twyxFj7aKNLx+koWQxpnqTwmehDwl94audCL4BpyG/cVarGyQMu1x hdeeTgOfnwYoNCCFROGW5s56P/SdwQEdfQcN6pQTVXqgdmg5hStOh5+G13IUU04z Fvs6oKDoNlnjc6Wxh88LAiMlu0LRi2H7/2PyclhwP8JQj9eC9Qd2cKixjwnG2PfG th+Pg+6mIJs66s0UeloZbFCBMq7kavDvbxqg62/r8OrB3YUOMoFUPCBd+ZvjqmpC V5R3g272a1exj+IjNbitwukrx3yNYDiR1fWaY78ydwQUX54/5OCfdJogx+/NaaX9 29ww7N2mXl52q3XBCSp1tEkDN4d6TxFSDZVCEZxUukNZv5QuXJMMHboN6DxzVS3w fsbPhYzWgGFqMnDPU2jLCbT5QyD4nTzZ/2x+HPP+I8BpmKffQ+uPxh+wb2nKKyHI I9VylC92Fleto/NtB+eb7WIqvCoILHS7cf0/TF18Mync8dXzyFZOvFOZLDSiPFq1 Fhac4kSyIUZR =21dd -----END PGP SIGNATURE----- Merge tag 'sysctl-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "There isn't much for 6.0 for sysctl stuff, most of the stuff went through the networking subsystem (Kuniyuki Iwashima's trove of fixes using READ_ONCE/WRITE_ONCE helpers) as most of the issues there have been identified on networking side. So it is good we don't have much updates as we would have ended up with tons of conflicts. I rebased my delta just now to your tree so to avoid conflicts with that stuff. This merge request is just minor fluff cleanups then. Perhaps for 6.1 kernel/sysctl.c will get more love than this release" * tag 'sysctl-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: kernel/sysctl.c: Remove trailing white space kernel/sysctl.c: Clean up indentation, replace spaces with tab. sysctl: Merge adjacent CONFIG_TREE_RCU blocks	2022-08-08 14:17:46 -07:00
Linus Torvalds	e74acdf55d	Modules updates for 6.0 For the 6.0 merge window the modules code shifts to cleanup and minor fixes effort. This is becomes much easier to do and review now due to the code split to its own directory from effort on the last kernel release. I expect to see more of this with time and as we expand on test coverage in the future. The cleanups and fixes come from usual suspects such as Christophe Leroy and Aaron Tomlin but there are also some other contributors. One particular minor fix worth mentioning is from Helge Deller, where he spotted a forever incorrect natural alignment on both ELF section header tables: * .altinstructions * __bug_table sections A lot of back and forth went on in trying to determine the ill effects of this misalignment being present for years and it has been determined there should be no real ill effects unless you have a buggy exception handler. Helge actually hit one of these buggy exception handlers on parisc which is how he ended up spotting this issue. When implemented correctly these paths with incorrect misalignment would just mean a performance penalty, but given that we are dealing with alternatives on modules and with the __bug_table (where info regardign BUG()/WARN() file/line information associated with it is stored) this really shouldn't be a big deal. The only other change with mentioning is the kmap() with kmap_local_page() and my only concern with that was on what is done after preemption, but the virtual addresses are restored after preemption. This is only used on module decompression. This all has sit on linux-next for a while except the kmap stuff which has been there for 3 weeks. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmLxL4gSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoin8AYP/iv/Oh/Zzh4UvZzkkOSzhf1qDgGhjFb0 aFIODZzpEfZ5ix5GcLapB8/QIwQgxiIRa3WkTMc0uyv+mddlbKuILFnI9A1I+TQe N4gmKeYXwWRyxLa6y7/B3lVzuLxf4DpcxfS2c3A65MkYi09XPA9oXCy7JjzsmEiZ z2Lu8lTe6hg8VarBTogHBxiEU7ybfDCnHWj7/Oe6zz8tS/R0i0ndNBu9xmaCqSh7 QC8++eqCaS+zfW0uTmnGDo1/zWLBblCZ5HAHG8bLlPHezUbekNz6G1D4CVwFyNQ8 wy1Gjy8nFWc+rwUl1CTgJ+A7wodGrMCyt5SmcNUVBOWdlSmli5vFJp61ET6UdrV+ +8owATwwIm8hbkIAI4037j7pMgrO27d130GRxFwgG9GNoqew2AM7y/9HrlmW49PE IqJA4Pm3zg26IhLIRcH7jLg3oKGuFf0nkMTDoooI5a9DlcsCXPuGd0FBw2WbR71D Px6dlVoAW0NrP2tm8YzkTKIT+aN+UId4Vdi2oFs1t8Sye/U+LCjvwrXPk13pZKdR VxfM1oVxeRwiAUq0VuIrnj7windF5Mpy2hDLHeWjzQmLcEGAtCYEGyxKTBkNTtPt gm9XBzT6Rbzi+Sc++ZoHYHe1g4T66sjYOp4N90sRRMD3FR97ZyW8eD01gwf6p1Uy aCOrA+sRHK3F =hPvl -----END PGP SIGNATURE----- Merge tag 'modules-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "For the 6.0 merge window the modules code shifts to cleanup and minor fixes effort. This becomes much easier to do and review now due to the code split to its own directory from effort on the last kernel release. I expect to see more of this with time and as we expand on test coverage in the future. The cleanups and fixes come from usual suspects such as Christophe Leroy and Aaron Tomlin but there are also some other contributors. One particular minor fix worth mentioning is from Helge Deller, where he spotted a forever incorrect natural alignment on both ELF section header tables: * .altinstructions * __bug_table sections A lot of back and forth went on in trying to determine the ill effects of this misalignment being present for years and it has been determined there should be no real ill effects unless you have a buggy exception handler. Helge actually hit one of these buggy exception handlers on parisc which is how he ended up spotting this issue. When implemented correctly these paths with incorrect misalignment would just mean a performance penalty, but given that we are dealing with alternatives on modules and with the __bug_table (where info regardign BUG()/WARN() file/line information associated with it is stored) this really shouldn't be a big deal. The only other change with mentioning is the kmap() with kmap_local_page() and my only concern with that was on what is done after preemption, but the virtual addresses are restored after preemption. This is only used on module decompression. This all has sit on linux-next for a while except the kmap stuff which has been there for 3 weeks" * tag 'modules-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: Replace kmap() with kmap_local_page() module: Show the last unloaded module's taint flag(s) module: Use strscpy() for last_unloaded_module module: Modify module_flags() to accept show_state argument module: Move module's Kconfig items in kernel/module/ MAINTAINERS: Update file list for module maintainers module: Use vzalloc() instead of vmalloc()/memset(0) modules: Ensure natural alignment for .altinstructions and __bug_table sections module: Increase readability of module_kallsyms_lookup_name() module: Fix ERRORs reported by checkpatch.pl module: Add support for default value for module async_probe	2022-08-08 14:12:19 -07:00
Fanjun Kong	374a723c74	kernel/sysctl.c: Remove trailing white space This patch removes the trailing white space in kernel/sysysctl.c Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> [mcgrof: fix commit message subject] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2022-08-08 09:01:36 -07:00
Fanjun Kong	5bfd5d3e2e	kernel/sysctl.c: Clean up indentation, replace spaces with tab. This patch fixes two coding style issues: 1. Clean up indentation, replace spaces with tab 2. Add space after ',' Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2022-08-08 09:01:36 -07:00
Geert Uytterhoeven	7251ceb51a	sysctl: Merge adjacent CONFIG_TREE_RCU blocks There are two adjacent sysctl entries protected by the same CONFIG_TREE_RCU config symbol. Merge them into a single block to improve readability. Use the more common "#ifdef" form while at it. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2022-08-08 09:01:36 -07:00
Joanne Koong	0c9a7a7e20	bpf: Verifier cleanups This patch cleans up a few things in the verifier: * type_is_pkt_pointer(): Future work (skb + xdp dynptrs [0]) will be using the reg type PTR_TO_PACKET \| PTR_MAYBE_NULL. type_is_pkt_pointer() should return true for any type whose base type is PTR_TO_PACKET, regardless of flags attached to it. * reg_type_may_be_refcounted_or_null(): Get the base type at the start of the function to avoid having to recompute it / improve readability * check_func_proto(): remove unnecessary 'meta' arg * check_helper_call(): Use switch casing on the base type of return value instead of nested ifs on the full type There are no functional behavior changes. [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/ Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20220802214638.3643235-1-joannelkoong@gmail.com	2022-08-08 17:54:06 +02:00
Stanislav Fomichev	6644aabbd8	bpf: Use proper target btf when exporting attach_btf_obj_id When attaching to program, the program itself might not be attached to anything (and, hence, might not have attach_btf), so we can't unconditionally use 'prog->aux->dst_prog->aux->attach_btf'. Instead, use bpf_prog_get_target_btf to pick proper target BTF: * when attached to dst_prog, use dst_prog->aux->btf * when attached to kernel btf, use prog->aux->attach_btf Fixes: `b79c9fc955` ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220804201140.1340684-1-sdf@google.com	2022-08-08 15:53:17 +02:00
Linus Torvalds	eb5699ba31	Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYu9BeQAKCRDdBJ7gKXxA jp1DAP4mjCSvAwYzXklrIt+Knv3CEY5oVVdS+pWOAOGiJpldTAD9E5/0NV+VmlD9 kwS/13j38guulSlXRzDLmitbg81zAAI= =Zfum -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. A relatively small amount of material this time" * tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) scripts/gdb: ensure the absolute path is generated on initial source MAINTAINERS: kunit: add David Gow as a maintainer of KUnit mailmap: add linux.dev alias for Brendan Higgins mailmap: update Kirill's email profile: setup_profiling_timer() is moslty not implemented ocfs2: fix a typo in a comment ocfs2: use the bitmap API to simplify code ocfs2: remove some useless functions lib/mpi: fix typo 'the the' in comment proc: add some (hopefully) insightful comments bdi: remove enum wb_congested_state kernel/hung_task: fix address space of proc_dohung_task_timeout_secs lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t() squashfs: support reading fragments in readahead call squashfs: implement readahead squashfs: always build "file direct" version of page actor Revert "squashfs: provide backing_dev_info in order to disable read-ahead" fs/ocfs2: Fix spelling typo in comment ia64: old_rr4 added under CONFIG_HUGETLB_PAGE proc: fix test for "vsyscall=xonly" boot option ...	2022-08-07 10:03:24 -07:00
Linus Torvalds	cac03ac368	Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9 yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru 998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx eT5m/K9a7L0= =CYl5 -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix" * tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Do not requeue task on CPU excluded from cpus_mask sched/rt: Fix Sparse warnings due to undefined rt.c declarations exit: Fix typo in comment: s/sub-theads/sub-threads sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed	2022-08-06 17:34:06 -07:00
Linus Torvalds	592d8362bc	Misc fixes to kprobes and the faddr2line script, plus a cleanup. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuu5MRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gFbA//ZppMR0/26/d+KqhdbVND6wtuTzGb5krZ m3QynlRQ+x7CZJNJeiNSTo/Dup/KwBUpJFT5sKLtpfOQILxlEt0hdYMiD+/oxgxd K0Vb0QrZhwFCju+OpcDAVlWNcQ5P8MMdoGUkOr5ekZ9FFalabW+bVUuM2Yf0Cok8 e20MGoZa2jcd+AZkp9jPUtCTURpW3Ew1WcVuJIgLH3EUMNrQNiPdia6xBzFyOPAw L0G14RDkd/POGF90dUGY1Ta4WeQCNYp2Rgu5DLo6l3eJJ/oeqoIUBUoNRT9AOJHH 0SVNHkrrNlRJe9HD/Jdc6RVBMM+FFNU4rw1uxOPU2OtG0MyMsj39Nzw+xmvB9QsG mwnMoeeDOJmFRnAyhETe4meR5mA8cPQDoNNlHL51I9JTJTUutIrfd+gQIgVgYrM2 oVfLW7Y0Eew8qYbAd2kfGnFNHDSH90RHG4beTz4zW3y4shembKhiPU7bgJ8lkke7 u4NgDOE+qTmtC1DznuV4Av8/27W6OMt/j1IWeR78IN7YBko99Ekog3zsWrAJgA/E Y08JVrUpUU47tMl4uC9Y0AUvm1Tb2ZyDqcdlEEzF9txtdNa6cAJtJkPaO6nUrr4+ qLCbhBBADP+oQNESi6vRHRmxmk5Z/m2ybfnAuYNNraWY01Imp4kNvLFvB01ARGaF Qin7dCjqz+E= =S41z -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes to kprobes and the faddr2line script, plus a cleanup" * tag 'perf-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix ';;' typo scripts/faddr2line: Add CONFIG_DEBUG_INFO check scripts/faddr2line: Fix vmlinux detection on arm64 x86/kprobes: Update kcb status flag after singlestepping kprobes: Forbid probing on trampoline and BPF code areas	2022-08-06 17:28:12 -07:00
Linus Torvalds	cae4199f93	powerpc updates for 6.0 - Add support for syscall stack randomization. - Add support for atomic operations to the 32 & 64-bit BPF JIT. - Full support for KASAN on 64-bit Book3E. - Add a watchdog driver for the new PowerVM hypervisor watchdog. - Add a number of new selftests for the Power10 PMU support. - Add a driver for the PowerVM Platform KeyStore. - Increase the NMI watchdog timeout during live partition migration, to avoid timeouts due to increased memory access latency. - Add support for using the 'linux,pci-domain' device tree property for PCI domain assignment. - Many other small features and fixes. Thanks to: Alexey Kardashevskiy, Andy Shevchenko, Arnd Bergmann, Athira Rajeev, Bagas Sanjaya, Christophe Leroy, Erhard Furtner, Fabiano Rosas, Greg Kroah-Hartman, Greg Kurz, Haowen Bai, Hari Bathini, Jason A. Donenfeld, Jason Wang, Jiang Jian, Joel Stanley, Juerg Haefliger, Kajol Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Masahiro Yamada, Maxime Bizon, Miaoqian Lin, Murilo Opsfelder Araújo, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Ning Qiang, Pali Rohár, Petr Mladek, Rashmica Gupta, Sachin Sant, Scott Cheloha, Segher Boessenkool, Stephen Rothwell, Uwe Kleine-König, Wolfram Sang, Xiu Jianfeng, Zhouyi Zhou. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmLuAPgTHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgBPpD/9kY/T0qlOXABxlZCgtqeAjPX+2xpnY BF+TlsN1TS1auFcEZL2BapmVacsvOeGEFDVuZHZvZJc69Hx+gSjnjFCnZjp6n+Yz wt6y9w9Pu0t/sjD5vNQ46O15/dXqm6RoVI7um12j/WLMN8Ko5+x3gKAyQONjQd2/ 1kPcxVH6FUosAdnCuvIcqCX4e4IIHl2ZkitHOTXoQUvUy9oAK/mOBnwqZ6zLGUKC E5M+Zyt4RFGxhPs48FkX6Nq6crDGU/P0VJpDKkR/t7GHnE67Bm70gZougAPrzrgP nx8zoTWgDKpqDeuqK7pFcyKgNS3dKbxsN3sAfKHOWu/YnV4wMyy+7fmwagMauki7 lXccKN6F/r+8JcMNx80Jp/dAw3ZdLceP38M3Ryf8IL6lTfkNySumUvrKJn6r1Cu1 wvzhgyEuDawss9KHdEmXcA2i3+XVZvitaipO7JWUC8pblrP1SJMoPfIIe9zh3y3M pyZj0TcGJ8XaK+badvI+PW/K/KeRgXEY8HpC3wDHSoIkli3OE4jDwXn6TiZgvm3n k0sKL8YSmQZ8hP8QAkR+r8NQKYqLlfyPxdslK5omDPxfub5Uzk9ZV2Ep7svkaiQn Wqjq27Dpz8+w0XPjsQ0Tkv+ByTkOhrawOH7x9SpFLHpv9g5otcYmS79NkO/htx8C 6LyPNx1VYn5IRA== =tRkm -----END PGP SIGNATURE----- Merge tag 'powerpc-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for syscall stack randomization - Add support for atomic operations to the 32 & 64-bit BPF JIT - Full support for KASAN on 64-bit Book3E - Add a watchdog driver for the new PowerVM hypervisor watchdog - Add a number of new selftests for the Power10 PMU support - Add a driver for the PowerVM Platform KeyStore - Increase the NMI watchdog timeout during live partition migration, to avoid timeouts due to increased memory access latency - Add support for using the 'linux,pci-domain' device tree property for PCI domain assignment - Many other small features and fixes Thanks to Alexey Kardashevskiy, Andy Shevchenko, Arnd Bergmann, Athira Rajeev, Bagas Sanjaya, Christophe Leroy, Erhard Furtner, Fabiano Rosas, Greg Kroah-Hartman, Greg Kurz, Haowen Bai, Hari Bathini, Jason A. Donenfeld, Jason Wang, Jiang Jian, Joel Stanley, Juerg Haefliger, Kajol Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Masahiro Yamada, Maxime Bizon, Miaoqian Lin, Murilo Opsfelder Araújo, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Ning Qiang, Pali Rohár, Petr Mladek, Rashmica Gupta, Sachin Sant, Scott Cheloha, Segher Boessenkool, Stephen Rothwell, Uwe Kleine-König, Wolfram Sang, Xiu Jianfeng, and Zhouyi Zhou. * tag 'powerpc-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (191 commits) powerpc/64e: Fix kexec build error EDAC/ppc_4xx: Include required of_irq header directly powerpc/pci: Fix PHB numbering when using opal-phbid powerpc/64: Init jump labels before parse_early_param() selftests/powerpc: Avoid GCC 12 uninitialised variable warning powerpc/cell/axon_msi: Fix refcount leak in setup_msi_msg_address powerpc/xive: Fix refcount leak in xive_get_max_prio powerpc/spufs: Fix refcount leak in spufs_init_isolated_loader powerpc/perf: Include caps feature for power10 DD1 version powerpc: add support for syscall stack randomization powerpc: Move system_call_exception() to syscall.c powerpc/powernv: rename remaining rng powernv_ functions to pnv_ powerpc/powernv/kvm: Use darn for H_RANDOM on Power9 powerpc/powernv: Avoid crashing if rng is NULL selftests/powerpc: Fix matrix multiply assist test powerpc/signal: Update comment for clarity powerpc: make facility_unavailable_exception 64s powerpc/platforms/83xx/suspend: Remove write-only global variable powerpc/platforms/83xx/suspend: Prevent unloading the driver powerpc/platforms/83xx/suspend: Reorder to get rid of a forward declaration ...	2022-08-06 16:38:17 -07:00
Linus Torvalds	c993e07be0	dma-mapping updates - convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy, Christoph Hellwig) - restructure the PCIe peer to peer mapping support (Logan Gunthorpe) - allow the IOMMU code to communicate an optional DMA mapping length and use that in scsi and libata (John Garry) - split the global swiotlb lock (Tianyu Lan) - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang, Lukas Bulwahn, Robin Murphy) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY 3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT 23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e 8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1 uek= =Bv19 -----END PGP SIGNATURE----- Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy, Christoph Hellwig) - restructure the PCIe peer to peer mapping support (Logan Gunthorpe) - allow the IOMMU code to communicate an optional DMA mapping length and use that in scsi and libata (John Garry) - split the global swiotlb lock (Tianyu Lan) - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang, Lukas Bulwahn, Robin Murphy) * tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits) swiotlb: fix passing local variable to debugfs_create_ulong() dma-mapping: reformat comment to suppress htmldoc warning PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg() RDMA/rw: drop pci_p2pdma_[un]map_sg() RDMA/core: introduce ib_dma_pci_p2p_dma_supported() nvme-pci: convert to using dma_map_sgtable() nvme-pci: check DMA ops when indicating support for PCI P2PDMA iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg iommu: Explicitly skip bus address marked segments in __iommu_map_sg() dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support dma-direct: support PCI P2PDMA pages in dma-direct map_sg dma-mapping: allow EREMOTEIO return code for P2PDMA transfers PCI/P2PDMA: Introduce helpers for dma_map_sg implementations PCI/P2PDMA: Attempt to set map_type if it has not been set lib/scatterlist: add flag for indicating P2PDMA segments in an SGL swiotlb: clean up some coding style and minor issues dma-mapping: update comment after dmabounce removal scsi: sd: Add a comment about limiting max_sectors to shost optimal limit ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit ...	2022-08-06 10:56:45 -07:00
Jiri Slaby	221f9d9cdf	posix-timers: Make do_clock_gettime() static do_clock_gettime() is used only in posix-stubs.c, so make it static. It avoids a compiler warning too: time/posix-stubs.c:73:5: warning: no previous prototype for ‘do_clock_gettime’ [-Wmissing-prototypes] Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220719085620.30567-1-jslaby@suse.cz	2022-08-06 10:33:54 +02:00
Linus Torvalds	6614a3c316	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/ SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE= =w/UH -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...	2022-08-05 16:32:45 -07:00
Jiri Olsa	62d468e5e1	bpf: Cleanup ftrace hash in bpf_trampoline_put We need to release possible hash from trampoline fops object before removing it, otherwise we leak it. Fixes: `00963a2e75` ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20220802135651.1794015-1-jolsa@kernel.org	2022-08-05 09:43:58 -07:00
Linus Torvalds	965a9d75e3	Tracing updates for 5.20 / 6.0 - Runtime verification infrastructure This is the biggest change for this pull request. It introduces the runtime verification that is necessary for running Linux on safety critical systems. It allows for deterministic automata models to be inserted into the kernel that will attach to tracepoints, where the information on these tracepoints will move the model from state to state. If a state is encountered that does not belong to the model, it will then activate a given reactor, that could just inform the user or even panic the kernel (for which safety critical systems will detect and can recover from). - Two monitor models are also added: Wakeup In Preemptive (WIP - not to be confused with "work in progress"), and Wakeup While Not Running (WWNR). - Added __vstring() helper to the TRACE_EVENT() macro to replace several vsnprintf() usages that were all doing it wrong. - eprobes now can have their event autogenerated when the event name is left off. - The rest is various cleanups and fixes. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ +y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA= =kEtM -----END PGP SIGNATURE----- Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Runtime verification infrastructure This is the biggest change here. It introduces the runtime verification that is necessary for running Linux on safety critical systems. It allows for deterministic automata models to be inserted into the kernel that will attach to tracepoints, where the information on these tracepoints will move the model from state to state. If a state is encountered that does not belong to the model, it will then activate a given reactor, that could just inform the user or even panic the kernel (for which safety critical systems will detect and can recover from). - Two monitor models are also added: Wakeup In Preemptive (WIP - not to be confused with "work in progress"), and Wakeup While Not Running (WWNR). - Added __vstring() helper to the TRACE_EVENT() macro to replace several vsnprintf() usages that were all doing it wrong. - eprobes now can have their event autogenerated when the event name is left off. - The rest is various cleanups and fixes. * tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits) rv: Unlock on error path in rv_unregister_reactor() tracing: Use alignof__(struct {type b;}) instead of offsetof() tracing/eprobe: Show syntax error logs in error_log file scripts/tracing: Fix typo 'the the' in comment tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT tracing: Use free_trace_buffer() in allocate_trace_buffers() tracing: Use a struct alignof to determine trace event field alignment rv/reactor: Add the panic reactor rv/reactor: Add the printk reactor rv/monitor: Add the wwnr monitor rv/monitor: Add the wip monitor rv/monitor: Add the wip monitor skeleton created by dot2k Documentation/rv: Add deterministic automata instrumentation documentation Documentation/rv: Add deterministic automata monitor synthesis documentation tools/rv: Add dot2k Documentation/rv: Add deterministic automaton documentation tools/rv: Add dot2c Documentation/rv: Add a basic documentation rv/include: Add instrumentation helper functions rv/include: Add deterministic automata monitor definition via C macros ...	2022-08-05 09:41:12 -07:00
Dan Carpenter	f1a15b977f	rv: Unlock on error path in rv_unregister_reactor() Unlock the "rv_interface_lock" mutex before returning. Link: https://lkml.kernel.org/r/YuvYzNfGMgV+PIhd@kili Fixes: `04acadcb44` ("rv: Add runtime reactors interface") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-04 22:49:17 -04:00
Linus Torvalds	7447691ef9	xen: branch for v6.0-rc1 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYuooOQAKCRCAXGG7T9hj vmmlAPoCfYBh4jKwRnvGvyn+sPQed/r0TH0wnsGK1ccONhyIvAD+IZcSTPsnp4Cj m1URGGff2PvAyjOIAzQZbKZomtfICwM= =z2e5 -----END PGP SIGNATURE----- Merge tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a series fine tuning virtio support for Xen guests, including removal the now again unused "platform_has()" feature. - a fix for host admin triggered reboot of Xen guests - a simple spelling fix * tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: don't require virtio with grants for non-PV guests kernel: remove platform_has() infrastructure virtio: replace restricted mem access flag with callback xen: Fix spelling mistake xen/manage: Use orderly_reboot() to reboot	2022-08-04 15:10:55 -07:00
Linus Torvalds	228dfe98a3	Char / Misc driver changes for 6.0-rc1 Here is the large set of char and misc and other driver subsystem changes for 6.0-rc1. Highlights include: - large set of IIO driver updates, additions, and cleanups - new habanalabs device support added (loads of register maps much like GPUs have) - soundwire driver updates - phy driver updates - slimbus driver updates - tiny virt driver fixes and updates - misc driver fixes and updates - interconnect driver updates - hwtracing driver updates - fpga driver updates - extcon driver updates - firmware driver updates - counter driver update - mhi driver fixes and updates - binder driver fixes and updates - speakup driver fixes Full details are in the long shortlog contents. All of these have been in linux-next for a while without any reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYup9QQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylBKQCfaSuzl9ZP9dTvAw2FPp14oRqXnpoAnicvWAoq 1vU9Vtq2c73uBVLdZm4m =AwP3 -----END PGP SIGNATURE----- Merge tag 'char-misc-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc driver updates from Greg KH: "Here is the large set of char and misc and other driver subsystem changes for 6.0-rc1. Highlights include: - large set of IIO driver updates, additions, and cleanups - new habanalabs device support added (loads of register maps much like GPUs have) - soundwire driver updates - phy driver updates - slimbus driver updates - tiny virt driver fixes and updates - misc driver fixes and updates - interconnect driver updates - hwtracing driver updates - fpga driver updates - extcon driver updates - firmware driver updates - counter driver update - mhi driver fixes and updates - binder driver fixes and updates - speakup driver fixes All of these have been in linux-next for a while without any reported problems" * tag 'char-misc-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (634 commits) drivers: lkdtm: fix clang -Wformat warning char: remove VR41XX related char driver misc: Mark MICROCODE_MINOR unused spmi: trace: fix stack-out-of-bound access in SPMI tracing functions dt-bindings: iio: adc: Add compatible for MT8188 iio: light: isl29028: Fix the warning in isl29028_remove() iio: accel: sca3300: Extend the trigger buffer from 16 to 32 bytes iio: fix iio_format_avail_range() printing for none IIO_VAL_INT iio: adc: max1027: unlock on error path in max1027_read_single_value() iio: proximity: sx9324: add empty line in front of bullet list iio: magnetometer: hmc5843: Remove duplicate 'the' iio: magn: yas530: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros iio: magnetometer: ak8974: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros iio: light: veml6030: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros iio: light: vcnl4035: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros iio: light: vcnl4000: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros iio: light: tsl2591: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() iio: light: tsl2583: Use DEFINE_RUNTIME_DEV_PM_OPS and pm_ptr() iio: light: isl29028: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() iio: light: gp2ap002: Switch to DEFINE_RUNTIME_DEV_PM_OPS and pm_ptr() ...	2022-08-04 11:05:48 -07:00
Peilin Ye	f482aa9865	audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker() Currently @audit_context is allocated twice for io_uring workers: 1. copy_process() calls audit_alloc(); 2. io_sq_thread() or io_wqe_worker() calls audit_alloc_kernel() (which is effectively audit_alloc()) and overwrites @audit_context, causing: BUG: memory leak unreferenced object 0xffff888144547400 (size 1024): <...> hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8135cfc3>] audit_alloc+0x133/0x210 [<ffffffff81239e63>] copy_process+0xcd3/0x2340 [<ffffffff8123b5f3>] create_io_thread+0x63/0x90 [<ffffffff81686604>] create_io_worker+0xb4/0x230 [<ffffffff81686f68>] io_wqe_enqueue+0x248/0x3b0 [<ffffffff8167663a>] io_queue_iowq+0xba/0x200 [<ffffffff816768b3>] io_queue_async+0x113/0x180 [<ffffffff816840df>] io_req_task_submit+0x18f/0x1a0 [<ffffffff816841cd>] io_apoll_task_func+0xdd/0x120 [<ffffffff8167d49f>] tctx_task_work+0x11f/0x570 [<ffffffff81272c4e>] task_work_run+0x7e/0xc0 [<ffffffff8125a688>] get_signal+0xc18/0xf10 [<ffffffff8111645b>] arch_do_signal_or_restart+0x2b/0x730 [<ffffffff812ea44e>] exit_to_user_mode_prepare+0x5e/0x180 [<ffffffff844ae1b2>] syscall_exit_to_user_mode+0x12/0x20 [<ffffffff844a7e80>] do_syscall_64+0x40/0x80 Then, 3. io_sq_thread() or io_wqe_worker() frees @audit_context using audit_free(); 4. do_exit() eventually calls audit_free() again, which is okay because audit_free() does a NULL check. As suggested by Paul Moore, fix it by deleting audit_alloc_kernel() and redundant audit_free() calls. Fixes: `5bd2182d58` ("audit,io_uring,io-wq: add some basic audit support to io_uring") Suggested-by: Paul Moore <paul@paul-moore.com> Cc: stable@vger.kernel.org Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20220803222343.31673-1-yepeilin.cs@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-04 08:33:54 -06:00
Mel Gorman	751d4cbc43	sched/core: Do not requeue task on CPU excluded from cpus_mask The following warning was triggered on a large machine early in boot on a distribution kernel but the same problem should also affect mainline. WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440 Call Trace: <TASK> rescuer_thread+0x1f6/0x360 kthread+0x156/0x180 ret_from_fork+0x22/0x30 </TASK> Commit `c6e7bd7afa` ("sched/core: Optimize ttwu() spinning on p->on_cpu") optimises ttwu by queueing a task that is descheduling on the wakelist, but does not check if the task descheduling is still allowed to run on that CPU. In this warning, the problematic task is a workqueue rescue thread which checks if the rescue is for a per-cpu workqueue and running on the wrong CPU. While this is early in boot and it should be possible to create workers, the rescue thread may still used if the MAYDAY_INITIAL_TIMEOUT is reached or MAYDAY_INTERVAL and on a sufficiently large machine, the rescue thread is being used frequently. Tracing confirmed that the task should have migrated properly using the stopper thread to handle the migration. However, a parallel wakeup from udev running on another CPU that does not share CPU cache observes p->on_cpu and uses task_cpu(p), queues the task on the old CPU and triggers the warning. Check that the wakee task that is descheduling is still allowed to run on its current CPU and if not, wait for the descheduling to complete and select an allowed CPU. Fixes: `c6e7bd7afa` ("sched/core: Optimize ttwu() spinning on p->on_cpu") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net	2022-08-04 11:26:13 +02:00
Andi Kleen	9aeaf5bc4e	locking/spinlocks: Mark spinlocks noinline when inline spinlocks are disabled Otherwise LTO will inline them anyways and cause a large kernel text increase. Since the explicit intention here is to not inline them marking them noinline is good documentation even for the non-LTO case. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Martin Liska <mliska@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220719110548.1544-1-jslaby@suse.cz	2022-08-04 11:05:43 +02:00
Xin Gao	8648f92a66	sched/core: Remove superfluous semicolon Signed-off-by: Xin Gao <gaoxin@cdjrlc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220719111044.7095-1-gaoxin@cdjrlc.com	2022-08-04 11:02:08 +02:00
Slark Xiao	99643bab36	perf/core: Fix ';;' typo Remove double ';;'. Signed-off-by: Slark Xiao <slark_xiao@163.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220720091220.14200-1-slark_xiao@163.com	2022-08-04 11:01:30 +02:00
Linus Torvalds	b44f2fd879	drm for 5.20/6.0 New driver: - logicvc vfio: - use aperture API core: - of: Add data-lane helpers and convert drivers - connector: Remove deprecated ida_simple_get() media: - Add various RGB666 and RGB888 format constants panel: - Add HannStar HSD101PWW - Add ETML0700Y5DHA dma-buf: - add sync-file API - set dma mask for udmabuf devices fbcon: - Improve scrolling performance - Sanitize input fbdev: - device unregistering fixes - vesa: Support COMPILE_TEST - Disable firmware-device registration when first native driver loads aperture: - fix segfault during hot-unplug - export for use with other subsystems client: - use driver validated modes dp: - aux: make probing more reliable - mst: Read extended DPCD capabilities during system resume - Support waiting for HDP signal - Port-validation fixes edid: - CEA data-block iterators - struct drm_edid introduction - implement HF-EEODB extension gem: - don't use fb format non-existing planes probe-helper: - use 640x480 as displayport fallback scheduler: - don't kill jobs in interrupt context bridge: - Add support for i.MX8qxp and i.MX8qm - lots of fixes/cleanups - Add TI-DLPC3433 - fy07024di26a30d: Optional GPIO reset - ldb: Add reg and reg-name properties to bindings, Kconfig fixes - lt9611: Fix display sensing; - tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling - tc358775: Fix clock settings - ti-sn65dsi83: Allow GPIO to sleep - adv7511: I2C fixes - anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback - fsl-ldb: Drop DE flip - ti-sn65dsi86: Convert to atomic modesetting amdgpu: - use atomic fence helpers in DM - fix VRAM address calculations - export CRTC bpc via debugfs - Initial devcoredump support - Enable high priority gfx queue on asics which support it - Adjust GART size on newer APUs for S/G display - Soft reset for GFX 11 / SDMA 6 - Add gfxoff status query for vangogh - Fix timestamps for cursor only commits - Adjust GART size on newer APUs for S/G display - fix buddy memory corruption amdkfd: - MMU notifier fixes - P2P DMA support using dma-buf - Add available memory IOCTL - HMM profiler support - Simplify GPUVM validation - Unified memory for CWSR save/restore area i915: - General driver clean-up - DG2 enabling (still under force probe) - DG2 small BAR memory support - HuC loading support - DG2 workarounds - DG2/ATS-M device IDs added - Ponte Vecchio prep work and new blitter engines - add Meteorlake support - Fix sparse warnings - DMC MMIO range checks - Audio related fixes - Runtime PM fixes - PSR fixes - Media freq factor and per-gt enhancements - DSI fixes for ICL+ - Disable DMC flip queue handlers - ADL_P voltage swing updates - Use more the VBT for panel information - Fix on Type-C ports with TBT mode - Improve fastset and allow seamless M/N changes - Accept more fixed modes with VRR/DMRRS panels - Disable connector polling for a headless SKU - ADL-S display PLL w/a - Enable THP on Icelake and beyond - Fix i915_gem_object_ggtt_pin_ww regression on old platforms - Expose per tile media freq factor in sysfs - Fix dma_resv fence handling in multi-batch execbuf - Improve on suspend / resume time with VT-d enabled - export CRTC bpc settings via debugfs msm: - gpu: a619 support - gpu: Fix for unclocked GMU register access - gpu: Devcore dump enhancements - client utilization via fdinfo support - fix fence rollover issue - gem: Lockdep false-positive warning fix - gem: Switch to pfn mappings - WB support on sc7180 - dp: dropped custom bulk clock implementation - fix link retraining on resolution change - hdmi: dropped obsolete GPIO support tegra: - context isolation for host1x engines - tegra234 soc support mediatek: - add vdosys0/1 for mt8195 - add MT8195 dp_intf driver exynos: - Fix resume function issue of exynos decon driver by calling clk_disable_unprepare() properly if clk_prepare_enable() failed. nouveau: - set of misc fixes/cleanups - display cleanups gma500: - Cleanup connector I2C handling hyperv: - Unify VRAM allocation of Gen1 and Gen2 meson: - Support YUV422 output; Refcount fixes mgag200: - Support damage clipping - Support gamma handling - Protect concurrent HW access - Fixes to connector - Store model-specific limits in device-info structure - fix PCI register init panfrost: - Valhall support r128: - Fix bit-shift overflow rockchip: - Locking fixes in error path ssd130x: - Fix built-in linkage udl: - Always advertize VGA connector ast: - Support multiple outputs - fix black screen on resume sun4i: - HDMI PHY cleanups vc4: - Add support for BCM2711 vkms: - Allocate output buffer with vmalloc() mcde: - Fix ref-count leak mxsfb/lcdif: - Support i.MX8MP LCD controller stm/ltdc: - Support dynamic Z order - Support mirroring ingenic: - Fix display at maximum resolution -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmLp/7YACgkQDHTzWXnE hr7NjhAAnefa+72EG42OAqajbwTQMENOtFfqyL3k6ueK2ciYbsj/wklw/xc4Ok3o DM5kG54t+nA9L1M7UyE7eaO36/XcuvS8Ea0uKKkamWt+3Ux4g1Vo1J37nP5sK5jI GT/wceKA5sk3nuYly2lBby6mVTGuhAX+3edTAFeOwmd0WvQzzpy4vV+nCAgfshUs ql4gfQPdQdP+wiovUzCIEu6exCSCAI/Oc944fd3AJi5bZbOPFXRS4rMMOLSrdoXV 9P44EZExPbYrDuVUCx/UaZtN8D9myyyBfZe62CtdgNyTYUHXnHCBYue+7D/s5O+y GaLWcP128MsqZNmJNhmcWFIlgqowO24YkKUH68JH0UtBLSWich8rfdEsrxIidYED 0ma1jodRapjyZOjrHEJ3N5deKpoflMmqvCMpvIk1Ev6pT8KX9a6u34kLgsOVCV41 2bDEYD+DbRW2FexGR79yB2huXHGSnguco6069ca1oy9RF4q8cX6Pb1w2u42oS7zX lIgLIashilVR2AYg/qi6IPHavmOQ9ItSXPC+4YasYiMGp/mwePqpmL63b/wkhg0D nXn6/F8Bm6wle2FFbkLGwo1fF1Hn7RzTHSlqRWDKSEaMLhCus6M09VsobFCB19i0 lO4FNVTL8ZtryR94bgVmgi616w9hOhDhM9A+C0kJ9KBkDnDYUJU= =HQ9U -----END PGP SIGNATURE----- Merge tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm Pull drm updates from Dave Airlie: "Highlights: - New driver for logicvc - which is a display IP core. - EDID parser rework to add new extensions - fbcon scrolling improvements - i915 has some more DG2 work but not enabled by default, but should have enough features for userspace to work now. Otherwise it's lots of work all over the place. Detailed summary: New driver: - logicvc vfio: - use aperture API core: - of: Add data-lane helpers and convert drivers - connector: Remove deprecated ida_simple_get() media: - Add various RGB666 and RGB888 format constants panel: - Add HannStar HSD101PWW - Add ETML0700Y5DHA dma-buf: - add sync-file API - set dma mask for udmabuf devices fbcon: - Improve scrolling performance - Sanitize input fbdev: - device unregistering fixes - vesa: Support COMPILE_TEST - Disable firmware-device registration when first native driver loads aperture: - fix segfault during hot-unplug - export for use with other subsystems client: - use driver validated modes dp: - aux: make probing more reliable - mst: Read extended DPCD capabilities during system resume - Support waiting for HDP signal - Port-validation fixes edid: - CEA data-block iterators - struct drm_edid introduction - implement HF-EEODB extension gem: - don't use fb format non-existing planes probe-helper: - use 640x480 as displayport fallback scheduler: - don't kill jobs in interrupt context bridge: - Add support for i.MX8qxp and i.MX8qm - lots of fixes/cleanups - Add TI-DLPC3433 - fy07024di26a30d: Optional GPIO reset - ldb: Add reg and reg-name properties to bindings, Kconfig fixes - lt9611: Fix display sensing; - tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling - tc358775: Fix clock settings - ti-sn65dsi83: Allow GPIO to sleep - adv7511: I2C fixes - anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback - fsl-ldb: Drop DE flip - ti-sn65dsi86: Convert to atomic modesetting amdgpu: - use atomic fence helpers in DM - fix VRAM address calculations - export CRTC bpc via debugfs - Initial devcoredump support - Enable high priority gfx queue on asics which support it - Adjust GART size on newer APUs for S/G display - Soft reset for GFX 11 / SDMA 6 - Add gfxoff status query for vangogh - Fix timestamps for cursor only commits - Adjust GART size on newer APUs for S/G display - fix buddy memory corruption amdkfd: - MMU notifier fixes - P2P DMA support using dma-buf - Add available memory IOCTL - HMM profiler support - Simplify GPUVM validation - Unified memory for CWSR save/restore area i915: - General driver clean-up - DG2 enabling (still under force probe) - DG2 small BAR memory support - HuC loading support - DG2 workarounds - DG2/ATS-M device IDs added - Ponte Vecchio prep work and new blitter engines - add Meteorlake support - Fix sparse warnings - DMC MMIO range checks - Audio related fixes - Runtime PM fixes - PSR fixes - Media freq factor and per-gt enhancements - DSI fixes for ICL+ - Disable DMC flip queue handlers - ADL_P voltage swing updates - Use more the VBT for panel information - Fix on Type-C ports with TBT mode - Improve fastset and allow seamless M/N changes - Accept more fixed modes with VRR/DMRRS panels - Disable connector polling for a headless SKU - ADL-S display PLL w/a - Enable THP on Icelake and beyond - Fix i915_gem_object_ggtt_pin_ww regression on old platforms - Expose per tile media freq factor in sysfs - Fix dma_resv fence handling in multi-batch execbuf - Improve on suspend / resume time with VT-d enabled - export CRTC bpc settings via debugfs msm: - gpu: a619 support - gpu: Fix for unclocked GMU register access - gpu: Devcore dump enhancements - client utilization via fdinfo support - fix fence rollover issue - gem: Lockdep false-positive warning fix - gem: Switch to pfn mappings - WB support on sc7180 - dp: dropped custom bulk clock implementation - fix link retraining on resolution change - hdmi: dropped obsolete GPIO support tegra: - context isolation for host1x engines - tegra234 soc support mediatek: - add vdosys0/1 for mt8195 - add MT8195 dp_intf driver exynos: - Fix resume function issue of exynos decon driver by calling clk_disable_unprepare() properly if clk_prepare_enable() failed. nouveau: - set of misc fixes/cleanups - display cleanups gma500: - Cleanup connector I2C handling hyperv: - Unify VRAM allocation of Gen1 and Gen2 meson: - Support YUV422 output; Refcount fixes mgag200: - Support damage clipping - Support gamma handling - Protect concurrent HW access - Fixes to connector - Store model-specific limits in device-info structure - fix PCI register init panfrost: - Valhall support r128: - Fix bit-shift overflow rockchip: - Locking fixes in error path ssd130x: - Fix built-in linkage udl: - Always advertize VGA connector ast: - Support multiple outputs - fix black screen on resume sun4i: - HDMI PHY cleanups vc4: - Add support for BCM2711 vkms: - Allocate output buffer with vmalloc() mcde: - Fix ref-count leak mxsfb/lcdif: - Support i.MX8MP LCD controller stm/ltdc: - Support dynamic Z order - Support mirroring ingenic: - Fix display at maximum resolution" * tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm: (1480 commits) drm/amd/display: Fix a compilation failure on PowerPC caused by FPU code drm/amdgpu: enable support for psp 13.0.4 block drm/amdgpu: add files for PSP 13.0.4 drm/amdgpu: add header files for MP 13.0.4 drm/amdgpu: correct RLC_RLCS_BOOTLOAD_STATUS offset and index drm/amdgpu: send msg to IMU for the front-door loading drm/amdkfd: use time_is_before_jiffies(a + b) to replace "jiffies - a > b" drm/amdgpu: fix hive reference leak when reflecting psp topology info drm/amd/pm: enable GFX ULV feature support for SMU13.0.0 drm/amd/pm: update driver if header for SMU 13.0.0 drm/amdgpu: move mes self test after drm sched re-started drm/amdgpu: drop non-necessary call trace dump drm/amdgpu: enable VCN cg and JPEG cg/pg drm/amdgpu: vcn_4_0_2 video codec query drm/amdgpu: add VCN_4_0_2 firmware support drm/amdgpu: add VCN function in NBIO v7.7 drm/amdgpu: fix a vcn4 boot poll bug in emulation mode drm/amd/amdgpu: add memory training support for PSP_V13 drm/amdkfd: remove an unnecessary amdgpu_bo_ref drm/amd/pm: Add get_gfx_off_status interface for yellow carp ...	2022-08-03 19:52:08 -07:00
Linus Torvalds	f86d1fbbe7	Networking changes for 6.0. Core ---- - Refactor the forward memory allocation to better cope with memory pressure with many open sockets, moving from a per socket cache to a per-CPU one - Replace rwlocks with RCU for better fairness in ping, raw sockets and IP multicast router. - Network-side support for IO uring zero-copy send. - A few skb drop reason improvements, including codegen the source file with string mapping instead of using macro magic. - Rename reference tracking helpers to a more consistent netdev_* schema. - Adapt u64_stats_t type to address load/store tearing issues. - Refine debug helper usage to reduce the log noise caused by bots. BPF --- - Improve socket map performance, avoiding skb cloning on read operation. - Add support for 64 bits enum, to match types exposed by kernel. - Introduce support for sleepable uprobes program. - Introduce support for enum textual representation in libbpf. - New helpers to implement synproxy with eBPF/XDP. - Improve loop performances, inlining indirect calls when possible. - Removed all the deprecated libbpf APIs. - Implement new eBPF-based LSM flavor. - Add type match support, which allow accurate queries to the eBPF used types. - A few TCP congetsion control framework usability improvements. - Add new infrastructure to manipulate CT entries via eBPF programs. - Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function. Protocols --------- - Introduce per network namespace lookup tables for unix sockets, increasing scalability and reducing contention. - Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support. - Add support to forciby close TIME_WAIT TCP sockets via user-space tools. - Significant performance improvement for the TLS 1.3 receive path, both for zero-copy and not-zero-copy. - Support for changing the initial MTPCP subflow priority/backup status - Introduce virtually contingus buffers for sockets over RDMA, to cope better with memory pressure. - Extend CAN ethtool support with timestamping capabilities - Refactor CAN build infrastructure to allow building only the needed features. Driver API ---------- - Remove devlink mutex to allow parallel commands on multiple links. - Add support for pause stats in distributed switch. - Implement devlink helpers to query and flash line cards. - New helper for phy mode to register conversion. New hardware / drivers ---------------------- - Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro. - Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch. - Ethernet DSA driver for the Microchip LAN937x switch. - Ethernet PHY driver for the Aquantia AQR113C EPHY. - CAN driver for the OBD-II ELM327 interface. - CAN driver for RZ/N1 SJA1000 CAN controller. - Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device. Drivers ------- - Intel Ethernet NICs: - i40e: add support for vlan pruning - i40e: add support for XDP framented packets - ice: improved vlan offload support - ice: add support for PPPoE offload - Mellanox Ethernet (mlx5) - refactor packet steering offload for performance and scalability - extend support for TC offload - refactor devlink code to clean-up the locking schema - support stacked vlans for bridge offloads - use TLS objects pool to improve connection rate - Netronome Ethernet NICs (nfp): - extend support for IPv6 fields mangling offload - add support for vepa mode in HW bridge - better support for virtio data path acceleration (VDPA) - enable TSO by default - Microsoft vNIC driver (mana) - add support for XDP redirect - Others Ethernet drivers: - bonding: add per-port priority support - microchip lan743x: extend phy support - Fungible funeth: support UDP segmentation offload and XDP xmit - Solarflare EF100: add support for virtual function representors - MediaTek SoC: add XDP support - Mellanox Ethernet/IB switch (mlxsw): - dropped support for unreleased H/W (XM router). - improved stats accuracy - unified bridge model coversion improving scalability (parts 1-6) - support for PTP in Spectrum-2 asics - Broadcom PHYs - add PTP support for BCM54210E - add support for the BCM53128 internal PHY - Marvell Ethernet switches (prestera): - implement support for multicast forwarding offload - Embedded Ethernet switches: - refactor OcteonTx MAC filter for better scalability - improve TC H/W offload for the Felix driver - refactor the Microchip ksz8 and ksz9477 drivers to share the probe code (parts 1, 2), add support for phylink mac configuration - Other WiFi: - Microchip wilc1000: diable WEP support and enable WPA3 - Atheros ath10k: encapsulation offload support Old code removal: - Neterion vxge ethernet driver: this is untouched since more than 10 years. Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmLqN+oSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkB9kQAI9VqW0c3SfiTJnkVBEIovZ6Tnh5stD2 UYFkh1BdchLsYxi7W4XMpVPSzRztiTP87mIx5c/KvIzj+QNeWL1XWRJSPdI9HhTD pTAA/tM2OG7bqrbyQiKDNfpQdNl7+kk1RwnYd+f9RFl1QVuIJaYhmjVwrsN5xF/+ jUsotpROarM2dGFWiFwJbKhP2zMDT+6qEEahM8pEPggKhv8wRLYjany2cZVEe4e0 WGUpbINAS8gEKm0Ob922WaDfDrcK/N1Z0jNz/kMaENkK18Vvc7F6bCO0DzAawKX9 QZMMwm6mHp3EThflJAMAzCGIYiIcwLhykgdyj8rrjPhFrWbMD2Sdsbo21HOXU/8j u4aAhVl+d+h7emmbgBoJ8sycVJ7BQlXz7lX20sTgADv9xI4/dPhQ17CMRuwX6fXX JSrn6P6e1LTV5CEg6vrlSPnKPY6uhFn/cPw47FxCjRwJ9phVnp+8uZWQmf9Pz3yf Ok/tcj+juFbsmuOshHy2cbRkuNZNS0oRWlSTBo5795ZwOLSakMonR3L+ev2aOvzz DVrFp2Y/iIVwMSFdCbouYdYnhArPRhOAtCmZc2afY8aBN7aaMgrdTy3+mzUoHy3I FG3K+VuKpfi0vY4zn6ZoLZDIpyXIoJJ93RcSGltD32t3Dp1RaQMVEI4s45k05PVm 1nYpXKHA8qML =hxEG -----END PGP SIGNATURE----- Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Paolo Abeni: "Core: - Refactor the forward memory allocation to better cope with memory pressure with many open sockets, moving from a per socket cache to a per-CPU one - Replace rwlocks with RCU for better fairness in ping, raw sockets and IP multicast router. - Network-side support for IO uring zero-copy send. - A few skb drop reason improvements, including codegen the source file with string mapping instead of using macro magic. - Rename reference tracking helpers to a more consistent netdev_* schema. - Adapt u64_stats_t type to address load/store tearing issues. - Refine debug helper usage to reduce the log noise caused by bots. BPF: - Improve socket map performance, avoiding skb cloning on read operation. - Add support for 64 bits enum, to match types exposed by kernel. - Introduce support for sleepable uprobes program. - Introduce support for enum textual representation in libbpf. - New helpers to implement synproxy with eBPF/XDP. - Improve loop performances, inlining indirect calls when possible. - Removed all the deprecated libbpf APIs. - Implement new eBPF-based LSM flavor. - Add type match support, which allow accurate queries to the eBPF used types. - A few TCP congetsion control framework usability improvements. - Add new infrastructure to manipulate CT entries via eBPF programs. - Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function. Protocols: - Introduce per network namespace lookup tables for unix sockets, increasing scalability and reducing contention. - Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support. - Add support to forciby close TIME_WAIT TCP sockets via user-space tools. - Significant performance improvement for the TLS 1.3 receive path, both for zero-copy and not-zero-copy. - Support for changing the initial MTPCP subflow priority/backup status - Introduce virtually contingus buffers for sockets over RDMA, to cope better with memory pressure. - Extend CAN ethtool support with timestamping capabilities - Refactor CAN build infrastructure to allow building only the needed features. Driver API: - Remove devlink mutex to allow parallel commands on multiple links. - Add support for pause stats in distributed switch. - Implement devlink helpers to query and flash line cards. - New helper for phy mode to register conversion. New hardware / drivers: - Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro. - Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch. - Ethernet DSA driver for the Microchip LAN937x switch. - Ethernet PHY driver for the Aquantia AQR113C EPHY. - CAN driver for the OBD-II ELM327 interface. - CAN driver for RZ/N1 SJA1000 CAN controller. - Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device. Drivers: - Intel Ethernet NICs: - i40e: add support for vlan pruning - i40e: add support for XDP framented packets - ice: improved vlan offload support - ice: add support for PPPoE offload - Mellanox Ethernet (mlx5) - refactor packet steering offload for performance and scalability - extend support for TC offload - refactor devlink code to clean-up the locking schema - support stacked vlans for bridge offloads - use TLS objects pool to improve connection rate - Netronome Ethernet NICs (nfp): - extend support for IPv6 fields mangling offload - add support for vepa mode in HW bridge - better support for virtio data path acceleration (VDPA) - enable TSO by default - Microsoft vNIC driver (mana) - add support for XDP redirect - Others Ethernet drivers: - bonding: add per-port priority support - microchip lan743x: extend phy support - Fungible funeth: support UDP segmentation offload and XDP xmit - Solarflare EF100: add support for virtual function representors - MediaTek SoC: add XDP support - Mellanox Ethernet/IB switch (mlxsw): - dropped support for unreleased H/W (XM router). - improved stats accuracy - unified bridge model coversion improving scalability (parts 1-6) - support for PTP in Spectrum-2 asics - Broadcom PHYs - add PTP support for BCM54210E - add support for the BCM53128 internal PHY - Marvell Ethernet switches (prestera): - implement support for multicast forwarding offload - Embedded Ethernet switches: - refactor OcteonTx MAC filter for better scalability - improve TC H/W offload for the Felix driver - refactor the Microchip ksz8 and ksz9477 drivers to share the probe code (parts 1, 2), add support for phylink mac configuration - Other WiFi: - Microchip wilc1000: diable WEP support and enable WPA3 - Atheros ath10k: encapsulation offload support Old code removal: - Neterion vxge ethernet driver: this is untouched since more than 10 years" * tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits) doc: sfp-phylink: Fix a broken reference wireguard: selftests: support UML wireguard: allowedips: don't corrupt stack when detecting overflow wireguard: selftests: update config fragments wireguard: ratelimiter: use hrtimer in selftest net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ net: usb: ax88179_178a: Bind only to vendor-specific interface selftests: net: fix IOAM test skip return code net: usb: make USB_RTL8153_ECM non user configurable net: marvell: prestera: remove reduntant code octeontx2-pf: Reduce minimum mtu size to 60 net: devlink: Fix missing mutex_unlock() call net/tls: Remove redundant workqueue flush before destroy net: txgbe: Fix an error handling path in txgbe_probe() net: dsa: Fix spelling mistakes and cleanup code Documentation: devlink: add add devlink-selftests to the table of contents dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock net: ionic: fix error check for vlan flags in ionic_set_nic_features() net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr() nfp: flower: add support for tunnel offload without key ID ...	2022-08-03 16:29:08 -07:00
Linus Torvalds	a782e86649	Saner handling of "lseek should fail with ESPIPE" - gets rid of magical no_llseek thing and makes checks consistent. In particular, ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug2xgAKCRBZ7Krx/gZQ 6wxWAQDqeg+xMq2FGPXmgjCa+Cp3PXH96Lp6f3hHzakIDx+t8gEAxvuiXAD22Mct 6S1SKuGj0iDIuM4L7hUiWTiY/bDXSAc= =3EC/ -----END PGP SIGNATURE----- Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs lseek updates from Al Viro: "Jason's lseek series. Saner handling of 'lseek should fail with ESPIPE' - this gets rid of the magical no_llseek thing and makes checks consistent. In particular, the ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT)" * tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove no_llseek fs: check FMODE_LSEEK to control internal pipe splicing vfio: do not set FMODE_LSEEK flag dma-buf: remove useless FMODE_LSEEK flag fs: do not compare against ->llseek fs: clear or set FMODE_LSEEK based on llseek function	2022-08-03 11:35:20 -07:00
Bing Huang	18c31c9711	sched/fair: Make per-cpu cpumasks static The load_balance_mask and select_rq_mask percpu variables are only used in kernel/sched/fair.c. Make them static and move their allocation into init_sched_fair_class(). Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL class (local_cpu_mask_dl in init_sched_dl_class()). [ mingo: Tidied up changelog & touched up the code. ] Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com	2022-08-03 19:17:33 +02:00
Hao Jia	d985ee9f44	sched/fair: Remove unused parameter idle of _nohz_idle_balance() After commit `7a82e5f52a` ("sched/fair: Merge for each idle cpu loop of ILB"), _nohz_idle_balance()'s 'idle' parameter is not used anymore, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220803130223.70419-1-jiahao.os@bytedance.com	2022-08-03 18:54:26 +02:00
Linus Torvalds	b6bb70f9ab	Several core optimizations: * threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). * threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. * psi no longer allocates memory when disabled. along with some code cleanups. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo= =sucy -----END PGP SIGNATURE----- Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several core optimizations: - threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). - threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. - psi no longer allocates memory when disabled. ... along with some code cleanups" * tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Skip subtree root in cgroup_update_dfl_csses() cgroup: remove "no" prefixed mount options cgroup: Make !percpu threadgroup_rwsem operations optional cgroup: Add "no" prefixed mount options cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes psi: dont alloc memory for psi by default	2022-08-03 09:45:08 -07:00
Ben Dooks	87514b2c24	sched/rt: Fix Sparse warnings due to undefined rt.c declarations There are several symbols defined in kernel/sched/sched.h but get wrapped in CONFIG_CGROUP_SCHED, even though dummy versions get built in rt.c and therefore trigger Sparse warnings: kernel/sched/rt.c:309:6: warning: symbol 'unregister_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:311:6: warning: symbol 'free_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:313:5: warning: symbol 'alloc_rt_sched_group' was not declared. Should it be static? Fix this by moving them outside the CONFIG_CGROUP_SCHED block. [ mingo: Refreshed to the latest scheduler tree, tweaked changelog. ] Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220721145155.358366-1-ben-linux@fluff.org	2022-08-03 11:22:37 +02:00
Ingo Molnar	dcca34754a	exit: Fix typo in comment: s/sub-theads/sub-threads Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2022-08-03 10:44:54 +02:00
Waiman Long	b6e8d40d43	sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating that the cpuset will just use the effective CPUs of its parent. So cpuset_can_attach() can call task_can_attach() with an empty mask. This can lead to cpumask_any_and() returns nr_cpu_ids causing the call to dl_bw_of() to crash due to percpu value access of an out of bound CPU value. For example: [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0 : [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0 : [80468.207946] Call Trace: [80468.208947] cpuset_can_attach+0xa0/0x140 [80468.209953] cgroup_migrate_execute+0x8c/0x490 [80468.210931] cgroup_update_dfl_csses+0x254/0x270 [80468.211898] cgroup_subtree_control_write+0x322/0x400 [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0 [80468.213777] new_sync_write+0x11f/0x1b0 [80468.214689] vfs_write+0x1eb/0x280 [80468.215592] ksys_write+0x5f/0xe0 [80468.216463] do_syscall_64+0x5c/0x80 [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix that by using effective_cpus instead. For cgroup v1, effective_cpus is the same as cpus_allowed. For v2, effective_cpus is the real cpumask to be used by tasks within the cpuset anyway. Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to reflect the change. In addition, a check is added to task_can_attach() to guard against the possibility that cpumask_any_and() may return a value >= nr_cpu_ids. Fixes: `7f51412a41` ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com	2022-08-03 10:34:26 +02:00
Linus Torvalds	665fe72a7d	linux-kselftest-kunit-5.20-rc1 This KUnit update for Linux 5.20-rc1 consists of several fixes and an important feature to discourage running KUnit tests on production systems. Running tests on a production system could leave the system in a bad state. This new feature adds: - adds a new taint type, TAINT_TEST to signal that a test has been run. This should discourage people from running these tests on production systems, and to make it easier to tell if tests have been run accidentally (by loading the wrong configuration, etc.) - several documentation and tool enhancements and fixes. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmLoOXcACgkQCwJExA0N Qxy5HQ//QehcBsN0rvNM5enP0HyJjDFxoF9HI7RxhHbwAE3LEkMQTNnFJOViJ7cY XZgvPipySkekPkvbm9uAnJw160hUSTCM3Oikf7JaxSTKS9Zvfaq9k78miQNrU2rT C9ljhLBF9y2eXxj9348jwlIHmjBwV5iMn6ncSvUkdUpDAkll2qIvtmmdiSgl33Et CRhdc07XBwhlz/hBDwj8oK2ZYGPsqjxf2CyrhRMJAOEJtY0wt971COzPj8cDGtmi nmQXiUhGejXPlzL/7hPYNr83YmYa/xGjecgDPKR3hOf5dVEVRUE2lKQ00F4GrwdZ KC6CWyXCzhhbtH7tfpWBU4ZoBdmyxhVOMDPFNJdHzuAHVAI3WbHmGjnptgV9jT7o KqgPVDW2n0fggMMUjmxR4fV2VrKoVy8EvLfhsanx961KhnPmQ6MXxL1cWoMT5BwA JtwPlNomwaee2lH9534Qgt1brybYZRGx1RDbWn2CW3kJabODptL80sZ62X5XxxRi I/keCbSjDO1mL3eEeGg/n7AsAhWrZFsxCThxSXH6u6d6jrrvCF3X2Ki5m27D1eGD Yh40Fy+FhwHSXNyVOav6XHYKhyRzJvPxM/mTGe5DtQ6YnP7G7SnfPchX4irZQOkv T2soJdtAcshnpG6z38Yd3uWM/8ARtSMaBU891ZAkFD9foniIYWE= =WzBX -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-kunit-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: "This consists of several fixes and an important feature to discourage running KUnit tests on production systems. Running tests on a production system could leave the system in a bad state. Summary: - Add a new taint type, TAINT_TEST to signal that a test has been run. This should discourage people from running these tests on production systems, and to make it easier to tell if tests have been run accidentally (by loading the wrong configuration, etc) - Several documentation and tool enhancements and fixes" * tag 'linux-kselftest-kunit-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits) Documentation: KUnit: Fix example with compilation error Documentation: kunit: Add CLI args for kunit_tool kcsan: test: Add a .kunitconfig to run KCSAN tests kunit: executor: Fix a memory leak on failure in kunit_filter_tests clk: explicitly disable CONFIG_UML_PCI_OVER_VIRTIO in .kunitconfig mmc: sdhci-of-aspeed: test: Use kunit_test_suite() macro nitro_enclaves: test: Use kunit_test_suite() macro thunderbolt: test: Use kunit_test_suite() macro kunit: flatten kunit_suite* to kunit_suite in .kunit_test_suites kunit: unify module and builtin suite definitions selftest: Taint kernel when test module loaded module: panic: Taint the kernel when selftest modules load Documentation: kunit: fix example run_kunit func to allow spaces in args Documentation: kunit: Cleanup run_wrapper, fix x-ref kunit: test.h: fix a kernel-doc markup kunit: tool: Enable virtio/PCI by default on UML kunit: tool: make --kunitconfig repeatable, blindly concat kunit: add coverage_uml.config to enable GCOV on UML kunit: tool: refactor internal kconfig handling, allow overriding kunit: tool: introduce --qemu_args ...	2022-08-02 19:34:45 -07:00
Linus Torvalds	aad26f55f4	This was a moderately busy cycle for documentation, but nothing all that earth-shaking: - More Chinese translations, and an update to the Italian translations. The Japanese, Korean, and traditional Chinese translations are more-or-less unmaintained at this point, instead. - Some build-system performance improvements. - The removal of the archaic submitting-drivers.rst document, with the movement of what useful material that remained into other docs. - Improvements to sphinx-pre-install to, hopefully, give more useful suggestions. - A number of build-warning fixes Plus the usual collection of typo fixes, updates, and more. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmLn9OwPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YtrwIAJNZoDYJJIRuVHnFkAn5EJ4b/chnR1dSTBtn WdE/1zdAlMBWVlEGO48VZybph9Sk0v+cUGf+yviDgASQrfOhRRTkg/0u6XaBAYO0 +C2D1QDd9DggGgajxsfJfTdD3IuB78mGmCQvP17XIJW+NK1CK9rXZBnj6WC5/HJw PCHzeeVreBxOS3W9GelMYa6vjVl7dv81x4DPllnsgU2AMk0/Ce0MVjeIZ695sOeP Ki6jZgC2GsgFSK5kBC35OiDe5q+fDzlLfek34EUCn4SIbMALSUYWO1db122w5Pme Ej0+UTBhD19WH1uB/rcVKnVWugi7UEUJexZsao+nC7UrdIVtYq0= =83BG -----END PGP SIGNATURE----- Merge tag 'docs-6.0' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "This was a moderately busy cycle for documentation, but nothing all that earth-shaking: - More Chinese translations, and an update to the Italian translations. The Japanese, Korean, and traditional Chinese translations are more-or-less unmaintained at this point, instead. - Some build-system performance improvements. - The removal of the archaic submitting-drivers.rst document, with the movement of what useful material that remained into other docs. - Improvements to sphinx-pre-install to, hopefully, give more useful suggestions. - A number of build-warning fixes Plus the usual collection of typo fixes, updates, and more" * tag 'docs-6.0' of git://git.lwn.net/linux: (92 commits) docs: efi-stub: Fix paths for x86 / arm stubs Docs/zh_CN: Update the translation of sched-stats to 5.19-rc8 Docs/zh_CN: Update the translation of pci to 5.19-rc8 Docs/zh_CN: Update the translation of pci-iov-howto to 5.19-rc8 Docs/zh_CN: Update the translation of usage to 5.19-rc8 Docs/zh_CN: Update the translation of testing-overview to 5.19-rc8 Docs/zh_CN: Update the translation of sparse to 5.19-rc8 Docs/zh_CN: Update the translation of kasan to 5.19-rc8 Docs/zh_CN: Update the translation of iio_configfs to 5.19-rc8 doc:it_IT: align Italian documentation docs: Remove spurious tag from admin-guide/mm/overcommit-accounting.rst Documentation: process: Update email client instructions for Thunderbird docs: ABI: correct QEMU fw_cfg spec path doc/zh_CN: remove submitting-driver reference from docs docs: zh_TW: align to submitting-drivers removal docs: zh_CN: align to submitting-drivers removal docs: ko_KR: howto: remove reference to removed submitting-drivers docs: ja_JP: howto: remove reference to removed submitting-drivers docs: it_IT: align to submitting-drivers removal docs: process: remove outdated submitting-drivers.rst ...	2022-08-02 19:24:24 -07:00
Linus Torvalds	7d9d077c78	RCU pull request for v5.20 (or whatever) This pull request contains the following branches: doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms. poll.2022.07.21a: Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods. rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead. torture.2022.06.21a: Torture-test updates. ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r vX60+QNxvUBLwA== =vUNm -----END PGP SIGNATURE----- Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes - Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms - Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods - Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead - Torture-test updates - Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y * tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits) rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings rcu: Diagnose extended sync_rcu_do_polled_gp() loops rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings rcutorture: Test polled expedited grace-period primitives rcu: Add polled expedited grace-period primitives rcutorture: Verify that polled GP API sees synchronous grace periods rcu: Make Tiny RCU grace periods visible to polled APIs rcu: Make polled grace-period API account for expedited grace periods rcu: Switch polled grace-period APIs to ->gp_seq_polled rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty rcu/nocb: Add option to opt rcuo kthreads out of RT priority rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread() rcu/nocb: Add an option to offload all CPUs on boot rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order rcu/nocb: Add/del rdp to iterate from rcuog itself rcu/tree: Add comment to describe GP-done condition in fqs loop rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs() rcu/kvfree: Remove useless monitor_todo flag rcu: Cleanup RCU urgency state for offline CPU ...	2022-08-02 19:12:45 -07:00
Linus Torvalds	a0b09f2d6f	Random number generator updates for Linux 6.0-rc1. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmLnDOwACgkQSfxwEqXe A65Fiw//Z0YaPejSslQIGitQ1b0XzdWBhyJArYDieaaiQRXMqlaSKlIUqHz38xb7 +FykUY51/SJLjHV2riPxq1OK3/MPmk6VlTd0HHihcHVmg77oZcFcv2tPnDpZoqND TsBOujLbXKwxP8tNFedRY/4+K7w+ue9BTfDjuH7aCtz7uWd+4cNJmPg3x9FCfkMA +hbcRluwE9W3Pg4OCKwv+qxL0JF3qQtNKEOp1wpnjGAZZW/I9gFNgFBEkykvcAsj TkIRDc3agPFj6QgDeRIgLdnf9KCsLubKAg5oJneeCvQztJJUCSkn8nQXxpx+4sLo GsRgvCdfL/GyJqfSAzQJVYDHKtKMkJiCiWCC/oOALR8dzHJfSlULDAjbY1m/DAr9 at+vi4678Or7TNx2ZSaUlCXXKZ+UT7yWMlQWax9JuxGk1hGYP5/eT1AH5SGjqUwF w1q8oyzxt1vUcnOzEddFXPFirnqqhAk4dQFtu83+xKM4ZssMVyeB4NZdEhAdW0ng MX+RjrVj4l5gWWuoS0Cx3LUxDCgV6WT0dN+Vl9axAZkoJJbcXLEmXwQ6NbzTLPWg 1/MT7qFTxNcTCeAArMdZvvFbeh7pOBXO42pafrK/7vDRnTMUIw9tqXNLQUfvdFQp F5flPgiVRHDU2vSzKIFtnPTyXU0RBBGvNb4n0ss2ehH2DSsCxYE= =Zy3d -----END PGP SIGNATURE----- Merge tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "Though there's been a decent amount of RNG-related development during this last cycle, not all of it is coming through this tree, as this cycle saw a shift toward tackling early boot time seeding issues, which took place in other trees as well. Here's a summary of the various patches: - The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot option have been removed, as they overlapped with the more widely supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu". This change allowed simplifying a bit of arch code. - x86's RDRAND boot time test has been made a bit more robust, with RDRAND disabled if it's clearly producing bogus results. This would be a tip.git commit, technically, but I took it through random.git to avoid a large merge conflict. - The RNG has long since mixed in a timestamp very early in boot, on the premise that a computer that does the same things, but does so starting at different points in wall time, could be made to still produce a different RNG state. Unfortunately, the clock isn't set early in boot on all systems, so now we mix in that timestamp when the time is actually set. - User Mode Linux now uses the host OS's getrandom() syscall to generate a bootloader RNG seed and later on treats getrandom() as the platform's RDRAND-like faculty. - The arch_get_random_{seed_,}_long() family of functions is now arch_get_random_{seed_,}_longs(), which enables certain platforms, such as s390, to exploit considerable performance advantages from requesting multiple CPU random numbers at once, while at the same time compiling down to the same code as before on platforms like x86. - A small cleanup changing a cmpxchg() into a try_cmpxchg(), from Uros. - A comment spelling fix" More info about other random number changes that come in through various architecture trees in the full commentary in the pull request: https://lore.kernel.org/all/20220731232428.2219258-1-Jason@zx2c4.com/ * tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: random: correct spelling of "overwrites" random: handle archrandom with multiple longs um: seed rng using host OS rng random: use try_cmpxchg in _credit_init_bits timekeeping: contribute wall clock to rng on time change x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu" random: remove CONFIG_ARCH_RANDOM	2022-08-02 17:31:35 -07:00
Linus Torvalds	043402495d	integrity-v6.0 -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCYulqTBQcem9oYXJAbGlu dXguaWJtLmNvbQAKCRDLwZzRsCrn5SBBAP9nbAW1SPa/hDqbrclHdDrS59VkSVwv 6ZO2yAmxJAptHwD+JzyJpJiZsqVN/Tu85V1PqeAt9c8az8f3CfDBp2+w7AA= =Ad+c -----END PGP SIGNATURE----- Merge tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "Aside from the one EVM cleanup patch, all the other changes are kexec related. On different architectures different keyrings are used to verify the kexec'ed kernel image signature. Here are a number of preparatory cleanup patches and the patches themselves for making the keyrings - builtin_trusted_keyring, .machine, .secondary_trusted_keyring, and .platform - consistent across the different architectures" * tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: kexec, KEYS, s390: Make use of built-in and secondary keyring for signature verification arm64: kexec_file: use more system keyrings to verify kernel image signature kexec, KEYS: make the code in bzImage64_verify_sig generic kexec: clean up arch_kexec_kernel_verify_sig kexec: drop weak attribute from functions kexec_file: drop weak attribute from functions evm: Use IS_ENABLED to initialize .enabled	2022-08-02 15:21:18 -07:00
Linus Torvalds	87fe1adb66	SafeSetID changes for Linux 6.0 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEgvWslnM+qUy+sgVg5n2WYw6TPBAFAmLofpEACgkQ5n2WYw6T PBDnXg/9E1ZZ6c/RkGG224qc1f9K+Epl4ZjFWAzDeQ84GQpa2BdBEs++JDCH9M1c YBWBjPMzry1D980VRpxtP6Of6M2SsheMuKQCBBLlO6/uJp1EgMFxFJq/kq6FIybH cZx4VZqEsw7Yt4U05I5FDfKpkdOIncGBykMmjDgPZYbGR8S03kpc80Ou9luAlEde 31SMhXpTy17yT5WMgBeGtY5OYqO+Plf5FXmS1KEA2BUDk3L3XfYurPpM5mD+Oc3a HosxT29CeqEPDl+nr96dOliSspC+81IKbHH03Ah7UiKd/12dSjxXQuqLnpksB+vr H5LjjwuS8CphnFETPx5pb+Ceia4wxJT/FOfcQlzWGh1jI1gFDTipbO04nVyRPDPa 88oQPkqDp7Sh7hCaHsUFmPBkOTwgmG9jHvgBl0656YU14BzHXr4jNMFCL/2x+LPt jAF/gws87lyyVJ/7c0VaH+V8QWB4a/B1/Gr85yT2Qge1W1T+/lRIhgGtukX+0uBw AJhPNBVjA2SFopOiBF+WuGEfmyXoUwIpMF/9UDhsvZn5Q+fa/QuuvwuER0QoorVE FbTbE60eGSPfFdxdyLBrELrDapslZLyn89SG4C3Ec/xljhp7RR8xz2c0EPvJ4HWz pDjoLG3LbJXSsst86bFJc3B45MvOcxgqIrht9PyY12l+oUKs9mY= =ESR7 -----END PGP SIGNATURE----- Merge tag 'safesetid-6.0' of https://github.com/micah-morton/linux Pull SafeSetID updates from Micah Morton: "This contains one commit that touches common kernel code, one that adds functionality internal to the SafeSetID LSM code, and a few other commits that only modify the SafeSetID LSM selftest. The commit that touches common kernel code simply adds an LSM hook in the setgroups() syscall that mirrors what is done for the existing LSM hooks in the setuid() and setgid() syscalls. This commit combined with the SafeSetID-specific one allow the LSM to filter setgroups() calls according to configured rule sets in the same way that is already done for setuid() and setgid()" * tag 'safesetid-6.0' of https://github.com/micah-morton/linux: LSM: SafeSetID: add setgroups() testing to selftest LSM: SafeSetID: Add setgroups() security policy handling security: Add LSM hook to setgroups() syscall LSM: SafeSetID: add GID testing to selftest LSM: SafeSetID: selftest cleanup and prepare for GIDs LSM: SafeSetID: fix userns bug in selftest	2022-08-02 15:12:13 -07:00
Linus Torvalds	f42e1e3e40	audit/stable-6.0 PR 20220801 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmLoEZsUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNlGRAAgSop64Uln+mokEGTcPTfD2hbmB47 Ns7mU9UNS3XgfGDoLbyCbPs5wN0nLTRonzs1oFhPxHmSMMb5nZDwYVLLC/NBuiEl jVsF10NHLNhZX2UJUcOttQHCKUTgjAEXpiV3ROBf5EI0RFN8dkPsYTUyeTm0iqSo Q1cCy5Sp81KmZgSnX9okcasNVLdWoog/H5fWrmgHpd3/g3pJQTSct3tlkJcgP20c zXbqHyGcJNnZ1VGLjNc49L4OpRQITRZIhYKEFxol5UV8C0sbTjsJdS2ztN9eKVer MJWdRCxHItTorP/0G6rb+pHdz4VfYquiV6ZMCLbgWSRCfaUrCJXRLKmUumOOP/0y UH/TEEHaCPQoA3wW5XCzTMEozawPSUjhcqJQnPS1hlV53dK+s6IuCx39mizSoJwL HdCe97hiIT4pZqUp9mgsKsBzM/QTnA0732LdCLum/YIR0ZHFbg6WPvO6vpcRA43S KT7jQJJIGW2TjL2nG4fSLxdT85QqTug+a4ar8W3Q2Jg8no4HftnbhLNuR3UhlCrF OBB9YPlFjEXOVNp7bgmsVxZKLbkuOMFmFZU3bY3Q8jUSda67zPSebTO5GlHTdRLp 6SaT2l5DsMzYBR01X1shLB78NepWU1NN0uvrC6Zr3YHpBnSW3eYmY1+2TqtL5W5U Pfd9JtotV6jdsqw= =9ygD -----END PGP SIGNATURE----- Merge tag 'audit-pr-20220801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Two minor audit patches: on marks a function as static, the other removes a redundant length check" * tag 'audit-pr-20220801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: make is_audit_feature_set() static audit: remove redundant data_len check	2022-08-02 14:56:25 -07:00
Linus Torvalds	d7b767b508	execve updates for v5.20-rc1 - Allow unsharing time namespace on vfork+exec (Andrei Vagin) - Replace usage of deprecated kmap APIs (Fabio M. De Francesco) - Fix spelling mistake (Zhang Jiaming) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoDyAWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh0mEACL07hj3eT3rWg6ohZx9sCTcAjY /tG+zxLQ7xu717nM1a4j7CI5kdNNpYbsCqG71ikDDRrOCeEutu7M8zE1emctjtHv oh853D6BKhV2Hvsiuk1oM2ZHR1bmgiW1eFNAJcCLz6rE6wYu564R0wYJV0h418fH Rjk+Y989A7Srs9t/9GQSktjX3Q039/PG28avhA5q144/ZNycr5FnLFOf4RlmzEUz 7E8TfGsftX8eRAfxW/dPiWuIKMuYPLqspca9pT3aFj3ze2qKnldjNV3c9M5ajL5Q q7KKWeWzunKyYHMaRzIxkHyhs396ZGKFN2PbcNYyml+NBItyc3fCHishMF7bW0Vb nyZbmYJslBloYmrSJYgqCfxyjUuhe0cMMk9iMzDVp6ROwtLgFFLwfwunM6RwRmnr dAmM8QGwSE3qYLhVnLEcRqpgdXzVd+S0TGhB5k5AyI3628/mLxhE66/eWq0X8QF5 los5zku1GagMkylt6SOGb3TME4JZe6ZdZpU4fe/ilM22qw852xgbF3+6Zap6IBbD AdzXVCHyU/obORfIxx5KTF213m4KpkWBBi3N1/vVlxIAFAUy1WdXDM1o2RPMD7hw DeHe8sgfTZxLmSqfWLuX+3qC94IvrbDPFaRCIMj1QNK0ltM8I9oHRPcUFyZMaV0O xHN/5QtmgVDfKA3mTw== =82SS -----END PGP SIGNATURE----- Merge tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Allow unsharing time namespace on vfork+exec (Andrei Vagin) - Replace usage of deprecated kmap APIs (Fabio M. De Francesco) - Fix spelling mistake (Zhang Jiaming) * tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Call kmap_local_page() in copy_string_kernel() exec: Fix a spelling mistake selftests/timens: add a test for vfork+exit fs/exec: allow to unshare a time namespace on vfork+exec	2022-08-02 14:36:19 -07:00
Jason A. Donenfeld	151c8e499f	wireguard: ratelimiter: use hrtimer in selftest Using msleep() is problematic because it's compared against ratelimiter.c's ktime_get_coarse_boottime_ns(), which means on systems with slow jiffies (such as UML's forced HZ=100), the result is inaccurate. So switch to using schedule_hrtimeout(). However, hrtimer gives us access only to the traditional posix timers, and none of the _COARSE variants. So now, rather than being too imprecise like jiffies, it's too precise. One solution would be to give it a large "range" value, but this will still fire early on a loaded system. A better solution is to align the timeout to the actual coarse timer, and then round up to the nearest tick, plus change. So add the timeout to the current coarse time, and then schedule_hrtimer() until the absolute computed time. This should hopefully reduce flakes in CI as well. Note that we keep the retry loop in case the entire function is running behind, because the test could still be scheduled out, by either the kernel or by the hypervisor's kernel, in which case restarting the test and hoping to not be scheduled out still helps. Fixes: `e7096c131e` ("net: WireGuard secure network tunnel") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-02 13:47:50 -07:00
Linus Torvalds	c013d0af81	for-5.20/block-2022-07-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR 4N1HEozvIw== =celk -----END PGP SIGNATURE----- Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...	2022-08-02 13:46:35 -07:00
Linus Torvalds	b349b1181d	for-5.20/io_uring-2022-07-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm5gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmKMD/4l3QIrLbjYIxlfrzQcHbmYuUkbQtj3SbZg 6ejbnGVhCs1P9DdXH8MgE2BxgpiXQE0CqOK7vbSoo5ep2n2UTLI2DIxAl74SMIo7 0wmJXtUJySuViKr3NYVHqlN180MkQYddBz0nGElhkQBPBCMhW8CrtPCeURr/YyHp 2RxSYBXiUx2gRyig+klnp6oPEqelcBZJUyNHdA9yVrgl/RhB/t2rKj7D++8ukQM3 Zuyh8WIkTeTfUz9hdGG7fuCEdZN4DlO2CCEc7uy0cKi6VRCKH4hYUCqClJ+/cfd2 43dUI2O7B6D1t/ObFh8AGIDXBDqVA6ePQohQU6gooRkfQiBPKkc9d0ts4yIhRqca AjkzNM+0Eve3A01loJ8J84w8oZnvNpYEv5n8/sZVLWcyU3UIs0I88nC2OBiFtoRq d77CtFLwOTo+r3STtAhnZOqez90rhS6BqKtqlUP346PCuFItl6/MbGtwdTbLYEFj CVNIb2pERWSr2NxGv4lFyXaX/cRwruxojWH7yc3rRYjr4Ykevd1pe/fMGNiMAnKw 5em/3QU3qq0ZVcXLMihksKeHHFIQwGDRMuyuv/fktV10+yYXQ0t16WzkJT3aR8Xo cqs0r8+6Jnj3uYcOMzj/FoLcpEPr21hnwAtzLto1mG1Wh4JRn/D7Nx5zqxPLxcW+ NiU6VihPOw== =gxeV -----END PGP SIGNATURE----- Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - As per (valid) complaint in the last merge window, fs/io_uring.c has grown quite large these days. io_uring isn't really tied to fs either, as it supports a wide variety of functionality outside of that. Move the code to io_uring/ and split it into files that either implement a specific request type, and split some code into helpers as well. The code is organized a lot better like this, and io_uring.c is now < 4K LOC (me). - Deprecate the epoll_ctl opcode. It'll still work, just trigger a warning once if used. If we don't get any complaints on this, and I don't expect any, then we can fully remove it in a future release (me). - Improve the cancel hash locking (Hao) - kbuf cleanups (Hao) - Efficiency improvements to the task_work handling (Dylan, Pavel) - Provided buffer improvements (Dylan) - Add support for recv/recvmsg multishot support. This is similar to the accept (or poll) support for have for multishot, where a single SQE can trigger everytime data is received. For applications that expect to do more than a few receives on an instantiated socket, this greatly improves efficiency (Dylan). - Efficiency improvements for poll handling (Pavel) - Poll cancelation improvements (Pavel) - Allow specifiying a range for direct descriptor allocations (Pavel) - Cleanup the cqe32 handling (Pavel) - Move io_uring types to greatly cleanup the tracing (Pavel) - Tons of great code cleanups and improvements (Pavel) - Add a way to do sync cancelations rather than through the sqe -> cqe interface, as that's a lot easier to use for some use cases (me). - Add support to IORING_OP_MSG_RING for sending direct descriptors to a different ring. This avoids the usually problematic SCM case, as we disallow those. (me) - Make the per-command alloc cache we use for apoll generic, place limits on it, and use it for netmsg as well (me). - Various cleanups (me, Michal, Gustavo, Uros) * tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits) io_uring: ensure REQ_F_ISREG is set async offload net: fix compat pointer in get_compat_msghdr() io_uring: Don't require reinitable percpu_ref io_uring: fix types in io_recvmsg_multishot_overflow io_uring: Use atomic_long_try_cmpxchg in __io_account_mem io_uring: support multishot in recvmsg net: copy from user before calling __get_compat_msghdr net: copy from user before calling __copy_msghdr io_uring: support 0 length iov in buffer select in compat io_uring: fix multishot ending when not polled io_uring: add netmsg cache io_uring: impose max limit on apoll cache io_uring: add abstraction around apoll cache io_uring: move apoll cache to poll.c io_uring: consolidate hash_locked io-wq handling io_uring: clear REQ_F_HASH_LOCKED on hash removal io_uring: don't race double poll setting REQ_F_ASYNC_DATA io_uring: don't miss setting REQ_F_DOUBLE_POLL io_uring: disable multishot recvmsg io_uring: only trace one of complete or overflow ...	2022-08-02 13:20:44 -07:00
Zhen Lei	0f03d6805b	sched/debug: Print each field value left-aligned in sched_show_task() Currently, the values of some fields are printed right-aligned, causing the field value to be next to the next field name rather than next to its own field name. So print each field value left-aligned, to make it more readable. Before: stack: 0 pid: 307 ppid: 2 flags:0x00000008 After: stack:0 pid:308 ppid:2 flags:0x0000000a This also makes them print in the same style as the other two fields: task:demo0 state:R running task Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com	2022-08-02 21:45:35 +02:00
Masami Hiramatsu (Google)	2f63e5d2e3	tracing/eprobe: Show syntax error logs in error_log file Show the syntax errors for event probes in error_log file as same as other dynamic events, so that user can understand what is the problem. Link: https://lkml.kernel.org/r/165932113556.2850673.3483079297896607612.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2022-08-02 15:22:55 -04:00

... 4 5 6 7 8 ...

40229 Commits