OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Haojie Ning	a1574c433d	rue/mm: add sysctl_vm_use_priority_oom to enable priority oom for all cgroups Add sysctl_vm_use_priority_oom as a global setting to enable the priority_oom setting for all cgroups without the need to manually set it for each cgroup. This global setting has no effect when it is turned off. Signed-off-by: Haojie Ning <paulning@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:32 +08:00
Honglin Li	b82ababba6	rue/mm: introduce new feature to async clean dying memcgs When a memcg is removed, page cache and slab pages can still reference it, which leaves a very large number of dying memcgs in our system. This feature cleans dying memcgs asynchronously. 1) sysctl -w vm.clean_dying_memcg_async=1 # start a kthread to clean dying memcgs asynchronously; the default value is 0. 2) sysctl -w vm.clean_dying_memcg_threshold=10 # whenever 10 dying memcgs have been generated in the system, wake up a kthread to clean them asynchronously; the default value is 100. Signed-off-by: Bin Lai <robinlai@tencent.com> Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	200560da23	rue/mm: introduce memcg page cache hit & miss ratio tool A new memory.page_cache_hit control file is added under each memory cgroup directory. Reading this file (for example with cat) prints the page cache hit and miss ratio at the memory cgroup level. Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	8de07be077	rue/mm: introduce memory allocation latency for per-cgroup tool A new memory.latency_histogram control file is added under each memory cgroup directory. Reading this file (for example with cat) prints the memory access latency at the memory cgroup level. Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	1824581599	rue/mm: async free memory while process exiting Introduce async free memory while process exiting to shorten exit time. Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com> Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	75ad2bae3d	rue/mm: pagecache limit per cgroup support Functional test: http://tapd.oa.com/TencentOS_QoS/prong/stories/view/ 1020426664867405667?jump_count=1 Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com> Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Signed-off-by: Xuan Liu <benxliu@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	0d35c4c639	rue/mm: introduce memcg priority oom Under memory pressure, reclaim and OOM can happen. When multiple cgroups exist in one system, we may want the memory or tasks of some of them to survive reclaim and OOM while other candidates are available. When OOM happens, the victim is always chosen from a low-priority memcg. This works for both memcg OOM and global OOM. It can be enabled or disabled through @memory.use_priority_oom; for global OOM, use the root memcg's @memory.use_priority_oom. It is disabled by default. Signed-off-by: Haiwei Li <gerryhwli@tencent.com> Signed-off-by: Mengmeng Chen <bauerchen@tencent.com> Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:31 +08:00
Honglin Li	db44c11cdd	rue/mm: add priority reclaim support Introduce the sync && async priority reclaim mechanism. Signed-off-by: Yu Liu <allanyuliu@tencent.com> Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:30 +08:00
Honglin Li	55f6748cd1	rue/net: adapt to the new rue modular framework Add to register and unregister rue net ops through rue modular framework. Signed-off-by: Honglin Li <honglinli@tencent.com> Reviewed-by: Haisu Wang <haisuwang@tencent.com>	2024-09-27 11:13:30 +08:00
Honglin Li	703664bf47	rue/net: add support for cgroup whitelist ports Introduce the cgroup whitelist ports mechanism. Signed-off-by: Honglin Li <honglinli@tencent.com> Signed-off-by: Zhiping Du <zhipingdu@tencent.com>	2024-09-27 11:13:30 +08:00
Haisu Wang	0f93976785	rue: Revert "kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()" Export the two functions again for module like RUE This reverts commit `0bd476e6c6`. Signed-off-by: Haisu Wang <haisuwang@tencent.com> Signed-off-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:29 +08:00
Ze Gao	d5a175186d	rue: Add support for rue modularization Add framework support to enable rue to be installed as a separate module. In order to safely insmod/rmmod, we use per-cpu counter to track how many rue related functions are on the fly, and it's only safe to insmod/rmmod when there's no tasks using any of these functions registered by rue module. Signed-off-by: Ze Gao <zegao@tencent.com>	2024-09-27 11:13:29 +08:00
Hongbo Li	5dc70a633d	rue: init rue module Add the init code of rue module. Support both built-in and module(default) way. Signed-off-by: Hongbo Li <herberthbli@tencent.com> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: Honglin Li <honglinli@tencent.com>	2024-09-27 11:13:29 +08:00
Hongbo Li	fce3609ebf	rue: cgroup priority Add cgroup priority. Signed-off-by: Hongbo Li <herberthbli@tencent.com> Signed-off-by: Lei Chen <lennychen@tencent.com> Signed-off-by: Yu Liu <allanyuliu@tencent.com>	2024-09-27 11:13:29 +08:00
Haisu Wang	b03afc0d33	Revert "io/tqos: merge buffer io limit series patch from brookxu, and rework some function." This reverts commit `538ec11bed`. Reverted because the buffer IO function is being refactored. In TK5 it is unnecessary to keep kABI compatibility by using the "nodeinfo" in "struct mem_cgroup {}". Original tapd and MR: https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502 https://git.woa.com/tlinux/tkernel5/-/merge_requests/117 Signed-off-by: Haisu Wang <haisuwang@tencent.com>	2024-09-27 11:13:24 +08:00
Haisu Wang	3231efb956	Revert "io/tqos: add sysctl_buffer_io_limit switch for buffer io limit." This reverts commit `4d87de6bb4`. Reverted because the buffer IO function is being refactored. In TK5 it is unnecessary to keep kABI compatibility by using the "nodeinfo" in "struct mem_cgroup {}". Original tapd and MR: https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502 https://git.woa.com/tlinux/tkernel5/-/merge_requests/117 Signed-off-by: Haisu Wang <haisuwang@tencent.com>	2024-09-27 11:13:21 +08:00
Haisu Wang	24cfc0a666	Revert "cgroup: allow cgroup to split direct io and buffered io into different blkio cgroup" This reverts commit `71aaa09350`. Reverted because the buffer IO function is being refactored. In TK5 it is unnecessary to keep kABI compatibility by using the "nodeinfo" in "struct mem_cgroup {}". Original tapd and MR: https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502 https://git.woa.com/tlinux/tkernel5/-/merge_requests/117 Signed-off-by: Haisu Wang <haisuwang@tencent.com>	2024-09-27 11:12:41 +08:00
Jianping Liu	64a21c8a25	hung_task,watchdog: set thresh time to 600 seconds When CONFIG_KASAN is enabled, the kernel runs much slower, so set the hung_task and soft lockup threshold time to 600 seconds. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-09-05 15:24:07 +08:00
Ze Gao	2e2ffe48c5	rue/scx: Fix cgroupv2 cpu controller regression Due to the odd behavior of gcc designated initializer, we have to carefully order the fields inside cpu_cftypes. otherwise some important interfaces like cpu.max could be lost. Checkout details in [1] [1]: https://onlinegdb.com/T-AMLp4zw Fixes: `8c320a09af` ("rue/scx: Add cpu.offline to maintain SCHED_BT compatibility") Fixes: `2b9d28baab` ("rue/scx: Add cpu.scx to the cpu cgroup controller") Reported-by: likexu <likexu@tencent.com> Signed-off-by: Ze Gao <zegao@tencent.com>	2024-09-03 02:47:04 +00:00
Jianping Liu	dbef74015d	watchdog: increase watchdog_thresh max value to 300 in debug kernel If CONFIG_KASAN or CONFIG_KCSAN is enabled, the system runs much slower. Increase watchdog_thresh's max value to avoid soft lockup or hung task reports when running a heavy test suite. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-08-30 17:19:41 +08:00
Jianping Liu	63e2660c48	config,oc: support WLAN and MTD and more SND drivers OpenCloud partners want to use wireless cards and sound cards, so enable the configs to support them. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-08-26 16:33:47 +08:00
Jianping Liu	0569444d2a	Merge linux 6.6.47 Conflicts: net/sunrpc/svc.c	2024-08-24 09:43:23 +08:00
Jianping Liu	0a76ebf09a	Merge linux 6.6.46 Conflicts: drivers/platform/x86/intel/ifs/core.c drivers/platform/x86/intel/ifs/ifs.h kernel/sched/core.c	2024-08-24 09:37:59 +08:00
Jianping Liu	d6563b9042	Merge OCK next branch to TK5 master branch	2024-08-23 19:52:09 +08:00
frankjpliu	897ad8fab4	Merge branch 'zegao/scx3' into 'master' (merge request !150 ) Add some general scx in-kernel support `5aec0abf10` rue/scx: Kill user tasks in SCHED_EXT when scheduler is gone `a1752a5760` rue/scx: Add readonly sysctl knob kernel.cpu_qos for SCHED_BT compatibility `ed0889e48a` rue/scx: Add /proc/bt_stat to maintain SCHED_BT compatibility `8c320a09af` rue/scx: Add cpu.offline to maintain SCHED_BT compatibility `2b9d28baab` rue/scx: Add cpu.scx to the cpu cgroup controller `576ee0803a` rue/scx: Add /proc/scx_stat to do scx cputime accounting `67d151255e` rue/scx: Fix lockdep warn on printk with rq lock held `ebf91df4dc` rue/scx: Reorder scx_fork_rwsem, cpu_hotplug_lock and scx_cgroup_rwsem	2024-08-23 11:40:38 +00:00
Yongliang Gao	44f5072e76	Revert "sched: adaptive default skew_tick value" This reverts commit `ca7d96bf43`. Maintain consistency and alignment with upstream, and this patch is not very friendly to virtualization. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com>	2024-08-23 11:32:30 +00:00
frankjpliu	1541ee2d1b	Merge branch 'remotes/origin/huntazhang/cmdlog' into 'master' (merge request !140 ) Adapt cmdlog	2024-08-23 11:21:29 +00:00
Jianping Liu	f03179c2a4	submodule: update emm and thirdparty/release-drivers emm update to v0.1.7.2 release-drivers update to v1.0 Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-08-20 14:53:00 +08:00
Alexei Starovoitov	63f13eb5d6	bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie. [ Upstream commit 59f2f841179aa6a0899cb9cf53659149a35749b7 ] syzbot reported the following lock sequence: cpu 2: grabs timer_base lock spins on bpf_lpm lock cpu 1: grab rcu krcp lock spins on timer_base lock cpu 0: grab bpf_lpm lock spins on rcu krcp lock bpf_lpm lock can be the same. timer_base lock can also be the same due to timer migration. but rcu krcp lock is always per-cpu, so it cannot be the same lock. Hence it's a false positive. To avoid lockdep complaining move kfree_rcu() after spin_unlock. Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-19 06:04:27 +02:00
Kees Cook	ef33f02968	bpf: Replace bpf_lpm_trie_key 0-length array with flexible array [ Upstream commit 896880ff30866f386ebed14ab81ce1ad3710cfc4 ] Replace deprecated 0-length array in struct bpf_lpm_trie_key with flexible array. Found with GCC 13: ../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=] 207 | *(__be16 *)&key->data[i]); | ^~~~~~~~~~~~~ ../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16' 102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x)) | ^ ../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu' 97 | #define be16_to_cpu __be16_to_cpu | ^~~~~~~~~~~~~ ../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu' 206 | u16 diff = be16_to_cpu(*(__be16 *)&node->data[i] ^ | ^~~~~~~~~~~ In file included from ../include/linux/bpf.h:7: ../include/uapi/linux/bpf.h:82:17: note: while referencing 'data' 82 | __u8 data[0]; /* Arbitrary size */ | ^~~~ And found at run-time under CONFIG_FORTIFY_SOURCE: UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49 index 0 is out of range for type '__u8 []' Changing struct bpf_lpm_trie_key is difficult since has been used by userspace. For example, in Cilium: struct egress_gw_policy_key { struct bpf_lpm_trie_key lpm_key; __u32 saddr; __u32 daddr; }; While direct references to the "data" member haven't been found, there are static initializers that include the final member. For example, the "{}" here: struct egress_gw_policy_key in_key = { .lpm_key = { 32 + 24, {} }, .saddr = CLIENT_IP, .daddr = EXTERNAL_SVC_IP & 0Xffffff, }; To avoid the build time and run time warnings seen with a 0-sized trailing array for struct bpf_lpm_trie_key, introduce a new struct that correctly uses a flexible array for the trailing bytes, struct bpf_lpm_trie_key_u8. As part of this, include the "header" portion (which is just the "prefixlen" member), so it can be used by anything building a bpf_lpm_trie_key that has trailing members that aren't a u8 flexible array (like the self-test[1]), which is named struct bpf_lpm_trie_key_hdr. Unfortunately, C++ refuses to parse the __struct_group() helper, so it is not possible to define struct bpf_lpm_trie_key_hdr directly in struct bpf_lpm_trie_key_u8, so we must open-code the union directly. Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out, and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment to the UAPI header directing folks to the two new options. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Closes: https://paste.debian.net/hidden/ca500597/ Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/ [1] Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org Stable-dep-of: 59f2f841179a ("bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-19 06:04:27 +02:00
Yafang Shao	dd9542ae7c	cgroup: Make operations on the cgroup root_list RCU safe commit d23b5c577715892c87533b13923306acc6243f93 upstream. At present, when we perform operations on the cgroup root_list, we must hold the cgroup_mutex, which is a relatively heavyweight lock. In reality, we can make operations on this list RCU-safe, eliminating the need to hold the cgroup_mutex during traversal. Modifications to the list only occur in the cgroup root setup and destroy paths, which should be infrequent in a production environment. In contrast, traversal may occur frequently. Therefore, making it RCU-safe would be beneficial. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> To: Michal Koutný <mkoutny@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 06:04:25 +02:00
Dongli Zhang	bcd5148043	genirq/cpuhotplug: Retry with cpu_online_mask when migration fails commit 88d724e2301a69c1ab805cd74fc27aa36ae529e0 upstream. When a CPU goes offline, the interrupts affine to that CPU are re-configured. Managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. The migration of managed interrupts is guaranteed on x86 because there are interrupt vectors reserved. Regular interrupts are migrated to a still online CPU in the affinity mask or if there is no online CPU to any online CPU. This works as long as the still online CPUs in the affinity mask have interrupt vectors available, but in case that none of those CPUs has a vector available the migration fails and the device interrupt becomes stale. This is not any different from the case where the affinity mask does not contain any online CPU, but there is no fallback operation for this. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 06:04:24 +02:00
David Stevens	20dbad7525	genirq/cpuhotplug: Skip suspended interrupts when restoring affinity commit a60dd06af674d3bb76b40da5d722e4a0ecefe650 upstream. irq_restore_affinity_of_irq() restarts managed interrupts unconditionally when the first CPU in the affinity mask comes online. That's correct during normal hotplug operations, but not when resuming from S3 because the drivers are not resumed yet and interrupt delivery is not expected by them. Skip the startup of suspended interrupts and let resume_device_irqs() deal with restoring them. This ensures that irqs are not delivered to drivers during the noirq phase of resuming from S3, after non-boot CPUs are brought back online. Signed-off-by: David Stevens <stevensd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 06:04:24 +02:00
刘诗	89747eb5ee	!206 [next] Zhaoxin : cpufreq: ACPI: add ITMT support when CPPC enabled Merge pull request !206 from LeoLiu-oc/next-6.6-57-cppc	2024-08-16 01:04:50 +00:00
leoliu-oc	c090c94dbb	Set ASYM_PACKING Flag on Zhaoxin KH-40000 platform Set ASYM_PACKING Flag on Zhaoxin KH-40000 platform Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>	2024-08-15 11:35:28 +08:00
leoliu-oc	ab7f82d722	Add kh40000_direct_dma_ops for KH-40000 platform Add 'kh40000_direct_dma_ops' to replace 'direct_dma_ops' for KH-40000 platform. For coherent DMA access, memory can be allocated only from the memory node of the node where the device resides. For streaming DMA access, add a PCI read operation at the end of DMA access. Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>	2024-08-15 11:07:01 +08:00
Yang Yingliang	78f1990b6b	sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() commit fe7a11c78d2a9bdb8b50afc278a31ac177000948 upstream. If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback. Fixes: `120455c514` ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:59:00 +02:00
Yang Yingliang	4c15b20c26	sched/core: Introduce sched_set_rq_on/offline() helper commit 2f027354122f58ee846468a6f6b48672fff92e9b upstream. Introduce sched_set_rq_on/offline() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:59:00 +02:00
Yang Yingliang	65727331b6	sched/smt: Fix unbalance sched_smt_present dec/inc commit e22f910a26cc2a3ac9c66b8e935ef2a7dd881117 upstream. I got the following warn report while doing stress test: jump label: negative count! WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0 Call Trace: <TASK> __static_key_slow_dec_cpuslocked+0x16/0x70 sched_cpu_deactivate+0x26e/0x2a0 cpuhp_invoke_callback+0x3ad/0x10d0 cpuhp_thread_fun+0x3f5/0x680 smpboot_thread_fn+0x56d/0x8d0 kthread+0x309/0x400 ret_from_fork+0x41/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the cpu offline failed, but sched_smt_present is decremented before calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path. Fixes: `c5511d03ec` ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:59:00 +02:00
Yang Yingliang	41d856565d	sched/smt: Introduce sched_smt_present_inc/dec() helper commit 31b164e2e4af84d08d2498083676e7eeaa102493 upstream. Introduce sched_smt_present_inc/dec() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:59:00 +02:00
Waiman Long	924f788c90	padata: Fix possible divide-by-0 panic in padata_mt_helper() commit 6d45e1c948a8b7ed6ceddb14319af69424db730c upstream. We are hit with a not easily reproducible divide-by-0 panic in padata.c at bootup time. [ 10.017908] Oops: divide error: 0000 [#1] PREEMPT SMP NOPTI [ 10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1 [ 10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021 [ 10.017908] Workqueue: events_unbound padata_mt_helper [ 10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0 : [ 10.017963] Call Trace: [ 10.017968] <TASK> [ 10.018004] ? padata_mt_helper+0x39/0xb0 [ 10.018084] process_one_work+0x174/0x330 [ 10.018093] worker_thread+0x266/0x3a0 [ 10.018111] kthread+0xcf/0x100 [ 10.018124] ret_from_fork+0x31/0x50 [ 10.018138] ret_from_fork_asm+0x1a/0x30 [ 10.018147] </TASK> Looking at the padata_mt_helper() function, the only way a divide-by-0 panic can happen is when ps->chunk_size is 0. The way that chunk_size is initialized in padata_do_multithreaded(), chunk_size can be 0 when the min_chunk in the passed-in padata_mt_job structure is 0. Fix this divide-by-0 panic by making sure that chunk_size will be at least 1 no matter what the input parameters are. Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com Fixes: `004ed42638` ("padata: add basic support for multithreaded jobs") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Waiman Long <longman@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:59 +02:00
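The guard described in `924f788c90` can be sketched in userspace C. This is only an illustration under stated assumptions: `pick_chunk_size()` and `chunks_needed()` are hypothetical helpers, not the kernel's padata functions, and the sizing formula is simplified. It shows how a computed chunk size can reach 0 when `min_chunk` is 0, and how forcing a floor of 1 keeps the later division safe.

```c
#include <assert.h>

/*
 * Hypothetical helper (not the kernel's padata code): derive a chunk size
 * from the job size and worker count, then force it to be at least 1.
 */
static unsigned long pick_chunk_size(unsigned long size,
                                     unsigned long nworks,
                                     unsigned long min_chunk)
{
    unsigned long chunk;

    assert(nworks > 0);              /* caller must supply at least one worker */

    chunk = size / nworks;           /* 0 whenever size < nworks */
    if (chunk < min_chunk)
        chunk = min_chunk;           /* still 0 when min_chunk == 0 */
    if (chunk < 1)
        chunk = 1;                   /* the guard: never return 0 */
    return chunk;
}

/* Hypothetical consumer: this division would trap if chunk were 0. */
static unsigned long chunks_needed(unsigned long size, unsigned long chunk)
{
    return (size + chunk - 1) / chunk;
}
```

With the floor in place, `chunks_needed(size, pick_chunk_size(size, nworks, 0))` is well defined for every `size`, including 0.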
Tze-nan Wu	a172c7b22b	tracing: Fix overflow in get_free_elt() commit bcf86c01ca4676316557dd482c8416ece8c2e143 upstream. "tracing_map->next_elt" in get_free_elt() is at risk of overflowing. Once it overflows, new elements can still be inserted into the tracing_map even though the maximum number of elements (`max_elts`) has been reached. Continuing to insert elements after the overflow could result in the tracing_map containing "tracing_map->max_size" elements, leaving no empty entries. If any attempt is made to insert an element into a full tracing_map using `__tracing_map_insert()`, it will cause an infinite loop with preemption disabled, leading to a CPU hang problem. Fix this by preventing any further increments to "tracing_map->next_elt" once it reaches "tracing_map->max_elt". Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: `08d43a5fa0` ("tracing: Add lock-free tracing_map") Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com> Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com> Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:58 +02:00
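The idea in `a172c7b22b`, to stop incrementing the element counter once it reaches the maximum, can be sketched with C11 atomics. The names `slot_pool` and `claim_slot()` are hypothetical and this is not the kernel's tracing_map code; it only illustrates a bounded increment that cannot overflow, however many times a full pool is probed.

```c
#include <stdatomic.h>

/* Hypothetical pool: 'next' is the next free index, 'max' the capacity. */
struct slot_pool {
    atomic_int next;
    int max;
};

/*
 * Claim the next free index, or return -1 when the pool is full.
 * The counter is only ever advanced while it is below 'max', so repeated
 * calls on a full pool leave it unchanged instead of letting it wrap.
 */
static int claim_slot(struct slot_pool *p)
{
    int cur = atomic_load(&p->next);

    while (cur < p->max) {
        /* On failure, 'cur' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&p->next, &cur, cur + 1))
            return cur;
    }
    return -1;
}
```

An unconditional increment followed by a range check would hand out valid-looking indices again after the counter wraps, which is the failure mode the commit message describes.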
Shay Drory	0688cacd0e	genirq/irqdesc: Honor caller provided affinity in alloc_desc() commit edbbaae42a56f9a2b39c52ef2504dfb3fb0a7858 upstream. Currently, whenever a caller is providing an affinity hint for an interrupt, the allocation code uses it to calculate the node and copies the cpumask into irq_desc::affinity. If the affinity for the interrupt is not marked 'managed' then the startup of the interrupt ignores irq_desc::affinity and uses the system default affinity mask. Prevent this by setting the IRQD_AFFINITY_SET flag for the interrupt in the allocator, which causes irq_setup_affinity() to use irq_desc::affinity on interrupt startup if the mask contains an online CPU. [ tglx: Massaged changelog ] Fixes: `45ddcecbfa` ("genirq: Use affinity hint in irqdesc allocation") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/20240806072044.837827-1-shayd@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:58 +02:00
Andrey Konovalov	d0137ce03f	kcov: properly check for softirq context commit 7d4df2dad312f270d62fecb0e5c8b086c6d7dcfc upstream. When collecting coverage from softirqs, KCOV uses in_serving_softirq() to check whether the code is running in the softirq context. Unfortunately, in_serving_softirq() is > 0 even when the code is running in the hardirq or NMI context for hardirqs and NMIs that happened during a softirq. As a result, if a softirq handler contains a remote coverage collection section and a hardirq with another remote coverage collection section happens during handling the softirq, KCOV incorrectly detects a nested softirq coverage collection section and prints a WARNING, as reported by syzbot. This issue was exposed by commit a7f3813e589f ("usb: gadget: dummy_hcd: Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using hrtimer and made the timer's callback be executed in the hardirq context. Change the related checks in KCOV to account for this behavior of in_serving_softirq() and make KCOV ignore remote coverage collection sections in the hardirq and NMI contexts. This prevents the WARNING printed by syzbot but does not fix the inability of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd is in use (caused by a7f3813e589f); a separate patch is required for that. Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev Fixes: `5ff3b30ab5` ("kcov: collect coverage from interrupts") Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac Acked-by: Marco Elver <elver@google.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Marcello Sylvester Bauer <sylv@sylv.io> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:57 +02:00
Thomas Gleixner	65d76c0aa2	timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex() commit 5916be8a53de6401871bdd953f6c60237b47d6d3 upstream. The addition of the bases argument to clock_was_set() fixed up all call sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0. As a result the effect of that clock_was_set() notification is incomplete and might result in timers expiring late because the hrtimer code does not re-evaluate the affected clock bases. Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code which clock bases need to be re-evaluated. Fixes: `17a1b8826b` ("hrtimer: Add bases argument to clock_was_set()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:57 +02:00
Justin Stitt	ae5848cb5b	ntp: Safeguard against time_constant overflow commit 06c03c8edce333b9ad9c6b207d93d3a5ae7c10c0 upstream. Using syzkaller with the recently reintroduced signed integer overflow sanitizer produces this UBSAN report: UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18 9223372036854775806 + 4 cannot be represented in type 'long' Call Trace: handle_overflow+0x171/0x1b0 __do_adjtimex+0x1236/0x1440 do_adjtimex+0x2be/0x740 The user supplied time_constant value is incremented by four and then clamped to the operating range. Before commit `eea83d896e` ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping after incrementing which does not work correctly when the user supplied value is in the overflow zone of the '+ 4' operation. The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync. Similar to the fixups for time_maxerror and time_esterror, clamp the user space supplied value to the operating range. [ tglx: Switch to clamping ] Fixes: `eea83d896e` ("ntp: NTP4 user space bits update") Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com Closes: https://github.com/KSPP/linux/issues/352 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:56 +02:00
Paul E. McKenney	9d6193fd91	clocksource: Fix brown-bag boolean thinko in cs_watchdog_read() [ Upstream commit f2655ac2c06a15558e51ed6529de280e1553c86e ] The current "nretries > 1 || nretries >= max_retries" check in cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if nretries is greater than 1. The intent is instead to never warn on the first try, but otherwise warn if the successful retry was the last retry. Therefore, change that "||" to "&&". Fixes: `db3a34e174` ("clocksource: Retry clock read if long delays detected") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-14 13:58:56 +02:00
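The boolean change in `9d6193fd91` is easy to check by evaluating both forms of the condition side by side. The sketch below lifts only the expression quoted in the commit message; the function names `warn_old()` and `warn_fixed()` are hypothetical and nothing else from cs_watchdog_read() is reproduced.

```c
#include <stdbool.h>

/* Condition as quoted before the fix: true for every nretries > 1. */
static bool warn_old(int nretries, int max_retries)
{
    return nretries > 1 || nretries >= max_retries;
}

/* Condition after the fix: both clauses must hold. */
static bool warn_fixed(int nretries, int max_retries)
{
    return nretries > 1 && nretries >= max_retries;
}
```

For example, with `max_retries` equal to 5, `warn_old(2, 5)` is true while `warn_fixed(2, 5)` is false; only `warn_fixed(5, 5)` warns, which matches the stated intent of warning when the successful retry was the last one allowed.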
Feng Tang	03c3855528	clocksource: Scale the watchdog read retries automatically [ Upstream commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9 ] On an 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts: clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable tsc: Marking TSC unstable due to clocksource watchdog TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152) clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896. clocksource: Switched to clocksource hpet The reason is that for platforms with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchdog/clocksource during boot or when the system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs. The current code already has logic to detect and filter such high latency cases by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probability that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs. There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime. Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely. [ tglx: Massaged change log and comments ] Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jin Wang <jin1.wang@intel.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com Stable-dep-of: f2655ac2c06a ("clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-14 13:58:56 +02:00
Justin Stitt	b5cf99eb7a	ntp: Clamp maxerror and esterror to operating range [ Upstream commit 87d571d6fb77ec342a985afa8744bb9bb75b3622 ] Using syzkaller alongside the newly reintroduced signed integer overflow sanitizer spits out this report: UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16 9223372036854775807 + 500 cannot be represented in type 'long' Call Trace: handle_overflow+0x171/0x1b0 second_overflow+0x2d6/0x500 accumulate_nsecs_to_secs+0x60/0x160 timekeeping_advance+0x1fe/0x890 update_wall_time+0x10/0x30 time_maxerror is unconditionally incremented and the result is checked against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting in wrap-around to negative space. Before commit `eea83d896e` ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping in handle_overflow() which does not work correctly when the user supplied value is in the overflow zone of the '+ 500' operation. The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync. Miroslav confirmed that the input value should be clamped to the operating range and the same applies to time_esterror. The latter is not used by the kernel, but the value still should be in the operating range as it was before the sanity check got removed. Clamp them to the operating range. [ tglx: Changed it to clamping and included time_esterror ] Fixes: `eea83d896e` ("ntp: NTP4 user space bits update") Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com Closes: https://github.com/KSPP/linux/issues/354 Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-14 13:58:56 +02:00
Thomas Gleixner	b9d604933d	tick/broadcast: Move per CPU pointer access into the atomic section commit 6881e75237a84093d0986f56223db3724619f26e upstream. The recent fix for making the take over of the broadcast timer more reliable retrieves a per CPU pointer in preemptible context. This went unnoticed as compilers hoist the access into the non-preemptible region where the pointer is actually used. But of course it's valid that the compiler keeps it at the place where the code puts it which rightfully triggers: BUG: using smp_processor_id() in preemptible [00000000] code: caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0 Move it to the actual usage site which is in a non-preemptible region. Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable") Reported-by: David Wang <00107082@163.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Yu Liao <liaoyu15@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-14 13:58:55 +02:00
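
Two of the fixes listed above are small enough to illustrate in isolation. The following is a minimal user-space sketch, not the kernel's actual code: the function names, `DEMO_PHASE_LIMIT` (an assumed illustrative value standing in for NTP_PHASE_LIMIT) and `DEMO_INCREMENT` are hypothetical. It contrasts the "||" and "&&" forms of the retry warning condition quoted in the cs_watchdog_read() commit, and shows why clamping a user-supplied value to the operating range before the '+ 500' step avoids signed overflow.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel's real constants are not reproduced here. */
#define DEMO_PHASE_LIMIT 16000000L /* assumed upper bound of the operating range */
#define DEMO_INCREMENT   500L      /* the '+ 500' step from the UBSAN report */

/* Condition before the fix: true for every nretries > 1. */
bool warn_before_fix(int nretries, int max_retries)
{
    return nretries > 1 || nretries >= max_retries;
}

/* Condition after the fix: never true on the first try, and true only
 * when the successful retry was also the last allowed retry. */
bool warn_after_fix(int nretries, int max_retries)
{
    return nretries > 1 && nretries >= max_retries;
}

/* Clamp val into the closed interval [lo, hi]. */
long clamp_long(long val, long lo, long hi)
{
    if (val < lo)
        return lo;
    if (val > hi)
        return hi;
    return val;
}

/* Clamp the user-supplied value first, so that the increment cannot
 * overflow even when the caller passes LONG_MAX. The limit check after
 * the increment then behaves as intended. */
long advance_maxerror(long user_value)
{
    long maxerror = clamp_long(user_value, 0, DEMO_PHASE_LIMIT);

    maxerror += DEMO_INCREMENT;
    if (maxerror > DEMO_PHASE_LIMIT)
        maxerror = DEMO_PHASE_LIMIT;
    return maxerror;
}
```

With a retry budget of 5, `warn_before_fix(2, 5)` is true while `warn_after_fix(2, 5)` is false, which is the spurious warning the commit removes. Calling `advance_maxerror(LONG_MAX)` returns the limit instead of evaluating `LONG_MAX + 500`, which would be undefined behavior for a signed `long` in C.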

43370 Commits