Commit Graph

43370 Commits

Author SHA1 Message Date
Haojie Ning a1574c433d rue/mm: add sysctl_vm_use_priority_oom to enable priority oom for all cgroups
Add sysctl_vm_use_priority_oom as a global setting to enable the
priority_oom setting for all cgroups without the need to manually
set it for each cgroup. This global setting has no effect when it
is turned off.

Signed-off-by: Haojie Ning <paulning@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:32 +08:00
Honglin Li b82ababba6 rue/mm: introduce new feature to async clean dying memcgs
When a memcg is removed, page caches and slab pages may still
reference it, which can leave a very large number of dying
memcgs in our system. This feature cleans up dying memcgs
asynchronously.

1) sysctl -w vm.clean_dying_memcg_async=1
   #start a kthread to async clean dying memcgs, default
   #value is 0.

2) sysctl -w vm.clean_dying_memcg_threshold=10
   #Whenever 10 dying memcgs are generated in the system,
   #wakeup a kthread to async clean dying memcgs, default
   #value is 100.

Signed-off-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 200560da23 rue/mm: introduce memcg page cache hit & miss ratio tool
A new memory.page_cache_hit control file is added
under each memory cgroup directory. Reading this file
prints the page cache hit and miss ratio at the memory
cgroup level.

Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 8de07be077 rue/mm: introduce memory allocation latency for per-cgroup tool
A new memory.latency_histogram control file is added
under each memory cgroup directory. Reading this file
prints the memory allocation latency at the memory cgroup level.

Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 1824581599 rue/mm: async free memory while process exiting
Introduce async free memory while process exiting
to shorten exit time.

Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 75ad2bae3d rue/mm: pagecache limit per cgroup support
Functional test:
http://tapd.oa.com/TencentOS_QoS/prong/stories/view/
1020426664867405667?jump_count=1

Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Xuan Liu <benxliu@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 0d35c4c639 rue/mm: introduce memcg priority oom
Under memory pressure, reclaim and OOM can happen. When
multiple cgroups exist in one system, we might want the
memory or tasks of some of them to survive reclaim and
OOM while there are other candidates.

When OOM happens, the victim is always chosen from a low
priority memcg. This works for both memcg OOM and global
OOM. It can be enabled or disabled through
@memory.use_priority_oom; for global OOM, use the root
memcg's @memory.use_priority_oom. It is disabled by default.

Signed-off-by: Haiwei Li <gerryhwli@tencent.com>
Signed-off-by: Mengmeng Chen <bauerchen@tencent.com>
Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li db44c11cdd rue/mm: add priority reclaim support
Introduce the sync && async priority reclaim mechanism.

Signed-off-by: Yu Liu <allanyuliu@tencent.com>
Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 55f6748cd1 rue/net: adapt to the new rue modular framework
Add to register and unregister rue net ops through
rue modular framework.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 703664bf47 rue/net: add support for cgroup whitelist ports
Introduce the cgroup whitelist ports mechanism.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:30 +08:00
Haisu Wang 0f93976785 rue: Revert "kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()"
Export the two functions again for modules like RUE.

This reverts commit 0bd476e6c6.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:29 +08:00
Ze Gao d5a175186d rue: Add support for rue modularization
Add framework support to enable rue to be installed as
a separate module.

To make insmod/rmmod safe, we use a per-cpu counter to
track how many rue related functions are in flight. It is
only safe to insmod/rmmod when no tasks are using any of
the functions registered by the rue module.
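
The counting idea described above can be sketched in userspace C. This is only an illustration under assumptions: the names (`inflight`, `ops_enter`, `ops_exit`, `ops_idle`) and the fixed CPU count are hypothetical and are not the RUE framework's API, and a real implementation would also have to stop new callers from entering while the counters are being summed.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count for this sketch */

/* One in-flight counter per CPU (illustrative, not the RUE data structure). */
atomic_long inflight[NR_CPUS_SKETCH];

/* Called on entry to a function provided by the module. */
void ops_enter(int cpu) { atomic_fetch_add(&inflight[cpu], 1); }

/* Called on exit; may run on a different CPU than the matching enter. */
void ops_exit(int cpu) { atomic_fetch_sub(&inflight[cpu], 1); }

/* Unloading is only considered safe once the sum over all CPUs is zero. */
bool ops_idle(void)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += atomic_load(&inflight[cpu]);
    return sum == 0;
}
```

Individual per-CPU values may go negative when a task migrates between enter and exit; only the total is meaningful, which is why the check sums every counter.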

Signed-off-by: Ze Gao <zegao@tencent.com>
2024-09-27 11:13:29 +08:00
Hongbo Li 5dc70a633d rue: init rue module
Add the init code of rue module.
Support both built-in and module (default) builds.

Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:29 +08:00
Hongbo Li fce3609ebf rue: cgroup priority
Add cgroup priority.

Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Lei Chen  <lennychen@tencent.com>
Signed-off-by: Yu Liu    <allanyuliu@tencent.com>
2024-09-27 11:13:29 +08:00
Haisu Wang b03afc0d33 Revert "io/tqos: merge buffer io limit series patch from brookxu, and rework some function."
This reverts commit 538ec11bed.

Reverted because the buffer IO function is being refactored.
In TK5 it is unnecessary to keep kABI compatibility by using
the "nodeinfo" in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:24 +08:00
Haisu Wang 3231efb956 Revert "io/tqos: add sysctl_buffer_io_limit switch for buffer io limit."
This reverts commit 4d87de6bb4.

Reverted because the buffer IO function is being refactored.
In TK5 it is unnecessary to keep kABI compatibility by using
the "nodeinfo" in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:21 +08:00
Haisu Wang 24cfc0a666 Revert "cgroup: allow cgroup to split direct io and buffered io into different blkio cgroup"
This reverts commit 71aaa09350.

Reverted because the buffer IO function is being refactored.
In TK5 it is unnecessary to keep kABI compatibility by using
the "nodeinfo" in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:12:41 +08:00
Jianping Liu 64a21c8a25 hung_task,watchdog: set thresh time to 600 seconds
When CONFIG_KASAN is enabled, the kernel runs much slower, so set the
hung_task and soft lockup threshold times to 600 seconds.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-09-05 15:24:07 +08:00
Ze Gao 2e2ffe48c5 rue/scx: Fix cgroupv2 cpu controller regression
Due to the odd behavior of gcc designated initializers, we
have to carefully order the fields inside cpu_cftypes.
Otherwise some important interfaces like cpu.max could
be lost.

Check out the details in [1].

[1]: https://onlinegdb.com/T-AMLp4zw

Fixes: 8c320a09af ("rue/scx: Add cpu.offline to maintain SCHED_BT compatibility")
Fixes: 2b9d28baab ("rue/scx: Add cpu.scx to the cpu cgroup controller")
Reported-by: likexu <likexu@tencent.com>
Signed-off-by: Ze Gao <zegao@tencent.com>
2024-09-03 02:47:04 +00:00
Jianping Liu dbef74015d watchdog: increase watchdog_thresh max value to 300 in debug kernel
If CONFIG_KASAN or CONFIG_KCSAN is enabled, the system runs much
slower. Increase watchdog_thresh's max value to avoid soft lockups
or hung tasks when running heavy test suites.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-08-30 17:19:41 +08:00
Jianping Liu 63e2660c48 config,oc: support WLAN and MTD and more SND drivers
OpenCloud partners want to use wireless cards and sound cards, so
enable the configs to support them.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-08-26 16:33:47 +08:00
Jianping Liu 0569444d2a Merge linux 6.6.47
Conflicts:
	net/sunrpc/svc.c
2024-08-24 09:43:23 +08:00
Jianping Liu 0a76ebf09a Merge linux 6.6.46
Conflicts:
	drivers/platform/x86/intel/ifs/core.c
	drivers/platform/x86/intel/ifs/ifs.h
	kernel/sched/core.c
2024-08-24 09:37:59 +08:00
Jianping Liu d6563b9042 Merge OCK next branch to TK5 master branch 2024-08-23 19:52:09 +08:00
frankjpliu 897ad8fab4 Merge branch 'zegao/scx3' into 'master' (merge request !150)
Add some general scx in-kernel support
5aec0abf10 rue/scx: Kill user tasks in SCHED_EXT when scheduler is gone
a1752a5760 rue/scx: Add readonly sysctl knob kernel.cpu_qos for SCHED_BT compatibility
ed0889e48a rue/scx: Add /proc/bt_stat to maintain SCHED_BT compatibility
8c320a09af rue/scx: Add cpu.offline to maintain SCHED_BT compatibility
2b9d28baab rue/scx: Add cpu.scx to the cpu cgroup controller
576ee0803a rue/scx: Add /proc/scx_stat to do scx cputime accounting
67d151255e rue/scx: Fix lockdep warn on printk with rq lock held
ebf91df4dc rue/scx: Reorder scx_fork_rwsem, cpu_hotplug_lock and scx_cgroup_rwsem
2024-08-23 11:40:38 +00:00
Yongliang Gao 44f5072e76 Revert "sched: adaptive default skew_tick value"
This reverts commit ca7d96bf43.

Maintain consistency and alignment with upstream. This patch is
also not very friendly to virtualization.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-08-23 11:32:30 +00:00
frankjpliu 1541ee2d1b Merge branch 'remotes/origin/huntazhang/cmdlog' into 'master' (merge request !140)
Adapt cmdlog
2024-08-23 11:21:29 +00:00
Jianping Liu f03179c2a4 submodule: update emm and thirdparty/release-drivers
emm update to v0.1.7.2
release-drivers update to v1.0

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-08-20 14:53:00 +08:00
Alexei Starovoitov 63f13eb5d6 bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.
[ Upstream commit 59f2f841179aa6a0899cb9cf53659149a35749b7 ]

syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19 06:04:27 +02:00
Kees Cook ef33f02968 bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
[ Upstream commit 896880ff30866f386ebed14ab81ce1ad3710cfc4 ]

Replace deprecated 0-length array in struct bpf_lpm_trie_key with
flexible array. Found with GCC 13:

../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=]
  207 |                                        *(__be16 *)&key->data[i]);
      |                                                   ^~~~~~~~~~~~~
../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16'
  102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
      |                                                      ^
../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu'
   97 | #define be16_to_cpu __be16_to_cpu
      |                     ^~~~~~~~~~~~~
../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu'
  206 |                 u16 diff = be16_to_cpu(*(__be16 *)&node->data[i]
      |                            ^~~~~~~~~~~
In file included from ../include/linux/bpf.h:7:
../include/uapi/linux/bpf.h:82:17: note: while referencing 'data'
   82 |         __u8    data[0];        /* Arbitrary size */
      |                 ^~~~

And found at run-time under CONFIG_FORTIFY_SOURCE:

  UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49
  index 0 is out of range for type '__u8 [*]'

Changing struct bpf_lpm_trie_key is difficult since it has been used by
userspace. For example, in Cilium:

	struct egress_gw_policy_key {
	        struct bpf_lpm_trie_key lpm_key;
	        __u32 saddr;
	        __u32 daddr;
	};

While direct references to the "data" member haven't been found, there
are static initializers that include the final member. For example,
the "{}" here:

        struct egress_gw_policy_key in_key = {
                .lpm_key = { 32 + 24, {} },
                .saddr   = CLIENT_IP,
                .daddr   = EXTERNAL_SVC_IP & 0Xffffff,
        };

To avoid the build time and run time warnings seen with a 0-sized
trailing array for struct bpf_lpm_trie_key, introduce a new struct
that correctly uses a flexible array for the trailing bytes,
struct bpf_lpm_trie_key_u8. As part of this, include the "header"
portion (which is just the "prefixlen" member), so it can be used
by anything building a bpf_lpm_trie_key that has trailing members that
aren't a u8 flexible array (like the self-test[1]), which is named
struct bpf_lpm_trie_key_hdr.

Unfortunately, C++ refuses to parse the __struct_group() helper, so
it is not possible to define struct bpf_lpm_trie_key_hdr directly in
struct bpf_lpm_trie_key_u8, so we must open-code the union directly.

Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out,
and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment
to the UAPI header directing folks to the two new options.
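
A rough sketch of the layout described above, written with standard C types so it compiles outside the kernel tree. The `_sketch` type names are placeholders chosen for illustration, and the exact UAPI definitions may differ in detail; the point is the open-coded union in front of a flexible array, plus a header-only struct for keys with fixed trailing members.

```c
#include <stddef.h>
#include <stdint.h>

/* Header part: just the prefix length. */
struct lpm_key_hdr_sketch {
    uint32_t prefixlen;
};

/* Key with a flexible u8 array; the union is open-coded by hand
 * instead of being generated by a __struct_group()-style helper. */
struct lpm_key_u8_sketch {
    union {
        struct lpm_key_hdr_sketch hdr;
        uint32_t prefixlen;
    };
    uint8_t data[];   /* flexible array member, arbitrary size */
};

/* A user-defined key with fixed trailing members embeds only the header. */
struct ipv4_pair_key_sketch {
    struct lpm_key_hdr_sketch lpm_key;
    uint32_t saddr;
    uint32_t daddr;
};
```

Because `hdr` and `prefixlen` share the same offset, code can use either name, and the trailing bytes start right after the 4-byte header in both key styles.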

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://paste.debian.net/hidden/ca500597/
Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/ [1]
Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org
Stable-dep-of: 59f2f841179a ("bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19 06:04:27 +02:00
Yafang Shao dd9542ae7c cgroup: Make operations on the cgroup root_list RCU safe
commit d23b5c577715892c87533b13923306acc6243f93 upstream.

At present, when we perform operations on the cgroup root_list, we must
hold the cgroup_mutex, which is a relatively heavyweight lock. In reality,
we can make operations on this list RCU-safe, eliminating the need to hold
the cgroup_mutex during traversal. Modifications to the list only occur in
the cgroup root setup and destroy paths, which should be infrequent in a
production environment. In contrast, traversal may occur frequently.
Therefore, making it RCU-safe would be beneficial.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19 06:04:25 +02:00
Dongli Zhang bcd5148043 genirq/cpuhotplug: Retry with cpu_online_mask when migration fails
commit 88d724e2301a69c1ab805cd74fc27aa36ae529e0 upstream.

When a CPU goes offline, the interrupts affine to that CPU are
re-configured.

Managed interrupts undergo either migration to other CPUs or shutdown if
all CPUs listed in the affinity are offline. The migration of managed
interrupts is guaranteed on x86 because there are interrupt vectors
reserved.

Regular interrupts are migrated to a still online CPU in the affinity mask
or if there is no online CPU to any online CPU.

This works as long as the still online CPUs in the affinity mask have
interrupt vectors available, but in case that none of those CPUs has a
vector available the migration fails and the device interrupt becomes
stale.

This is not any different from the case where the affinity mask does not
contain any online CPU, but there is no fallback operation for this.

Instead of giving up, retry the migration attempt with the online CPU mask
if the interrupt is not managed, as managed interrupts cannot be affected
by this problem.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19 06:04:24 +02:00
David Stevens 20dbad7525 genirq/cpuhotplug: Skip suspended interrupts when restoring affinity
commit a60dd06af674d3bb76b40da5d722e4a0ecefe650 upstream.

irq_restore_affinity_of_irq() restarts managed interrupts unconditionally
when the first CPU in the affinity mask comes online. That's correct during
normal hotplug operations, but not when resuming from S3 because the
drivers are not resumed yet and interrupt delivery is not expected by them.

Skip the startup of suspended interrupts and let resume_device_irqs() deal
with restoring them. This ensures that irqs are not delivered to drivers
during the noirq phase of resuming from S3, after non-boot CPUs are brought
back online.

Signed-off-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19 06:04:24 +02:00
刘诗 89747eb5ee
!206 [next] Zhaoxin : cpufreq: ACPI: add ITMT support when CPPC enabled
Merge pull request !206 from LeoLiu-oc/next-6.6-57-cppc
2024-08-16 01:04:50 +00:00
leoliu-oc c090c94dbb Set ASYM_PACKING Flag on Zhaoxin KH-40000 platform
Set ASYM_PACKING Flag on Zhaoxin KH-40000 platform

Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>
2024-08-15 11:35:28 +08:00
leoliu-oc ab7f82d722 Add kh40000_direct_dma_ops for KH-40000 platform
Add 'kh40000_direct_dma_ops' to replace 'direct_dma_ops' for KH-40000
platform.
For coherent DMA access, memory can be allocated only from the memory node
of the node where the device resides.
For streaming DMA access, add a PCI read operation at the end of DMA
access.

Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>
2024-08-15 11:07:01 +08:00
Yang Yingliang 78f1990b6b sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
commit fe7a11c78d2a9bdb8b50afc278a31ac177000948 upstream.

If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back.

Fixes: 120455c514 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:59:00 +02:00
Yang Yingliang 4c15b20c26 sched/core: Introduce sched_set_rq_on/offline() helper
commit 2f027354122f58ee846468a6f6b48672fff92e9b upstream.

Introduce sched_set_rq_on/offline() helper, so it can be called
in normal or error path simply. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:59:00 +02:00
Yang Yingliang 65727331b6 sched/smt: Fix unbalance sched_smt_present dec/inc
commit e22f910a26cc2a3ac9c66b8e935ef2a7dd881117 upstream.

I got the following warn report while doing stress test:

jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
 <TASK>
 __static_key_slow_dec_cpuslocked+0x16/0x70
 sched_cpu_deactivate+0x26e/0x2a0
 cpuhp_invoke_callback+0x3ad/0x10d0
 cpuhp_thread_fun+0x3f5/0x680
 smpboot_thread_fn+0x56d/0x8d0
 kthread+0x309/0x400
 ret_from_fork+0x41/0x70
 ret_from_fork_asm+0x1b/0x30
 </TASK>

Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.
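
The balancing rule can be shown with a small userspace model. Everything here is a stand-in: `smt_present` is a plain integer rather than a static key, and `cpu_deactivate_sketch()` and its `cpuset_fails` flag are hypothetical names used only to model the failing step.

```c
#include <stdbool.h>

int smt_present;   /* stand-in for the sched_smt_present count */

void smt_present_inc(void) { smt_present++; }
void smt_present_dec(void) { smt_present--; }

/* Returns 0 on success, or -1 if the CPU stays online. */
int cpu_deactivate_sketch(bool cpuset_fails)
{
    smt_present_dec();          /* done early, before the step that can fail */

    if (cpuset_fails) {
        smt_present_inc();      /* roll back so inc/dec stay balanced */
        return -1;
    }
    return 0;
}
```

Without the increment in the error path, each failed offline attempt would lower the count by one, which is how the "negative count" warning above can eventually be reached.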

Fixes: c5511d03ec ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:59:00 +02:00
Yang Yingliang 41d856565d sched/smt: Introduce sched_smt_present_inc/dec() helper
commit 31b164e2e4af84d08d2498083676e7eeaa102493 upstream.

Introduce sched_smt_present_inc/dec() helper, so it can be called
in normal or error path simply. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:59:00 +02:00
Waiman Long 924f788c90 padata: Fix possible divide-by-0 panic in padata_mt_helper()
commit 6d45e1c948a8b7ed6ceddb14319af69424db730c upstream.

We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.

  [   10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
  [   10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
  [   10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
  [   10.017908] Workqueue: events_unbound padata_mt_helper
  [   10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
    :
  [   10.017963] Call Trace:
  [   10.017968]  <TASK>
  [   10.018004]  ? padata_mt_helper+0x39/0xb0
  [   10.018084]  process_one_work+0x174/0x330
  [   10.018093]  worker_thread+0x266/0x3a0
  [   10.018111]  kthread+0xcf/0x100
  [   10.018124]  ret_from_fork+0x31/0x50
  [   10.018138]  ret_from_fork_asm+0x1a/0x30
  [   10.018147]  </TASK>

Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0.  The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.

Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.
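
As a minimal sketch of the guard, assuming a simplified view of how the chunk size is derived: the helper names below are hypothetical, and the real code works on the padata job structures rather than bare integers.

```c
#include <stddef.h>

/* Apply the minimum chunk, then make sure the result is never zero. */
size_t clamp_chunk_size(size_t chunk_size, size_t min_chunk)
{
    if (chunk_size < min_chunk)
        chunk_size = min_chunk;
    if (chunk_size == 0)        /* possible when min_chunk is 0 */
        chunk_size = 1;
    return chunk_size;
}

/* Round x up to a multiple of step; divides by step, so step must be nonzero. */
size_t round_up_to(size_t x, size_t step)
{
    return ((x + step - 1) / step) * step;
}
```

Any later division or rounding by the chunk size is then well defined, whatever the caller passed in.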

Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes: 004ed42638 ("padata: add basic support for multithreaded jobs")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:59 +02:00
Tze-nan Wu a172c7b22b tracing: Fix overflow in get_free_elt()
commit bcf86c01ca4676316557dd482c8416ece8c2e143 upstream.

"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.

Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.

Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".
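
One way to express "stop incrementing at the limit" is a bounded atomic increment. The sketch below uses C11 atomics in userspace; `fetch_inc_unless()` and `claim_slot()` are illustrative names, and the counter here starts at 0, which may not match the kernel's own indexing.

```c
#include <stdatomic.h>

/* Add 1 to *v unless it already equals limit; returns the old value. */
int fetch_inc_unless(atomic_int *v, int limit)
{
    int old = atomic_load(v);

    while (old != limit &&
           !atomic_compare_exchange_weak(v, &old, old + 1))
        ;   /* old is refreshed on failure, so just retry */
    return old;
}

/* Returns the claimed slot index, or -1 when the map is full. */
int claim_slot(atomic_int *next_elt, int max_elts)
{
    int idx = fetch_inc_unless(next_elt, max_elts);

    return idx < max_elts ? idx : -1;
}
```

Since the counter saturates at the limit, repeated failed attempts can no longer wrap it around and hand out slots again.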

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:58 +02:00
Shay Drory 0688cacd0e genirq/irqdesc: Honor caller provided affinity in alloc_desc()
commit edbbaae42a56f9a2b39c52ef2504dfb3fb0a7858 upstream.

Currently, whenever a caller is providing an affinity hint for an
interrupt, the allocation code uses it to calculate the node and copies the
cpumask into irq_desc::affinity.

If the affinity for the interrupt is not marked 'managed' then the startup
of the interrupt ignores irq_desc::affinity and uses the system default
affinity mask.

Prevent this by setting the IRQD_AFFINITY_SET flag for the interrupt in the
allocator, which causes irq_setup_affinity() to use irq_desc::affinity on
interrupt startup if the mask contains an online CPU.

[ tglx: Massaged changelog ]

Fixes: 45ddcecbfa ("genirq: Use affinity hint in irqdesc allocation")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20240806072044.837827-1-shayd@nvidia.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:58 +02:00
Andrey Konovalov d0137ce03f kcov: properly check for softirq context
commit 7d4df2dad312f270d62fecb0e5c8b086c6d7dcfc upstream.

When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
check whether the code is running in the softirq context.  Unfortunately,
in_serving_softirq() is > 0 even when the code is running in the hardirq
or NMI context for hardirqs and NMIs that happened during a softirq.

As a result, if a softirq handler contains a remote coverage collection
section and a hardirq with another remote coverage collection section
happens during handling the softirq, KCOV incorrectly detects a nested
softirq coverage collection section and prints a WARNING, as reported by
syzbot.

This issue was exposed by commit a7f3813e589f ("usb: gadget: dummy_hcd:
Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
hrtimer and made the timer's callback be executed in the hardirq context.

Change the related checks in KCOV to account for this behavior of
in_serving_softirq() and make KCOV ignore remote coverage collection
sections in the hardirq and NMI contexts.

This prevents the WARNING printed by syzbot but does not fix the inability
of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
is in use (caused by a7f3813e589f); a separate patch is required for that.
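
The difference between the two checks can be modeled with a few context bits. The `CTX_*` flags and function names below are made up for this sketch; the kernel derives the equivalent information from preempt_count() rather than from a flags word like this.

```c
#include <stdbool.h>

/* Illustrative context bits (not the kernel's preempt_count layout). */
#define CTX_SOFTIRQ (1u << 0)
#define CTX_HARDIRQ (1u << 1)
#define CTX_NMI     (1u << 2)

/* Looser check: true whenever a softirq is being served,
 * even if a hardirq or NMI has interrupted it. */
bool serving_softirq(unsigned int ctx)
{
    return (ctx & CTX_SOFTIRQ) != 0;
}

/* Stricter check: softirq context with no hardirq or NMI on top of it. */
bool really_in_softirq(unsigned int ctx)
{
    return serving_softirq(ctx) && !(ctx & (CTX_HARDIRQ | CTX_NMI));
}
```

With the stricter predicate, a hardirq that fires during softirq handling is no longer mistaken for a nested softirq section.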

Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
Fixes: 5ff3b30ab5 ("kcov: collect coverage from interrupts")
Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
Acked-by: Marco Elver <elver@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marcello Sylvester Bauer <sylv@sylv.io>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:57 +02:00
Thomas Gleixner 65d76c0aa2 timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
commit 5916be8a53de6401871bdd953f6c60237b47d6d3 upstream.

The addition of the bases argument to clock_was_set() fixed up all call
sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME
instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0.

As a result the effect of that clock_was_set() notification is incomplete
and might result in timers expiring late because the hrtimer code does
not re-evaluate the affected clock bases.

Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code
which clock bases need to be re-evaluated.
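
The class of bug is a clock ID being passed where a bitmask of clock bases is expected. In the simplified model below, the base indices, `SET_WALL_MASK`, and `bases_reevaluated()` are all assumptions made for illustration and do not reproduce the kernel's definitions.

```c
/* Illustrative base indices; values chosen for this sketch only. */
enum { BASE_MONOTONIC, BASE_REALTIME, BASE_BOOTTIME, BASE_TAI, NR_BASES };

#define CLOCK_ID_REALTIME 0u   /* a clock ID, not a mask */
#define SET_WALL_MASK ((1u << BASE_REALTIME) | (1u << BASE_TAI))

/* Count how many clock bases a notification with this mask would touch. */
int bases_reevaluated(unsigned int mask)
{
    int n = 0;

    for (int b = 0; b < NR_BASES; b++)
        if (mask & (1u << b))
            n++;
    return n;
}
```

In this model, passing the clock ID (value 0) selects no bases at all, while the proper mask selects the wall-clock related ones.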

Fixes: 17a1b8826b ("hrtimer: Add bases argument to clock_was_set()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:57 +02:00
Justin Stitt ae5848cb5b ntp: Safeguard against time_constant overflow
commit 06c03c8edce333b9ad9c6b207d93d3a5ae7c10c0 upstream.

Using syzkaller with the recently reintroduced signed integer overflow
sanitizer produces this UBSAN report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
9223372036854775806 + 4 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 __do_adjtimex+0x1236/0x1440
 do_adjtimex+0x2be/0x740

The user supplied time_constant value is incremented by four and then
clamped to the operating range.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping after incrementing which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 4' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Similar to the fixups for time_maxerror and time_esterror, clamp the user
space supplied value to the operating range.
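
A small sketch of why clamping before the addition matters. `MAXTC_SKETCH` is an assumed upper bound for the operating range, the helper names are hypothetical, and the unconditional `+ 4` is a simplification of the kernel logic.

```c
#include <limits.h>

#define MAXTC_SKETCH 10L   /* assumed upper bound of the operating range */

long clamp_long(long v, long lo, long hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Clamp first, then add: the addition can no longer overflow. */
long safe_time_constant(long user_value)
{
    long tc = clamp_long(user_value, 0, MAXTC_SKETCH);

    tc += 4;
    return clamp_long(tc, 0, MAXTC_SKETCH);
}
```

Clamping only after the addition would be too late for inputs near LONG_MAX, because the signed overflow happens in the addition itself.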

[ tglx: Switch to clamping ]

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:56 +02:00
Paul E. McKenney 9d6193fd91 clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
[ Upstream commit f2655ac2c06a15558e51ed6529de280e1553c86e ]

The current "nretries > 1 || nretries >= max_retries" check in
cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if
nretries is greater than 1.  The intent is instead to never warn on the
first try, but otherwise warn if the successful retry was the last retry.

Therefore, change that "||" to "&&".
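
The two conditions can be compared directly. The function names are illustrative; only the boolean expressions are taken from the text above.

```c
#include <stdbool.h>

/* Old form: warns on every successful read that needed more than one try. */
bool warn_old(int nretries, int max_retries)
{
    return nretries > 1 || nretries >= max_retries;
}

/* Fixed form: never warns on the first try, and otherwise warns only
 * when the successful retry was the last one allowed. */
bool warn_fixed(int nretries, int max_retries)
{
    return nretries > 1 && nretries >= max_retries;
}
```

For example, with a limit of 5 a read that succeeded after 2 tries warns under the old condition but not under the fixed one.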

Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:58:56 +02:00
Feng Tang 03c3855528 clocksource: Scale the watchdog read retries automatically
[ Upstream commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9 ]

On an 8-socket server the TSC is wrongly marked as 'unstable' and disabled
during boot time on about one out of 120 boot attempts:

    clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
    wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
    tsc: Marking TSC unstable due to clocksource watchdog
    TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
    sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
    clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
    clocksource: Switched to clocksource hpet

The reason is that on platforms with a large number of CPUs, there are
sporadic large read latencies while reading the watchdog/clocksource
during boot or when the system is under a stress workload, and the
frequency and maximum value of the latency go up with the number of
online CPUs.

The current code already has logic to detect and filter such high latency
cases by reading the watchdog twice and checking the two deltas. Due to the
randomness of the latency, there is a low probability that the first delta
(latency) is big, but the second delta is small and looks valid. The
watchdog code retries the readouts by default twice, which is not
necessarily sufficient for systems with a large number of CPUs.

There is a command line parameter 'max_cswd_read_retries' which allows
increasing the number of retries, but that's not user friendly as it needs
to be tweaked per system. As the number of required retries is proportional
to the number of online CPUs, this parameter can be calculated at runtime.

Scale and enlarge the number of retries according to the number of online
CPUs and remove the command line parameter completely.
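
One way to derive the retry count at runtime is sketched below. The
message does not spell out the formula, so the logarithmic scaling and the
names sketch_ilog2() and sketch_max_retries() are assumptions for
illustration only:

```c
/* Hypothetical helper: integer log2, returns 0 for n <= 1. */
static unsigned int sketch_ilog2(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Assumed scaling: at least one retry, growing slowly with CPU count. */
static unsigned int sketch_max_retries(unsigned int online_cpus)
{
	return sketch_ilog2(online_cpus) / 2 + 1;
}
```

A slowly growing function keeps the retry count small on ordinary
machines while still giving large multi-socket systems more attempts.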

[ tglx: Massaged change log and comments ]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jin Wang <jin1.wang@intel.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com
Stable-dep-of: f2655ac2c06a ("clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:58:56 +02:00
Justin Stitt b5cf99eb7a ntp: Clamp maxerror and esterror to operating range
[ Upstream commit 87d571d6fb77ec342a985afa8744bb9bb75b3622 ]

Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
9223372036854775807 + 500 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 second_overflow+0x2d6/0x500
 accumulate_nsecs_to_secs+0x60/0x160
 timekeeping_advance+0x1fe/0x890
 update_wall_time+0x10/0x30

time_maxerror is unconditionally incremented and the result is checked
against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting
in wrap-around to negative space.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping in handle_overflow() which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 500' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Miroslav confirmed that the input value should be clamped to the operating
range and the same applies to time_esterror. The latter is not used by the
kernel, but the value still should be in the operating range as it was
before the sanity check got removed.

Clamp them to the operating range.
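
The failure mode and the effect of clamping can be sketched in userspace.
The limit value and the names sketch_clamp(), sketch_set_maxerror() and
sketch_increment_overflows() are assumptions for illustration;
__builtin_add_overflow() is a GCC/Clang builtin:

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PHASE_LIMIT 16000000L	/* assumed operating range limit */
#define SKETCH_STEP 500L		/* increment seen in the report */

/* Hypothetical helper: restrict v to the range [lo, hi]. */
static long sketch_clamp(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Store the user supplied value only after clamping it. */
static long sketch_set_maxerror(long user_value)
{
	return sketch_clamp(user_value, 0, SKETCH_PHASE_LIMIT);
}

/* True if 'maxerror + 500' cannot be represented in a long. */
static bool sketch_increment_overflows(long maxerror)
{
	long result;

	return __builtin_add_overflow(maxerror, SKETCH_STEP, &result);
}
```

An unclamped LONG_MAX makes the increment overflow, whereas the clamped
value leaves ample headroom for the '+ 500' step.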

[ tglx: Changed it to clamping and included time_esterror ]

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:58:56 +02:00
Thomas Gleixner b9d604933d tick/broadcast: Move per CPU pointer access into the atomic section
commit 6881e75237a84093d0986f56223db3724619f26e upstream.

The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.

This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:

  BUG: using smp_processor_id() in preemptible [00000000] code:
       caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0

Move it to the actual usage site which is in a non-preemptible region.

Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14 13:58:55 +02:00