OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jianping Liu	a899c9ea34	drivers,ps3stor: fix compile error when using allyesconfig config There is a compile error when run: make allyesconfig && make -j32 The compile error log is as below: ld: error: unplaced orphan section `.GCC.command.line' from `vmlinux.o'. or aarch64-linux-gnu-ld: error: unplaced orphan section `.GCC.command.line' from `drivers/scsi/linkdata/ps3stor/ps3_cmd_channel.o' ...... .GCC.command.line section is created by -frecord-gcc-switches compile option. The info about -frecord-gcc-switches option: This switch causes the command line that was used to invoke the compiler to be recorded into the object file that is being created. This switch is only implemented on some targets and the exact format of the recording is target and binary file format dependent, but it usually takes the form of a section containing ASCII text. -frecord-gcc-switches option is useless in release version, delete it. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-12-13 20:45:28 +08:00
liujie_answer	9b08da6364	scsi: Solve the problem of duplicate definition of first_online_pgdat and next_online_pgdat functions in ps3stor and other modules category: bugfix Rename first_online_pgdat to ps3_first_online_pgdat Rename next_online_pgdat to ps3_next_online_pgdat Add for_each_ps3_online_pgdat definition to replace for_each_online_pgdat Verified on x86 and arm64 Signed-off-by: liujie5@linkdatatechnology.com	2024-12-12 15:31:05 +08:00
Jianping Liu	dd8418c96c	dist: release 5.4.241-30.0017.17 Upstream: no Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-12-09 19:42:29 +08:00
mengensun	2ed2e1edbc	mm: make concurrent-accessing of pagetypeinfo queued on a mutex lock If several processes access pagetypeinfo concurrently, they all queue on the zone spinlock. Once a process gets the zone spinlock, it holds it for tens of ms, which is not friendly to other processes allocating or freeing pages. For now, make concurrent accesses to pagetypeinfo queue on a mutex lock first, before getting the zone spinlock. On an INTEL(R) XEON(R) PLATINUM 8576C with 2 NUMA nodes, a total of 1TB of main memory, and 224 CPUs, using the test case below, we get the following test result: test case: for i in $(seq 1 1 5); do taskset -c $i ./a.sh & done where a.sh is as follows: while : do cat /proc/pagetypeinfo >/dev/null done watch pagetypeinfo_showfree_print using bcc /usr/share/bcc/tools/funclatency pagetypeinfo_showfree_print before patch(all time in ns): 4096 -> 8191 : 309 8192 -> 16383 : 8 16384 -> 32767 : 17 32768 -> 65535 : 299 65536 -> 131071 : 1 131072 -> 262143 : 0 262144 -> 524287 : 0 524288 -> 1048575 : 0 1048576 -> 2097151 : 0 2097152 -> 4194303 : 0 4194304 -> 8388607 : 0 8388608 -> 16777215 : 297 16777216 -> 33554431 : 83 33554432 -> 67108863 : 79 67108864 -> 134217727 : 102 134217728 -> 268435455 : 63 268435456 -> 536870911 : 5 after patch(all time in ns): 4096 -> 8191 : 161 8192 -> 16383 : 8 16384 -> 32767 : 1 32768 -> 65535 : 170 65536 -> 131071 : 0 131072 -> 262143 : 0 262144 -> 524287 : 0 524288 -> 1048575 : 0 1048576 -> 2097151 : 0 2097152 -> 4194303 : 0 4194304 -> 8388607 : 0 8388608 -> 16777215 : 170 16777216 -> 33554431 : 171 => make function latency from 5s to 30ms Reviewed-by: caelli <caelli@tencent.com> Reviewed-by: alexjlzheng <alexjlzheng@tencent.com> Signed-off-by: mengensun <mengensun@tencent.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-12-09 19:38:18 +08:00
Aurelianliu	3b2e445dc1	xfs: modify mount recovery If recovery fails, do not wait for I/O to complete. In 6.6 or 6.11, shutdown is used to force I/O, so XFS does not stall but cannot mount successfully. Here, I/O is forced only when no error happens; after mount, XFS will flush I/O. Signed-off-by: Aurelianliu <aurelianliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-12-09 19:38:09 +08:00
Jianping Liu	e16b1f5b7b	config,arm64: enable CONFIG_DRM_HISI_HIBMC and CONFIG_DRM_HISI_KIRIN --bug=1020426283134913779 Troubleshooting the issue of inaccurate graphical interface resolution when installing aarch64 architecture ISO on Huawei servers.	2024-12-09 19:31:47 +08:00
liujie_answer	843f1ef744	scsi: linkdata ps3stor compilation optimization category: feature Add a Makefile and Kconfig in the drivers/scsi/linkdata directory; drivers/scsi/Makefile and drivers/scsi/Kconfig call the Makefile and Kconfig in the drivers/scsi/linkdata directory respectively. Verified on x86 and arm64 Signed-off-by: liujie5@linkdatatechnology.com	2024-12-09 19:31:47 +08:00
Jianping Liu	33b13d3cee	config,arm64: update tencent.config without manual change Update tencent.config with the following commands: make tencentconfig make savedefconfig mv defconfig arch/arm64/configs/tencent.config	2024-12-06 12:47:08 +08:00
Jianping Liu	b7a235df77	Merge branch linux-5.4/devel into linux-5.4/lts/5.4.241-30.0017	2024-12-05 11:14:05 +08:00
Jianping Liu	3c39c5a4fc	drivers,3snic: support incremental compilation 3snic's Makefile uses "SYS_TIME=$(shell date +%Y-%m-%d_%H:%M:%S)" and puts it into the driver version. This prevents 3snic from supporting incremental compilation. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> Reviewed-by: Yongliang Gao <leonylgao@tencent.com>	2024-12-05 11:10:57 +08:00
chinaljp030	c45bbd064f	!275 [devel-5.4] linkdata: add ps3stor driver support Merge pull request !275 from liujie_answer/linux-5.4/devel	2024-12-05 02:58:29 +00:00
liujie_answer	ddd836946e	scsi: add support for linkdata HBA/RAID Controller driver category: feature Verified on x86 and arm64 Signed-off-by: liujie5@linkdatatechnology.com	2024-12-05 09:58:17 +08:00
Yonghong Song	7140f9e1da	selftests/bpf: Fix pyperf180 compilation failure with clang18 commit 100888fb6d8a185866b1520031ee7e3182b173de upstream. With latest clang18 (main branch of llvm-project repo), when building bpf selftests, [~/work/bpf-next (master)]$ make -C tools/testing/selftests/bpf LLVM=1 -j The following compilation error happens: fatal error: error in backend: Branch target out of insn range ... Stack dump: 0. Program arguments: clang -g -Wall -Werror -D__TARGET_ARCH_x86 -mlittle-endian -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/include -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -I/home/yhs/work/bpf-next/tools/include/uapi -I/home/yhs/work/bpf-next/tools/testing/selftests/usr/include -idirafter /home/yhs/work/llvm-project/llvm/build.18/install/lib/clang/18/include -idirafter /usr/local/include -idirafter /usr/include -Wno-compare-distinct-pointer-types -DENABLE_ATOMICS_TESTS -O2 --target=bpf -c progs/pyperf180.c -mcpu=v3 -o /home/yhs/work/bpf-next/tools/testing/selftests/bpf/pyperf180.bpf.o 1. <eof> parser at end of file 2. Code generation ... The compilation failure only happens to cpu=v2 and cpu=v3. cpu=v4 is okay since cpu=v4 supports 32-bit branch target offset. The above failure is due to upstream llvm patch [1] where some inlining behavior are changed in clang18. To workaround the issue, previously all 180 loop iterations are fully unrolled. The bpf macro __BPF_CPU_VERSION__ (implemented in clang18 recently) is used to avoid unrolling changes if cpu=v4. If __BPF_CPU_VERSION__ is not available and the compiler is clang18, the unrollng amount is unconditionally reduced. [1] `1a2e77cf9e` Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-12-03 14:19:10 +08:00
Jianping Liu	aaf99ade66	dist: release 5.4.241-30.0017.16 Upstream: no Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-29 19:19:13 +08:00
Jianping Liu	b4cb2fb5d6	drivers,wangxun: fix compile error when using allyesconfig The symbol _kc_eth_hw_addr_set has been redefined, fix the error as below: drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o: in function '_kc_eth_hw_addr_set': txgbe_kcompat.c:(.text+0x100): _kc_eth_hw_addr_set multiple definition drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0x100): first defined here Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-29 19:12:52 +08:00
Menglong Dong	8bec235a25	libbpf: backport BTF_KIND_* from upstream In order to fix the compile error in tools/testing/selftests/bpf, backport all definitions of BTF_KIND_* from upstream. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Bin Lai <robinlai@tencent.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-29 19:05:59 +08:00
Jianping Liu	89c81c1144	drivers: fix compile error when using allyesconfig The symbol _kc_pci_get_dsn has been redefined, fix the error as below: aarch64-linux-gnu-ld: drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o: in function `_kc_pci_get_dsn': txgbe_kcompat.c:(.text+0x0): multiple definition of `_kc_pci_get_dsn'; drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0x0): first defined here aarch64-linux-gnu-ld: drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o: in function `_kc_eth_hw_addr_set': txgbe_kcompat.c:(.text+0xe4): multiple definition of `_kc_eth_hw_addr_set'; drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0xe4): first defined here Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-29 09:09:23 +08:00
Jianping Liu	6642e166dd	dist: release 5.4.241-30.0017.15 Upstream: no Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 22:03:05 +08:00
Huang Cun	4e4a7cdd4c	crashkernel: give 2M default reserve memory to pstore Divide the crashkernel reserved memory into two parts; the end part is given to pstore. This supports keeping the last kernel log before a reboot or reset. pstore_size=xM can be configured on the cmdline, but pstore_addr cannot be configured from the cmdline. Signed-off-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:59:34 +08:00
Alexander Lobakin	6cc0b996a3	net: fix premature exit from NAPI state polling in napi_disable() commit `0315a075f1` upstream. Commit `719c571970` ("net: make napi_disable() symmetric with enable") accidentally introduced a bug sometimes leading to a kernel BUG when bringing an iface up/down under heavy traffic load. Prior to this commit, napi_disable() was polling n->state until none of (NAPIF_STATE_SCHED \| NAPIF_STATE_NPSVC) is set and then always flip them. Now there's a possibility to get away with the NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() call with an uninitialized variable, rather than straight to another round of the state check. Error path looks like: napi_disable(): unsigned long val, new; /* new is uninitialized / do { val = READ_ONCE(n->state); / NAPIF_STATE_NPSVC and/or NAPIF_STATE_SCHED is set / if (val & (NAPIF_STATE_SCHED \| NAPIF_STATE_NPSVC)) { / true / usleep_range(20, 200); continue; / go straight to the condition check / } new = val \| <...> } while (cmpxchg(&n->state, val, new) != val); / state == val, cmpxchg() writes garbage / napi_enable(): do { val = READ_ONCE(n->state); BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); / 50/50 boom */ <...> while the typical BUG splat is like: [ 172.652461] ------------[ cut here ]------------ [ 172.652462] kernel BUG at net/core/dev.c:6937! 
[ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42 [ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0 [ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48 [ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246 [ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0 < snip > [ 172.782403] Call Trace: [ 172.784857] <TASK> [ 172.786963] ice_up_complete+0x6f/0x210 [ice] [ 172.791349] ice_xdp+0x136/0x320 [ice] [ 172.795108] ? ice_change_mtu+0x180/0x180 [ice] [ 172.799648] dev_xdp_install+0x61/0xe0 [ 172.803401] dev_xdp_attach+0x1e0/0x550 [ 172.807240] dev_change_xdp_fd+0x1e6/0x220 [ 172.811338] do_setlink+0xee8/0x1010 [ 172.814917] rtnl_setlink+0xe5/0x170 [ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10 [ 172.823732] ? security_capable+0x36/0x50 < snip > Fix this by replacing 'do { } while (cmpxchg())' with an "infinite" for-loop with an explicit break. From v1 [0]: - just use a for-loop to simplify both the fix and the existing code (Eric). [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com Fixes: `719c571970` ("net: make napi_disable() symmetric with enable") Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:59:16 +08:00
Jakub Kicinski	3cad448421	net: make napi_disable() symmetric with enable commit `719c571970` upstream. I remove one line which is about THREADED and BUSY_POLL because we don't have two clear operation lines before. Commit `3765996e4f` ("napi: fix race inside napi_enable") fixed an ordering bug in napi_enable() and made the napi_enable() diverge from napi_disable(). The state transitions done on disable are not symmetric to enable. There is no known bug in napi_disable() this is just refactoring. Eric suggests we can also replace msleep(1) with a more opportunistic usleep_range(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:59:16 +08:00
Xuan Zhuo	7094bec5e1	napi: fix race inside napi_enable commit `3765996e4f` upstream. In this patch, I intentionally remove two lines about the NAPI_STATE_THREADED flag because we haven't backported that part. The process will cause napi.state to contain NAPI_STATE_SCHED and not in the poll_list, which will cause napi_disable() to get stuck. The prefix "NAPI_STATE_" is removed in the figure below, and NAPI_STATE_HASHED is ignored in napi.state. CPU0 \| CPU1 \| napi.state =============================================================================== napi_disable() \| \| SCHED \| NPSVC napi_enable() \| \| { \| \| smp_mb__before_atomic(); \| \| clear_bit(SCHED, &n->state); \| \| NPSVC \| napi_schedule_prep() \| SCHED \| NPSVC \| napi_poll() \| \| napi_complete_done() \| \| { \| \| if (n->state & (NPSVC \| \| (1) \| _BUSY_POLL))) \| \| return false; \| \| ................ \| \| } \| SCHED \| NPSVC \| \| clear_bit(NPSVC, &n->state); \| \| SCHED } \| \| \| \| napi_schedule_prep() \| \| SCHED \| MISSED (2) (1) Here return direct. Because of NAPI_STATE_NPSVC exists. (2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list Since NAPI_STATE_SCHED already exists and napi is not in the sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always exist. 1. This will cause this queue to no longer receive packets. 2. If you encounter napi_disable under the protection of rtnl_lock, it will cause the entire rtnl_lock to be locked, affecting the overall system. This patch uses cmpxchg to implement napi_enable(), which ensures that there will be no race due to the separation of clear two bits. Fixes: `2d8bff1269` ("netpoll: Close race condition between poll_one_napi and napi_disable") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:59:16 +08:00
Yongliang Gao	784345b5ac	rtc: retry to read rtc time if it fails We add a retry here if the __rtc_read_time call fails or the rtc_tm_to_ktime result is KTIME_MAX. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:58:57 +08:00
Yongliang Gao	6ddc70a757	rtc: show rtc time upon read or time conversion failure We can show the rtc_time information when it fails, which is more helpful for problem localization. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:58:12 +08:00
Yongliang Gao	8a41406b2d	rtc: fix the issue of missing pm_relax Before queue rtc irq work, pm_stay_awake is called, so pm_relax needs to be called before return. Fixes: 9721d939cb8c ("rtc: check if rtc_tm_to_ktime was successful in rtc_timer_do_work()") Fixes: 3ab026fe94f2 ("rtc: check if __rtc_read_time was successful in rtc_timer_do_work()") Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:58:12 +08:00
Jianping Liu	5401d9702f	dist: release 5.4.241-30-0017.14 Upstream: no Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:04 +08:00
Yongliang Gao	6cf46c359c	rtc: check if rtc_tm_to_ktime was successful in rtc_timer_do_work() The struct rtc_time tm may contain a date/time read from the RTC hardware that is far away from now, even though __rtc_read_time returns success. When rtc_tm_to_ktime is called later, the result may be a very large value, KTIME_MAX. If there are periodic timers in rtc->timerqueue, they will continually expire, possibly causing a kernel softlockup. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:04 +08:00
Haisu Wang	083e3422ee	rue/io: fix child blkcg of hier buffered write can not exceed 2MB When using the hierarchical buffered_write_bps function, unless buffered_write_bps is explicitly set to 0 or a higher value in the child blkcg, the child group can't exceed 2MB by default. --story=132623821 "child cgroup can not exceed 2MB" Fixes: 126af8f8e346 ("rue/io: buffered_write_bps hierarchy support") Reported-by: Zhijian Xu <zhijianxu@tencent.com> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Huang Cun	dd461732f2	crashkernel: auto adjust crashkernel min size to 800MB for KASAN Signed-off-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
linuszeng	473e7516fa	Revert "toa add net namespace from csig" This reverts commit 46146c6fa5fbf18d29bf08c3533138bd74a336a0. This patch will cause TOA component authentication to fail. Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Ze Gao	c0ddad1e29	sched, qos: Fix OOB on switching SCHED_BT to rt Since SCHED_BT modifies p->static_prio and set_load_weight() is called unconditionally during __setscheduler_params, any attempt to set a policy other than SCHED_BT on SCHED_BT tasks ought to restore what p->static_prio originally stood for. Currently only switching to SCHED_FAIR is considered, whereas the others are not. Fix it by resetting p->static_prio before doing set_load_weight() for the rest. Note this is an integrity fix, and policy changes should only be allowed between SCHED_FAIR and SCHED_BT. Signed-off-by: Ze Gao <zegao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Haisu Wang	419302b0cf	rue/io: do not check sysctl_io_qos_enabled for throttle hierarchy Since parent_sq is linked in throtl_pd_init(), switching sysctl_io_qos_enabled has no effect. So remove the sysctl_io_qos_enabled condition. Signed-off-by: Haisu Wang <haisuwang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Yongliang Gao	71e32887e5	rtc: check if __rtc_read_time was successful in rtc_timer_do_work() upstream: upstreaming If the __rtc_read_time call fails, the struct rtc_time tm may contain uninitialized data, or an illegal date/time read from the RTC hardware. When rtc_tm_to_ktime is called later, the result may be a very large value (possibly KTIME_MAX). If there are periodic timers in rtc->timerqueue, they will continually expire, possibly causing a kernel softlockup. Fixes: `6610e0893b` ("RTC: Rework RTC code to use timerqueue for events") Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Acked-by: Jingqun Li <jingqunli@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Alexandre Belloni	33d2b1a1e6	rtc: disallow update interrupts when time is invalid commit `3e74ddaa7c` upstream. Never enable update interrupts when the time set on the rtc is invalid. In that case, also avoid enabling the emulation because it will fail for the same reason. Link: https://lore.kernel.org/r/20191021155631.3342-2-alexandre.belloni@bootlin.com Link: https://lore.kernel.org/r/CA+ASDXMarBG5C1Kz42B9i_iVZ1=i6GgH9Yja2cdmSueKD_As_g@mail.gmail.com Reported-by: Jeffy Chen <jeffy.chen@rock-chips.com> Reported-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Darrick J. Wong	26d2f312d3	xfs: verify buffer contents when we skip log replay commit `22ed903eee` upstream. syzbot detected a crash during log recovery: XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791 XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200. XFS (loop0): Starting recovery (logdev: internal) ================================================================== BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813 Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074 CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:306 print_report+0x107/0x1f0 mm/kasan/report.c:417 kasan_report+0xcd/0x100 mm/kasan/report.c:517 xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813 xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913 xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713 xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953 xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946 xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930 xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493 xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829 xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933 xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666 get_tree_bdev+0x400/0x620 fs/super.c:1282 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] 
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f89fa3f4aca Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10 RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004 R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50 </TASK> The fuzzed image contains an AGF with an obviously garbage agf_refcount_level value of 32, and a dirty log with a buffer log item for that AGF. The ondisk AGF has a higher LSN than the recovered log item. xlog_recover_buf_commit_pass2 reads the buffer, compares the LSNs, and decides to skip replay because the ondisk buffer appears to be newer. Unfortunately, the ondisk buffer is corrupt, but recovery just read the buffer with no buffer ops specified: error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno, buf_f->blf_len, buf_flags, &bp, NULL); Skipping the buffer leaves its contents in memory unverified. This sets us up for a kernel crash because xfs_refcount_recover_cow_leftovers reads the buffer (which is still around in XBF_DONE state, so no read verification) and creates a refcountbt cursor of height 32. This is impossible so we run off the end of the cursor object and crash. Fix this by invoking the verifier on all skipped buffers and aborting log recovery if the ondisk buffer is corrupt. It might be smarter to force replay the log item atop the buffer and then see if it'll pass the write verifier (like ext4 does) but for now let's go with the conservative option where we stop immediately. 
Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: aurelianliu <aurelianliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Gao Xiang	305ee9b3ba	xfs: fix forkoff miscalculation related to XFS_LITINO(mp) commit `ada49d64fb` upstream. Currently, commit `e9e2eae89d` dropped a (int) decoration from XFS_LITINO(mp), and since sizeof() expression is also involved, the result of XFS_LITINO(mp) is simply as the size_t type (commonly unsigned long). Considering the expression in xfs_attr_shortform_bytesfit(): offset = (XFS_LITINO(mp) - bytes) >> 3; let "bytes" be (int)340, and "XFS_LITINO(mp)" be (unsigned long)336. on 64-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffffffffffcUL >> 3) = -1 but on 32-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffcUL >> 3) = 0x1fffffff instead. so offset becomes a large positive number on 32-bit platform, and cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0. Therefore, one result is "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));" assertion failure in xfs_idata_realloc(), which was also the root cause of the original bugreport from Dennis, see: https://bugzilla.redhat.com/show_bug.cgi?id=1894177 And it can also be manually triggered with the following commands: $ touch a; $ setfattr -n user.0 -v "`seq 0 80`" a; $ setfattr -n user.1 -v "`seq 0 80`" a on 32-bit platform. Fix the case in xfs_attr_shortform_bytesfit() by bailing out "XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading comment together with this bugfix suggested by Darrick. It seems the other users of XFS_LITINO(mp) are not impacted. Fixes: commit `e9e2eae89d` ("xfs: only check the superblock version for dinode size calculation") Cc: <stable@vger.kernel.org> # 5.7+ Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. 
Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: aurelianliu <aurelianliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Yongliang Gao	3347ea435a	Revert "netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test" This reverts commit 50c07f202106c32b8658cf3a036cf10a617f38f1. That patch fixes a race condition, but the synchronize_rcu() added to the swap function unnecessarily slows it down. The patch that fixed the performance regression then introduced a UAF (Use After Free) issue: 961858ce4471 netfilter: ipset: fix performance regression in swap operation So it is being reverted along with the others. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Yongliang Gao	cad23b9f88	Revert "netfilter: ipset: fix performance regression in swap operation" This reverts commit 961858ce4471afb8fe13fcd4e6a1de03bbb877b0. According to: 4e7aaa6b82d6 netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type There is a race condition between namespace cleanup in ipset and the garbage collection of the list:set type. The namespace cleanup can destroy the list:set type of sets while the gc of the set type is waiting to run in rcu cleanup. The latter uses data from the destroyed set which thus leads use after free. Therefore, we will first revert this patch to avoid introducing more problems. Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Jianping Liu <frankjpliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Peter Xu	ce9bdd85ff	mm/hugetlb: fix missing hugetlb_lock for resv uncharge commit b76b46902c2d0395488c8412e1116c2486cdfcb2 upstream There is a recent report on UFFDIO_COPY over hugetlb: https://lore.kernel.org/all/000000000000ee06de0616177560@google.com/ 350: lockdep_assert_held(&hugetlb_lock); Should be an issue in hugetlb but triggered in an userfault context, where it goes into the unlikely path where two threads modifying the resv map together. Mike has a fix in that path for resv uncharge but it looks like the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd() will update the cgroup pointer, so it requires to be called with the lock held. Conflicts: mm/hugetlb.c Link: https://lkml.kernel.org/r/20240417211836.2742593-3-peterx@redhat.com Fixes: `79aa925bf2` ("hugetlb_cgroup: fix reservation accounting") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: syzbot+4b8077a5fccc61c385a1@syzkaller.appspotmail.com Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:03 +08:00
Miaohe Lin	9952b570b2	hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings commit `d85aecf284` upstream The current implementation of hugetlb_cgroup for shared mappings could have different behavior. Consider the following two scenarios: 1.Assume initial css reference count of hugetlb_cgroup is 1: 1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference count is 2 associated with 1 file_region. 1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference count is 3 associated with 2 file_region. 1.3 coalesce_file_region will coalesce these two file_regions into one. So css reference count is 3 associated with 1 file_region now. 2.Assume initial css reference count of hugetlb_cgroup is 1 again: 2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference count is 2 associated with 1 file_region. Therefore, we might have one file_region while holding one or more css reference counts. This inconsistency could lead to imbalanced css_get() and css_put() pair. If we do css_put one by one (i.g. hole punch case), scenario 2 would put one more css reference. If we do css_put all together (i.g. truncate case), scenario 1 will leak one css reference. The imbalanced css_get() and css_put() pair would result in a non-zero reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed __but__ associated resource is not freed. This might result in OOM or can not create a new hugetlb cgroup in a busy workload ultimately. In order to fix this, we have to make sure that one file_region must hold exactly one css reference. So in coalesce_file_region case, we should release one css reference before coalescence. Also only put css reference when the entire file_region is removed. The last thing to note is that the caller of region_add() will only hold one reference to h_cg->css for the whole contiguous reservation region. 
But this area might be scattered when there are already some file_regions reside in it. As a result, many file_regions may share only one h_cg->css reference. In order to ensure that one file_region must hold exactly one css reference, we should do css_get() for each file_region and release the reference held by caller when they are done. [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: `075a61d07a` ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR) Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:02 +08:00
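The invariant described in this message (one file_region holds exactly one css reference) can be modelled with two counters. This is a user-space sketch only: `struct resv_map_model` and the `model_*` helpers are hypothetical names, not kernel APIs, and no real reference counting or locking is involved.

```c
#include <assert.h>

/* Hypothetical model: counts css references and file_regions, nothing more. */
struct resv_map_model {
	int css_refs;   /* references held on the cgroup's css */
	int nr_regions; /* file_regions currently in the reservation map */
};

/* Adding a region takes one reference for that region (models css_get). */
void model_add_region(struct resv_map_model *m)
{
	m->css_refs++;
	m->nr_regions++;
}

/* Merging two adjacent regions: one region disappears, so its reference
 * is dropped before the merge (models the css_put added by the fix). */
void model_coalesce(struct resv_map_model *m)
{
	assert(m->nr_regions >= 2);
	m->nr_regions--;
	m->css_refs--;
}

/* Removing a whole region releases the single reference it held. */
void model_remove_region(struct resv_map_model *m)
{
	assert(m->nr_regions >= 1);
	m->nr_regions--;
	m->css_refs--;
}
```

Starting from a base count of 1, scenario 1 (two reservations that are later coalesced) and scenario 2 (one larger reservation) both end with one region and a count of 2 in this model, so removing that region returns the count to 1 in either case.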
Mike Kravetz	4e5ff9ac7f	hugetlb_cgroup: fix reservation accounting commit `79aa925bf2` upstream Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon with hugetlbfs and hit the warning below. QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported as free by a VM. The reporting granularity is in pageblock granularity. So when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU. WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50 Modules linked in: ... CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137 Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020 RIP: 0010:page_counter_uncharge+0x4b/0x50 ... Call Trace: hugetlb_cgroup_uncharge_file_region+0x4b/0x80 region_del+0x1d3/0x300 hugetlb_unreserve_pages+0x39/0xb0 remove_inode_hugepages+0x1a8/0x3d0 hugetlbfs_fallocate+0x3c4/0x5c0 vfs_fallocate+0x146/0x290 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Investigation of the issue uncovered bugs in hugetlb cgroup reservation accounting. This patch addresses the found issues. Fixes: `075a61d07a` ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: Michal Privoznik <mprivozn@redhat.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: <stable@vger.kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Aneesh Kumar K . 
V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:44:02 +08:00
luoxuanqiang	47d2a1c9c2	Fix race for duplicate reqsk on identical SYN commit ff46e3b4421923937b7f6e44ffcd3549a074f321 upstream. When bonding is configured in BOND_MODE_BROADCAST mode, if two identical SYN packets are received at the same time and processed on different CPUs, it can potentially create the same sk (sock) but two different reqsk (request_sock) in tcp_conn_request(). These two different reqsk will respond with two SYNACK packets, and since the generation of the seq (ISN) incorporates a timestamp, the final two SYNACK packets will have different seq values. The consequence is that when the Client receives and replies with an ACK to the earlier SYNACK packet, we will reset(RST) it. ======================================================================== This behavior is consistently reproducible in my local setup, which comprises: \| NETA1 ------ NETB1 \| PC_A --- bond --- \| \| --- bond --- PC_B \| NETA2 ------ NETB2 \| - PC_A is the Server and has two network cards, NETA1 and NETA2. I have bonded these two cards using BOND_MODE_BROADCAST mode and configured them to be handled by different CPU. - PC_B is the Client, also equipped with two network cards, NETB1 and NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode. If the client attempts a TCP connection to the server, it might encounter a failure. Capturing packets from the server side reveals: 10.10.10.10.45182 > localhost: Flags [S], seq 320236027, 10.10.10.10.45182 > localhost: Flags [S], seq 320236027, localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116, localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <== 10.10.10.10.45182 > localhost: Flags [.], ack 4294967290, 10.10.10.10.45182 > localhost: Flags [.], ack 4294967290, localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <== localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, Two SYNACKs with different seq numbers are sent by localhost, resulting in an anomaly. 
======================================================================== The attempted solution is as follows: Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the ehash insertion is successful (Up to now, the reason for unsuccessful insertion is that a reqsk for the same connection has already been inserted). If the insertion fails, release the reqsk. Due to the refcnt, Kuniyuki suggests also adding a return value check for the DCCP module; if ehash insertion fails, indicating a successful insertion of the same connection, simply release the reqsk as well. Simultaneously, In the reqsk_queue_hash_req(), the start of the req->rsk_timer is adjusted to be after successful insertion. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240621013929.1386815-1-luoxuanqiang@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:55 +08:00
Andrii Nakryiko	7777039ad9	bpf: Fix use-after-free of bpf_link when priming half-fails commit `138c67677f` upstream. [tapd] https://tapd.woa.com/69992352/bugtrace/bugs/view?bug_id=1069992352117432493 If bpf_link_prime() succeeds to allocate new anon file, but then fails to allocate ID for it, link priming is considered to be failed and user is supposed to be able to directly kfree() bpf_link, because it was never exposed to user-space. But at that point file already keeps a pointer to bpf_link and will eventually call bpf_link_release(), so if bpf_link was kfree()'d by caller, that would lead to use-after-free. Fix this by first allocating ID and only then allocating file. Adding ID to link_idr is ok, because link at that point still doesn't have its ID set, so no user-space process can create a new FD for it. Fixes: `a3b80e1078` ("bpf: Allocate ID for bpf_link") Reported-by: syzbot+39b64425f91b5aab714d@syzkaller.appspotmail.com Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200501185622.3088964-1-andriin@fb.com Signed-off-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
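The ordering argument in this message can be sketched as follows: take the step that does not hand out a pointer to the object first, and the step that does last, so that any failure leaves the caller as the sole owner. Everything here is a hypothetical user-space model (`struct link_model`, `prime_link_model`, and the `id_ok`/`file_ok` switches that simulate allocation results); it is not the bpf_link API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object that a "file" may later point to. */
struct link_model {
	int id;              /* 0 means "no ID published yet" */
	bool file_holds_ptr; /* true once a file keeps a pointer to the link */
};

/*
 * Returns 0 on success, -1 on failure. On failure no file points at the
 * link, so the caller may free it directly.
 */
int prime_link_model(struct link_model *l, bool id_ok, bool file_ok)
{
	int reserved_id;

	assert(l != 0);

	/* Step 1: reserve an ID. Nothing else references the link yet. */
	if (!id_ok)
		return -1;
	reserved_id = 7; /* arbitrary illustrative value */

	/* Step 2: create the file last. If this fails, only the reserved ID
	 * has to be given back; no pointer to the link has escaped. */
	if (!file_ok)
		return -1;

	l->file_holds_ptr = true;
	l->id = reserved_id;
	return 0;
}
```

In this model a failing call always leaves `file_holds_ptr` false, which is the property the fix relies on.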
Felix Huettner	b523032c0d	net: openvswitch: fix race on port output commit `066b86787f` upstream assume the following setup on a single machine: 1. An openvswitch instance with one bridge and default flows 2. two network namespaces "server" and "client" 3. two ovs interfaces "server" and "client" on the bridge 4. for each ovs interface a veth pair with a matching name and 32 rx and tx queues 5. move the ends of the veth pairs to the respective network namespaces 6. assign ip addresses to each of the veth ends in the namespaces (needs to be the same subnet) 7. start some http server on the server network namespace 8. test if a client in the client namespace can reach the http server when following the actions below the host has a chance of getting a cpu stuck in a infinite loop: 1. send a large amount of parallel requests to the http server (around 3000 curls should work) 2. in parallel delete the network namespace (do not delete interfaces or stop the server, just kill the namespace) there is a low chance that this will cause the below kernel cpu stuck message. If this does not happen just retry. Below there is also the output of bpftrace for the functions mentioned in the output. The series of events happening here is: 1. the network namespace is deleted calling `unregister_netdevice_many_notify` somewhere in the process 2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and then runs `synchronize_net` 3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER` 4. this is then handled by `dp_device_event` which calls `ovs_netdev_detach_dev` (if a vport is found, which is the case for the veth interface attached to ovs) 5. this removes the rx_handlers of the device but does not prevent packages to be sent to the device 6. `dp_device_event` then queues the vport deletion to work in background as a ovs_lock is needed that we do not hold in the unregistration path 7. 
`unregister_netdevice_many_notify` continues to call `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0 8. port deletion continues (but details are not relevant for this issue) 9. at some future point the background task deletes the vport If after 7. but before 9. a packet is send to the ovs vport (which is not deleted at this point in time) which forwards it to the `dev_queue_xmit` flow even though the device is unregistering. In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is a while loop (if the packet has a rx_queue recorded) that is infinite if `dev->real_num_tx_queues` is zero. To prevent this from happening we update `do_output` to handle devices without carrier the same as if the device is not found (which would be the code path after 9. is done). Additionally we now produce a warning in `skb_tx_hash` if we will hit the infinite loop. bpftrace (first word is function name): __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2 ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2 netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2 netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024 ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604 stuck message: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! 
[curl:1929279] Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:netdev_pick_tx+0xf1/0x320 Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01 RSP: 0018:ffffb78b40298820 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000 RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000 R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000 FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0 Call Trace: <IRQ> netdev_core_pick_tx+0xa4/0xb0 __dev_queue_xmit+0xf8/0x510 ? __bpf_prog_exit+0x1e/0x30 dev_queue_xmit+0x10/0x20 ovs_vport_send+0xad/0x170 [openvswitch] do_output+0x59/0x180 [openvswitch] do_execute_actions+0xa80/0xaa0 [openvswitch] ? 
kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch] ovs_execute_actions+0x4c/0x120 [openvswitch] ovs_dp_process_packet+0xa1/0x200 [openvswitch] ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch] ? ovs_ct_fill_key+0x1d/0x30 [openvswitch] ? ovs_flow_key_extract+0x2db/0x350 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] ? __htab_map_lookup_elem+0x4e/0x60 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714 ? trace_call_bpf+0xc8/0x150 ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? kprobe_perf_func+0x4f/0x2b0 ? __mod_memcg_lruvec_state+0x63/0xe0 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23d/0xf00 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x15/0x60 process_backlog+0x9e/0x170 __napi_poll+0x33/0x180 net_rx_action+0x126/0x280 ? ttwu_do_activate+0x72/0xf0 __do_softirq+0xd9/0x2e7 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0 do_softirq+0x7d/0xb0 </IRQ> <TASK> __local_bh_enable_ip+0x54/0x60 ip_finish_output2+0x191/0x460 __ip_finish_output+0xb7/0x180 ip_finish_output+0x2e/0xc0 ip_output+0x78/0x100 ? __ip_finish_output+0x180/0x180 ip_local_out+0x5e/0x70 __ip_queue_xmit+0x184/0x440 ? tcp_syn_options+0x1f9/0x300 ip_queue_xmit+0x15/0x20 __tcp_transmit_skb+0x910/0x9c0 ? __mod_memcg_state+0x44/0xa0 tcp_connect+0x437/0x4e0 ? ktime_get_with_offset+0x60/0xf0 tcp_v4_connect+0x436/0x530 __inet_stream_connect+0xd4/0x3a0 ? kprobe_perf_func+0x4f/0x2b0 ? aa_sk_perm+0x43/0x1c0 inet_stream_connect+0x3b/0x60 __sys_connect_file+0x63/0x70 __sys_connect+0xa6/0xd0 ? setfl+0x108/0x170 ? do_fcntl+0xe8/0x5a0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x5c/0xc0 ? __x64_sys_fcntl+0xa9/0xd0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? do_syscall_64+0x69/0xc0 ? __sys_setsockopt+0xea/0x1e0 ? exit_to_user_mode_prepare+0x37/0xb0 ? 
syscall_exit_to_user_mode+0x27/0x50 ? __x64_sys_setsockopt+0x1f/0x30 ? do_syscall_64+0x69/0xc0 ? irqentry_exit+0x1d/0x30 ? exc_page_fault+0x89/0x170 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f7b8101c6a7 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7 RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005 RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410 R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000 </TASK> Fixes: `7f8a436eaa` ("openvswitch: Add conntrack action") Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Di Zhang <emilydzhang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
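The infinite loop described in this message can be reproduced in miniature: reducing a recorded rx queue index by repeated subtraction never terminates when the queue count is zero. The sketch below is a user-space model with a hypothetical function name and a hypothetical `-1` return for the zero case. The patch itself, according to the message, changes `do_output` and adds a warning in `skb_tx_hash`; this is not that code.

```c
#include <stdint.h>

/*
 * Map a recorded rx queue index onto [0, qcount).
 * Returns -1 when the device has no tx queues (hypothetical convention).
 */
int pick_tx_queue(uint16_t rx_queue, uint16_t qcount)
{
	uint16_t hash = rx_queue;

	/* With qcount == 0, "hash >= qcount" is always true and subtracting
	 * zero never changes hash, so the loop below would spin forever. */
	if (qcount == 0)
		return -1;

	while (hash >= qcount)
		hash -= qcount;

	return hash;
}
```

Without the early return, a call such as `pick_tx_queue(7, 0)` would never come back, which matches the soft lockup in the trace above.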
Hyunwoo Kim	5a77fda546	net: openvswitch: Fix Use-After-Free in ovs_ct_exit commit 5ea7b72d4fac2fdbc0425cd8f2ea33abe95235b2 upstream. Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal of ovs_ct_limit_exit, is not part of the RCU read critical section, it is possible that the RCU grace period will pass during the traversal and the key will be free. To prevent this, it should be changed to hlist_for_each_entry_safe. CVE-2024-27395 Fixes: `11efd5cb04` ("openvswitch: Support conntrack zone limit") Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
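The reason a "safe" iterator helps when entries are freed during traversal is that it reads the next pointer before the current entry is released. A minimal user-space analogue, assuming a plain singly linked list rather than the kernel's hlist macros or RCU (the `struct node`, `push`, and `free_all` names are hypothetical):

```c
#include <stdlib.h>

struct node {
	int key;
	struct node *next;
};

/* Prepend a node; returns the new head, or the old head if malloc fails. */
struct node *push(struct node *head, int key)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->key = key;
	n->next = head;
	return n;
}

/* Free every node and return how many were freed. */
int free_all(struct node **head)
{
	struct node *pos = *head;
	struct node *tmp;
	int freed = 0;

	while (pos) {
		tmp = pos->next; /* save the successor before pos is freed */
		free(pos);
		pos = tmp;       /* never dereference the freed node */
		freed++;
	}
	*head = NULL;
	return freed;
}
```

Writing the loop as `for (pos = *head; pos; pos = pos->next) free(pos);` would read `pos->next` from memory that has already been freed, which is the pattern the fix avoids.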
Di Zhang	8a7ec1c178	add DEBUG_NET_WARN_ON_ONCE DEBUG_NET_WARN_ON_ONCE comes from `d268c1f5cf`, which depends on other patches and adds many other things. Only DEBUG_NET_WARN_ON_ONCE is added here to support subsequent patches. Signed-off-by: Di Zhang <emilydzhang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
Di Zhang	ded83cd8b9	net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled commit `e4dd0d3a2f` upstream In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly. The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0 Therefore, the timer expires so quickly without any doubt. I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric. Fixes: `36e31b0af5` ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Di Zhang <emilydzhang@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
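The arithmetic in this message is easy to check in isolation: doubling a zero timeout yields zero again, so the timer keeps firing immediately unless a lower bound is applied. The sketch below is a user-space illustration; the `SKETCH_RTO_*` values and function names are hypothetical and it does not reproduce the kernel patch, only the idea of bounding the result from below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bounds in milliseconds, chosen for illustration only. */
#define SKETCH_RTO_MIN 200u
#define SKETCH_RTO_MAX 120000u

uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Backoff as quoted in the message: min(rto << 1, max). Zero stays zero. */
uint32_t rto_backoff_unbounded(uint32_t rto)
{
	return min_u32(rto << 1, SKETCH_RTO_MAX);
}

/* Same backoff with a lower bound, so the result can never reach zero. */
uint32_t rto_backoff_bounded(uint32_t rto)
{
	uint32_t next = min_u32(rto << 1, SKETCH_RTO_MAX);

	assert(SKETCH_RTO_MIN <= SKETCH_RTO_MAX);
	return next < SKETCH_RTO_MIN ? SKETCH_RTO_MIN : next;
}
```

With these illustrative values, an input of 0 stays at 0 in the unbounded version and becomes 200 in the bounded one, while ordinary inputs such as 300 double to 600 in both.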
katrinzhou	ff5cd3ac27	xfs: fix the lack of curly brackets Signed-off-by: katrinzhou <katrinzhou@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
yilingjin	efa8e46af8	sli: fix period over-limit bug period in control should be less than jiffies_to_usecs(MAX_JIFFY_OFFSET), otherwise it will use MAX_JIFFY_OFFSET or be truncated. [tapd] https://tapd.woa.com/OS_kernel_dev/bugtrace/bugs/view?bug_id=1069992352124425367 Reviewed-by: Liu Chun <kaicliu@tencent.com> Signed-off-by: yilingjin <yilingjin@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00
Darrick J. Wong	4fc898c9c0	xfs: fix chown leaking delalloc quota blocks when fssetxattr fails commit `1aecf3734a` upstream. Fix CVE:CVE-2022-27672 While refactoring the quota code to create a function to allocate inode change transactions, I noticed that xfs_qm_vop_chown_reserve does more than just make reservations: it also modifies the incore counts directly to handle the owner id change for the delalloc blocks. I then observed that the fssetxattr code continues validating input arguments after making the quota reservation but before dirtying the transaction. If the routine decides to error out, it fails to undo the accounting switch! This leads to incorrect quota reservation and failure down the line. We can fix this by making the reservation function do only that -- for the new dquot, it reserves ondisk and delalloc blocks to the transaction, and the old dquot hangs on to its incore reservation for now. Once we actually switch the dquots, we can then update the incore reservations because we've dirtied the transaction and it's too late to turn back now. No fixes tag because this has been broken since the start of git. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Aurelianliu <aurelianliu@tencent.com> Signed-off-by: Jianping Liu <frankjpliu@tencent.com>	2024-11-28 21:43:54 +08:00

874204 Commits

All Branches