Commit Graph

874204 Commits

Author SHA1 Message Date
Jianping Liu a899c9ea34 drivers,ps3stor: fix compile error when using allyesconfig config
There is a compile error when run:
make allyesconfig && make -j32

The compile error log is as below:
ld: error: unplaced orphan section `.GCC.command.line' from `vmlinux.o'.
or aarch64-linux-gnu-ld: error: unplaced orphan section `.GCC.command.line' from `drivers/scsi/linkdata/ps3stor/ps3_cmd_channel.o'
......

.GCC.command.line section is created by -frecord-gcc-switches compile
option.

The info about -frecord-gcc-switches option:
This switch causes the command line that was used to invoke the compiler to
be recorded into the object file that is being created.  This switch is only
implemented on some targets and the exact format of the recording is target
and binary file format dependent, but it usually takes the form of a section
containing ASCII text.

-frecord-gcc-switches option is useless in release version, delete it.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-12-13 20:45:28 +08:00
liujie_answer 9b08da6364 scsi: Solve the problem of duplicate definition of first_online_pgdat and next_online_pgdat functions in ps3stor and other modules
category: bugfix

Rename first_online_pgdat to ps3_first_online_pgdat
Rename next_online_pgdat to ps3_next_online_pgdat
Add for_each_ps3_online_pgdat definition to replace for_each_online_pgdat

Verified on x86 and arm64

Signed-off-by: liujie5@linkdatatechnology.com
2024-12-12 15:31:05 +08:00
Jianping Liu dd8418c96c dist: release 5.4.241-30.0017.17
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-12-09 19:42:29 +08:00
mengensun 2ed2e1edbc mm: make concurrent-accessing of pagetypeinfo queued on a mutex lock
if there are processes concurrent-accessing of pagetypeinfo, all
processes will be queued on zone spinlock. while once the process
got the zone spin lock, it will holding the spinlock tens ms. which
is not friently to other process allocating or freeing pages.

for now, we make concurrent-accessing of pagetypeinfo queue on a
mutex lock first before geting the zone spinlock.

on a INTEL(R) XEON(R) PLATINUM 8576C which 2 numa node of total 1TB
main memory and 224 cpu, using test case below,we get following
test result:

test case:
for i in $(seq 1 1 5); do taskset -c $i ./a.sh & done
where a.sh like follow:
while :
do
  cat /proc/pagetypeinfo >/dev/null
done

watch pagetypeinfo_showfree_print using bcc
/usr/share/bcc/tools/funclatency pagetypeinfo_showfree_print

before patch(all time in ns):
      4096 -> 8191       : 309
      8192 -> 16383      : 8
     16384 -> 32767      : 17
     32768 -> 65535      : 299
     65536 -> 131071     : 1
    131072 -> 262143     : 0
    262144 -> 524287     : 0
    524288 -> 1048575    : 0
   1048576 -> 2097151    : 0
   2097152 -> 4194303    : 0
   4194304 -> 8388607    : 0
   8388608 -> 16777215   : 297
  16777216 -> 33554431   : 83
  33554432 -> 67108863   : 79
  67108864 -> 134217727  : 102
 134217728 -> 268435455  : 63
 268435456 -> 536870911  : 5

after patch(all time in ns):
      4096 -> 8191       : 161
      8192 -> 16383      : 8
     16384 -> 32767      : 1
     32768 -> 65535      : 170
     65536 -> 131071     : 0
    131072 -> 262143     : 0
    262144 -> 524287     : 0
    524288 -> 1048575    : 0
   1048576 -> 2097151    : 0
   2097152 -> 4194303    : 0
   4194304 -> 8388607    : 0
   8388608 -> 16777215   : 170
  16777216 -> 33554431   : 171

=> make function latency from 5s to 30ms

Reviewed-by: caelli <caelli@tencent.com>
Reviewed-by: alexjlzheng <alexjlzheng@tencent.com>
Signed-off-by: mengensun <mengensun@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-12-09 19:38:18 +08:00
Aurelianliu 3b2e445dc1 xfs: modify mount recovery
if recovery error, not to wait io complete.
in 6.6 or 6.11, this uses shutdown to force io, xfs not stall,
but can't mount successfully. Here only uses force io when no
error happens.
after mount, xfs will flush io.

Signed-off-by: Aurelianliu <aurelianliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-12-09 19:38:09 +08:00
Jianping Liu e16b1f5b7b config,arm64: enable CONFIG_DRM_HISI_HIBMC and CONFIG_DRM_HISI_KIRIN
--bug=1020426283134913779

Troubleshooting the issue of inaccurate graphical interface resolution
when installing aarch64 architecture ISO on Huawei servers.
2024-12-09 19:31:47 +08:00
liujie_answer 843f1ef744 scsi: linkdata ps3stor compilation optimization
category: feature

Add Makefie and Kconfig in the drivers/scsi/linkdata directory, and drivers/scsi/Makefile and Kconfig call the Makefile and Kconfig in the drivers/scsi/linkdata directory respectively.

Verified on x86 and arm64

Signed-off-by: liujie5@linkdatatechnology.com
2024-12-09 19:31:47 +08:00
Jianping Liu 33b13d3cee config,arm64: update tencent.config without manual change
Update tencent.config by follow commands:
make tencentconfig
make savedefconfig
mv defconfig arch/arm64/configs/tencent.config
2024-12-06 12:47:08 +08:00
Jianping Liu b7a235df77 Merge branch linux-5.4/devel into linux-5.4/lts/5.4.241-30.0017 2024-12-05 11:14:05 +08:00
Jianping Liu 3c39c5a4fc drivers,3snic: support incremental compilation
In 3snic's Makefile, it using "SYS_TIME=$(shell date +%Y-%m-%d_%H:%M:%S)",
and put it into driver version. That causing 3snic don't support incremental
compilation.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-12-05 11:10:57 +08:00
chinaljp030 c45bbd064f
[devel-5.4] linkdata: add ps3stor driver support
Merge pull request  from liujie_answer/linux-5.4/devel
2024-12-05 02:58:29 +00:00
liujie_answer ddd836946e scsi: add support for linkdata HBA/RAID Controller driver
category: feature

Verified on x86 and arm64

Signed-off-by: liujie5@linkdatatechnology.com
2024-12-05 09:58:17 +08:00
Yonghong Song 7140f9e1da selftests/bpf: Fix pyperf180 compilation failure with clang18
commit 100888fb6d8a185866b1520031ee7e3182b173de upstream.

With latest clang18 (main branch of llvm-project repo), when building bpf selftests,
    [~/work/bpf-next (master)]$ make -C tools/testing/selftests/bpf LLVM=1 -j

The following compilation error happens:
    fatal error: error in backend: Branch target out of insn range
    ...
    Stack dump:
    0.      Program arguments: clang -g -Wall -Werror -D__TARGET_ARCH_x86 -mlittle-endian
      -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/include
      -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -I/home/yhs/work/bpf-next/tools/include/uapi
      -I/home/yhs/work/bpf-next/tools/testing/selftests/usr/include -idirafter
      /home/yhs/work/llvm-project/llvm/build.18/install/lib/clang/18/include -idirafter /usr/local/include
      -idirafter /usr/include -Wno-compare-distinct-pointer-types -DENABLE_ATOMICS_TESTS -O2 --target=bpf
      -c progs/pyperf180.c -mcpu=v3 -o /home/yhs/work/bpf-next/tools/testing/selftests/bpf/pyperf180.bpf.o
    1.      <eof> parser at end of file
    2.      Code generation
    ...

The compilation failure only happens to cpu=v2 and cpu=v3. cpu=v4 is okay
since cpu=v4 supports 32-bit branch target offset.

The above failure is due to upstream llvm patch [1] where some inlining behavior
are changed in clang18.

To workaround the issue, previously all 180 loop iterations are fully unrolled.
The bpf macro __BPF_CPU_VERSION__ (implemented in clang18 recently) is used to avoid
unrolling changes if cpu=v4. If __BPF_CPU_VERSION__ is not available and the
compiler is clang18, the unrollng amount is unconditionally reduced.

  [1] 1a2e77cf9e

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-12-03 14:19:10 +08:00
Jianping Liu aaf99ade66 dist: release 5.4.241-30.0017.16
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-29 19:19:13 +08:00
Jianping Liu b4cb2fb5d6 drivers,wangxun: fix compile error when using allyesconfig
he symbol _kc_eth_hw_addr_set has been redefined, fix the error as below:
drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o:in function '_kc_eth_hw_addr_set':
txgbe_kcompat.c:(.text+0x100): _kc_eth_hw_addr_set multiple definition
drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0x100): first defined here

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-29 19:12:52 +08:00
Menglong Dong 8bec235a25 libbpf: backport BTF_KIND_* from upstream
In order to fix the compiling error on tools/testing/selftests/bpf,
backport all defination of BTF_KIND_* from upstream.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-29 19:05:59 +08:00
Jianping Liu 89c81c1144 drivers: fix compile error when using allyesconfig
The symbol _kc_pci_get_dsn has been redefined, fix the error as below:
aarch64-linux-gnu-ld: drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o: in function `_kc_pci_get_dsn':
txgbe_kcompat.c:(.text+0x0): multiple definition of `_kc_pci_get_dsn'; drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0x0): first defined here
aarch64-linux-gnu-ld: drivers/net/ethernet/wangxun/txgbe/txgbe_kcompat.o: in function `_kc_eth_hw_addr_set':
txgbe_kcompat.c:(.text+0xe4): multiple definition of `_kc_eth_hw_addr_set'; drivers/net/ethernet/wangxun/ngbe/ngbe_kcompat.o:ngbe_kcompat.c:(.text+0xe4): first defined here

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-29 09:09:23 +08:00
Jianping Liu 6642e166dd dist: release 5.4.241-30.0017.15
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 22:03:05 +08:00
Huang Cun 4e4a7cdd4c crashkernel: give 2M default reserve memory to pstore
divide crashkernel reserve memory to two part. the end partion memory
is gived to pstore. support kernel last log before reboot and reset.

support pstore_size=xM cmdline config, but the pstore_addr cann't be
configured from cmdline.

Signed-off-by: Huang Cun <cunhuang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:59:34 +08:00
Alexander Lobakin 6cc0b996a3 net: fix premature exit from NAPI state polling in napi_disable()
commit 0315a075f1 upstream.

Commit 719c571970 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.

Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.

Error path looks like:

napi_disable():
unsigned long val, new; /* new is uninitialized */

do {
	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
				      NAPIF_STATE_SCHED is set */
	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
		usleep_range(20, 200);
		continue; /* go straight to the condition check */
	}
	new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
						  writes garbage */

napi_enable():
do {
	val = READ_ONCE(n->state);
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>

while the typical BUG splat is like:

[  172.652461] ------------[ cut here ]------------
[  172.652462] kernel BUG at net/core/dev.c:6937!
[  172.656914] invalid opcode: 0000 [] PREEMPT SMP PTI
[  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 
[  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[  172.782403] Call Trace:
[  172.784857]  <TASK>
[  172.786963]  ice_up_complete+0x6f/0x210 [ice]
[  172.791349]  ice_xdp+0x136/0x320 [ice]
[  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
[  172.799648]  dev_xdp_install+0x61/0xe0
[  172.803401]  dev_xdp_attach+0x1e0/0x550
[  172.807240]  dev_change_xdp_fd+0x1e6/0x220
[  172.811338]  do_setlink+0xee8/0x1010
[  172.814917]  rtnl_setlink+0xe5/0x170
[  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[  172.823732]  ? security_capable+0x36/0x50
< snip >

Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.

From v1 [0]:
 - just use a for-loop to simplify both the fix and the existing
   code (Eric).

[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com

Fixes: 719c571970 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:59:16 +08:00
Jakub Kicinski 3cad448421 net: make napi_disable() symmetric with enable
commit 719c571970 upstream.
I remove one line which is about THREADED and BUSY_POLL because
we don't have two clear operation lines before.

Commit 3765996e4f ("napi: fix race inside napi_enable") fixed
an ordering bug in napi_enable() and made the napi_enable() diverge
from napi_disable(). The state transitions done on disable are
not symmetric to enable.

There is no known bug in napi_disable() this is just refactoring.

Eric suggests we can also replace msleep(1) with a more opportunistic
usleep_range().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:59:16 +08:00
Xuan Zhuo 7094bec5e1 napi: fix race inside napi_enable
commit 3765996e4f upstream.
In this patch, I intentionally remove two lines about the NAPI_STATE_THREADED
flag because we haven't backported that part.

The process will cause napi.state to contain NAPI_STATE_SCHED and
not in the poll_list, which will cause napi_disable() to get stuck.

The prefix "NAPI_STATE_" is removed in the figure below, and
NAPI_STATE_HASHED is ignored in napi.state.

                      CPU0       |                   CPU1       | napi.state
===============================================================================
napi_disable()                   |                              | SCHED | NPSVC
napi_enable()                    |                              |
{                                |                              |
    smp_mb__before_atomic();     |                              |
    clear_bit(SCHED, &n->state); |                              | NPSVC
                                 | napi_schedule_prep()         | SCHED | NPSVC
                                 | napi_poll()                  |
                                 |   napi_complete_done()       |
                                 |   {                          |
                                 |      if (n->state & (NPSVC | | (1)
                                 |               _BUSY_POLL)))  |
                                 |           return false;      |
                                 |     ................         |
                                 |   }                          | SCHED | NPSVC
                                 |                              |
    clear_bit(NPSVC, &n->state); |                              | SCHED
}                                |                              |
                                 |                              |
napi_schedule_prep()             |                              | SCHED | MISSED (2)

(1) Here return direct. Because of NAPI_STATE_NPSVC exists.
(2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list

Since NAPI_STATE_SCHED already exists and napi is not in the
sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always
exist.

1. This will cause this queue to no longer receive packets.
2. If you encounter napi_disable under the protection of rtnl_lock, it
   will cause the entire rtnl_lock to be locked, affecting the overall
   system.

This patch uses cmpxchg to implement napi_enable(), which ensures that
there will be no race due to the separation of clear two bits.

Fixes: 2d8bff1269 ("netpoll: Close race condition between poll_one_napi and napi_disable")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:59:16 +08:00
Yongliang Gao 784345b5ac rtc: retry to read rtc time if it fails
We add a retry here if the __rtc_read_time call fails or the
rtc_tm_to_ktime result is KTIME_MAX.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:58:57 +08:00
Yongliang Gao 6ddc70a757 rtc: show rtc time upon read or time conversion failure
We can show the rtc_time information when it fails, which is more
helpful for problem localization.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:58:12 +08:00
Yongliang Gao 8a41406b2d rtc: fix the issue of missing pm_relax
Before queue rtc irq work, pm_stay_awake is called, so pm_relax needs
to be called before return.

Fixes: 9721d939cb8c ("rtc: check if rtc_tm_to_ktime was successful in rtc_timer_do_work()")
Fixes: 3ab026fe94f2 ("rtc: check if __rtc_read_time was successful in rtc_timer_do_work()")

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:58:12 +08:00
Jianping Liu 5401d9702f dist: release 5.4.241-30-0017.14
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:04 +08:00
Yongliang Gao 6cf46c359c rtc: check if rtc_tm_to_ktime was successful in rtc_timer_do_work()
The struct rtc_time tm may contain a date/time read from the RTC hardware,
but it is far away from now. However, __rtc_read_time return success.

When calling rtc_tm_to_ktime later, the result may be a very large value
KTIME_MAX. If there are periodic timers in rtc->timerqueue, they will
continually expire, may causing kernel softlockup.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:04 +08:00
Haisu Wang 083e3422ee rue/io: fix child blkcg of hier buffered write can not exceed 2MB
When using hierarchy buffered_write_bps function, unless explictly
set 0 or a higher value of buffered_write_bps in child blkcg. The
child group can't exceed 2MB by default.

--story=132623821 "child cgroup can not exceed 2MB"

Fixes: 126af8f8e346 ("rue/io: buffered_write_bps hierarchy support")
Reported-by: Zhijian Xu <zhijianxu@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Huang Cun dd461732f2 crashkernel: auto adjust crashkernel min size to 800MB for KASAN
Signed-off-by: Huang Cun <cunhuang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
linuszeng 473e7516fa Revert "toa add net namespace from csig"
This reverts commit 46146c6fa5fbf18d29bf08c3533138bd74a336a0.

This patch will cause TOA component authentication to fail.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Ze Gao c0ddad1e29 sched, qos: Fix OOB on switching SCHED_BT to rt
Since SCHED_BT modifies p->static_prio and set_load_weight()
is called unconditionally during __setscheduler_params, any
attempts to set other policies other that SCHED_BT on SCHED_BT
tasks ought to recover what p->static_prio stands for originally.

Currently only switching to SCHED_FAIR is considered whereas
others are not.  Fix it by resetting p->static_prio before
doing set_load_weight() for the rest.

Note this is an integrity fix and polices changes should be
only allowed in between SCHED_FAIR and SCHED_BT.

Signed-off-by: Ze Gao <zegao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Haisu Wang 419302b0cf rue/io: do not check sysctl_io_qos_enabled for throttle hierarchy
Since parent_sq linked in throtl_pd_init(). Switch sysctl_io_qos_enabled
won't do any effect. So remove the sysctl_io_qos_enabled condition.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Yongliang Gao 71e32887e5 rtc: check if __rtc_read_time was successful in rtc_timer_do_work()
upstream: upstreaming

If the __rtc_read_time call fails,, the struct rtc_time tm; may contain
uninitialized data, or an illegal date/time read from the RTC hardware.

When calling rtc_tm_to_ktime later, the result may be a very large value
(possibly KTIME_MAX). If there are periodic timers in rtc->timerqueue,
they will continually expire, may causing kernel softlockup.

Fixes: 6610e0893b ("RTC: Rework RTC code to use timerqueue for events")
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Acked-by: Jingqun Li <jingqunli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Alexandre Belloni 33d2b1a1e6 rtc: disallow update interrupts when time is invalid
commit 3e74ddaa7c upstream.

Never enable update interrupts when the time set on the rtc is invalid.
In that case, also avoid enabling the emulation because it will fail for
the same reason.

Link: https://lore.kernel.org/r/20191021155631.3342-2-alexandre.belloni@bootlin.com
Link: https://lore.kernel.org/r/CA+ASDXMarBG5C1Kz42B9i_iVZ1=i6GgH9Yja2cdmSueKD_As_g@mail.gmail.com
Reported-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Darrick J. Wong 26d2f312d3 xfs: verify buffer contents when we skip log replay
commit 22ed903eee upstream.

syzbot detected a crash during log recovery:

XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074

CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
 xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
 xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
 xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
 xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
 xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
 xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
 xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
 xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
 xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
 </TASK>

The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF.  The ondisk AGF has a higher LSN than the recovered log
item.  xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.

Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:

	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
			buf_f->blf_len, buf_flags, &bp, NULL);

Skipping the buffer leaves its contents in memory unverified.  This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32.  This is
impossible so we run off the end of the cursor object and crash.

Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt.  It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.

Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: aurelianliu <aurelianliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Gao Xiang 305ee9b3ba xfs: fix forkoff miscalculation related to XFS_LITINO(mp)
commit ada49d64fb upstream.

Currently, commit e9e2eae89d dropped a (int) decoration from
XFS_LITINO(mp), and since sizeof() expression is also involved,
the result of XFS_LITINO(mp) is simply as the size_t type
(commonly unsigned long).

Considering the expression in xfs_attr_shortform_bytesfit():
  offset = (XFS_LITINO(mp) - bytes) >> 3;
let "bytes" be (int)340, and
    "XFS_LITINO(mp)" be (unsigned long)336.

on 64-bit platform, the expression is
  offset = ((unsigned long)336 - (int)340) >> 3 =
           (int)(0xfffffffffffffffcUL >> 3) = -1

but on 32-bit platform, the expression is
  offset = ((unsigned long)336 - (int)340) >> 3 =
           (int)(0xfffffffcUL >> 3) = 0x1fffffff
instead.

so offset becomes a large positive number on 32-bit platform, and
cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0.

Therefore, one result is
  "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));"

assertion failure in xfs_idata_realloc(), which was also the root
cause of the original bugreport from Dennis, see:
   https://bugzilla.redhat.com/show_bug.cgi?id=1894177

And it can also be manually triggered with the following commands:
  $ touch a;
  $ setfattr -n user.0 -v "`seq 0 80`" a;
  $ setfattr -n user.1 -v "`seq 0 80`" a

on 32-bit platform.

Fix the case in xfs_attr_shortform_bytesfit() by bailing out
"XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading
comment together with this bugfix suggested by Darrick. It seems the
other users of XFS_LITINO(mp) are not impacted.

Fixes: commit e9e2eae89d ("xfs: only check the superblock version for dinode size calculation")
Cc: <stable@vger.kernel.org> # 5.7+
Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: aurelianliu <aurelianliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Yongliang Gao 3347ea435a Revert "netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test"
This reverts commit 50c07f202106c32b8658cf3a036cf10a617f38f1.

This patch fixes a race condition. but the synchronize_rcu() added to
the swap function unnecessarily slows it down. And the patch that fixed
the performance regression introduced a UAF (Use After Free) issue:
961858ce4471 netfilter: ipset: fix performance regression in swap operation

So it is being revert along with the others.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Yongliang Gao cad23b9f88 Revert "netfilter: ipset: fix performance regression in swap operation"
This reverts commit 961858ce4471afb8fe13fcd4e6a1de03bbb877b0.

According to:
4e7aaa6b82d6 netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type

There is a race condition between namespace cleanup in ipset and the garbage
collection of the list:set type. The namespace cleanup can destroy the list:set
type of sets while the gc of the set type is waiting to run in rcu cleanup.
The latter uses data from the destroyed set which thus leads use after free.

Therefore, we will first revert this patch to avoid introducing more problems.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Peter Xu ce9bdd85ff mm/hugetlb: fix missing hugetlb_lock for resv uncharge
commit b76b46902c2d0395488c8412e1116c2486cdfcb2 upstream

There is a recent report on UFFDIO_COPY over hugetlb:

https://lore.kernel.org/all/000000000000ee06de0616177560@google.com/

350:	lockdep_assert_held(&hugetlb_lock);

Should be an issue in hugetlb but triggered in an userfault context, where
it goes into the unlikely path where two threads modifying the resv map
together.  Mike has a fix in that path for resv uncharge but it looks like
the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd()
will update the cgroup pointer, so it requires to be called with the lock
held.

Conflicts:
	mm/hugetlb.c

Link: https://lkml.kernel.org/r/20240417211836.2742593-3-peterx@redhat.com
Fixes: 79aa925bf2 ("hugetlb_cgroup: fix reservation accounting")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: syzbot+4b8077a5fccc61c385a1@syzkaller.appspotmail.com
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:03 +08:00
Miaohe Lin 9952b570b2 hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
commit d85aecf284 upstream

The current implementation of hugetlb_cgroup for shared mappings could
have different behavior.  Consider the following two scenarios:

 1.Assume initial css reference count of hugetlb_cgroup is 1:
  1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
      count is 2 associated with 1 file_region.
  1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
      count is 3 associated with 2 file_region.
  1.3 coalesce_file_region will coalesce these two file_regions into
      one. So css reference count is 3 associated with 1 file_region
      now.

 2.Assume initial css reference count of hugetlb_cgroup is 1 again:
  2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
      count is 2 associated with 1 file_region.

Therefore, we might have one file_region while holding one or more css
reference counts. This inconsistency could lead to imbalanced css_get()
and css_put() pair. If we do css_put one by one (i.g. hole punch case),
scenario 2 would put one more css reference. If we do css_put all
together (i.g. truncate case), scenario 1 will leak one css reference.

The imbalanced css_get() and css_put() pair would result in a non-zero
reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
directory is removed __but__ associated resource is not freed. This
might result in OOM or can not create a new hugetlb cgroup in a busy
workload ultimately.

In order to fix this, we have to make sure that one file_region must
hold exactly one css reference. So in coalesce_file_region case, we
should release one css reference before coalescence. Also only put css
reference when the entire file_region is removed.

The last thing to note is that the caller of region_add() will only hold
one reference to h_cg->css for the whole contiguous reservation region.
But this area might be scattered when there are already some
file_regions reside in it. As a result, many file_regions may share only
one h_cg->css reference. In order to ensure that one file_region must
hold exactly one css reference, we should do css_get() for each
file_region and release the reference held by caller when they are done.

[linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
  Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com

Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
Fixes: 075a61d07a ("hugetlb_cgroup: add accounting for shared mappings")
Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:02 +08:00
Mike Kravetz 4e5ff9ac7f hugetlb_cgroup: fix reservation accounting
commit 79aa925bf2 upstream

Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and hit the warning below.  QEMU with free page hinting
uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
as free by a VM.  The reporting granularity is in pageblock granularity.
So when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
one huge page in QEMU.

  WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
  Modules linked in: ...
  CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 
  Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
  RIP: 0010:page_counter_uncharge+0x4b/0x50
  ...
  Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Investigation of the issue uncovered bugs in hugetlb cgroup reservation
accounting.  This patch addresses the found issues.

Fixes: 075a61d07a ("hugetlb_cgroup: add accounting for shared mappings")
Reported-by: Michal Privoznik <mprivozn@redhat.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:44:02 +08:00
luoxuanqiang 47d2a1c9c2 Fix race for duplicate reqsk on identical SYN
commit ff46e3b4421923937b7f6e44ffcd3549a074f321 upstream.

When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().

These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.

The consequence is that when the Client receives and replies with an ACK
to the earlier SYNACK packet, we will reset(RST) it.

========================================================================

This behavior is consistently reproducible in my local setup,
which comprises:

                  | NETA1 ------ NETB1 |
PC_A --- bond --- |                    | --- bond --- PC_B
                  | NETA2 ------ NETB2 |

- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
  bonded these two cards using BOND_MODE_BROADCAST mode and configured
  them to be handled by different CPU.

- PC_B is the Client, also equipped with two network cards, NETB1 and
  NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.

If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:

10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,

Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.

========================================================================

The attempted solution is as follows:
Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the
ehash insertion is successful (Up to now, the reason for unsuccessful
insertion is that a reqsk for the same connection has already been
inserted). If the insertion fails, release the reqsk.

Due to the refcnt, Kuniyuki suggests also adding a return value check
for the DCCP module; if ehash insertion fails, indicating a successful
insertion of the same connection, simply release the reqsk as well.

Simultaneously, In the reqsk_queue_hash_req(), the start of the
req->rsk_timer is adjusted to be after successful insertion.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240621013929.1386815-1-luoxuanqiang@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:55 +08:00
Andrii Nakryiko 7777039ad9 bpf: Fix use-after-free of bpf_link when priming half-fails
commit 138c67677f upstream.

[tapd]
https://tapd.woa.com/69992352/bugtrace/bugs/view?bug_id=1069992352117432493

If bpf_link_prime() succeeds to allocate new anon file, but then fails to
allocate ID for it, link priming is considered to be failed and user is
supposed ot be able to directly kfree() bpf_link, because it was never exposed
to user-space.

But at that point file already keeps a pointer to bpf_link and will eventually
call bpf_link_release(), so if bpf_link was kfree()'d by caller, that would
lead to use-after-free.

Fix this by first allocating ID and only then allocating file. Adding ID to
link_idr is ok, because link at that point still doesn't have its ID set, so
no user-space process can create a new FD for it.

Fixes: a3b80e1078 ("bpf: Allocate ID for bpf_link")
Reported-by: syzbot+39b64425f91b5aab714d@syzkaller.appspotmail.com
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200501185622.3088964-1-andriin@fb.com
Signed-off-by: Huang Cun <cunhuang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
Felix Huettner b523032c0d net: openvswitch: fix race on port output
commit 066b86787f upstream
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
   tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
   to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server

when following the actions below the host has a chance of getting a cpu
stuck in a infinite loop:
1. send a large amount of parallel requests to the http server (around
   3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
   stop the server, just kill the namespace)

there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.

The series of events happening here is:
1. the network namespace is deleted calling
   `unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
   then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
   `ovs_netdev_detach_dev` (if a vport is found, which is the case for
   the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
   packages to be sent to the device
6. `dp_device_event` then queues the vport deletion to work in
   background as a ovs_lock is needed that we do not hold in the
   unregistration path
7. `unregister_netdevice_many_notify` continues to call
   `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport

If after 7. but before 9. a packet is send to the ovs vport (which is
not deleted at this point in time) which forwards it to the
`dev_queue_xmit` flow even though the device is unregistering.
In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is
a while loop (if the packet has a rx_queue recorded) that is infinite if
`dev->real_num_tx_queues` is zero.

To prevent this from happening we update `do_output` to handle devices
without carrier the same as if the device is not found (which would
be the code path after 9. is done).

Additionally we now produce a warning in `skb_tx_hash` if we will hit
the infinite loop.

bpftrace (first word is function name):

__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604

stuck message:

watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:netdev_pick_tx+0xf1/0x320
Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
FS:  00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
Call Trace:
 <IRQ>
 netdev_core_pick_tx+0xa4/0xb0
 __dev_queue_xmit+0xf8/0x510
 ? __bpf_prog_exit+0x1e/0x30
 dev_queue_xmit+0x10/0x20
 ovs_vport_send+0xad/0x170 [openvswitch]
 do_output+0x59/0x180 [openvswitch]
 do_execute_actions+0xa80/0xaa0 [openvswitch]
 ? kfree+0x1/0x250
 ? kfree+0x1/0x250
 ? kprobe_perf_func+0x4f/0x2b0
 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
 ovs_execute_actions+0x4c/0x120 [openvswitch]
 ovs_dp_process_packet+0xa1/0x200 [openvswitch]
 ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
 ? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
 ? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
 ovs_vport_receive+0x77/0xd0 [openvswitch]
 ? __htab_map_lookup_elem+0x4e/0x60
 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
 ? trace_call_bpf+0xc8/0x150
 ? kfree+0x1/0x250
 ? kfree+0x1/0x250
 ? kprobe_perf_func+0x4f/0x2b0
 ? kprobe_perf_func+0x4f/0x2b0
 ? __mod_memcg_lruvec_state+0x63/0xe0
 netdev_port_receive+0xc4/0x180 [openvswitch]
 ? netdev_port_receive+0x180/0x180 [openvswitch]
 netdev_frame_hook+0x1f/0x40 [openvswitch]
 __netif_receive_skb_core.constprop.0+0x23d/0xf00
 __netif_receive_skb_one_core+0x3f/0xa0
 __netif_receive_skb+0x15/0x60
 process_backlog+0x9e/0x170
 __napi_poll+0x33/0x180
 net_rx_action+0x126/0x280
 ? ttwu_do_activate+0x72/0xf0
 __do_softirq+0xd9/0x2e7
 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0
 do_softirq+0x7d/0xb0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0x54/0x60
 ip_finish_output2+0x191/0x460
 __ip_finish_output+0xb7/0x180
 ip_finish_output+0x2e/0xc0
 ip_output+0x78/0x100
 ? __ip_finish_output+0x180/0x180
 ip_local_out+0x5e/0x70
 __ip_queue_xmit+0x184/0x440
 ? tcp_syn_options+0x1f9/0x300
 ip_queue_xmit+0x15/0x20
 __tcp_transmit_skb+0x910/0x9c0
 ? __mod_memcg_state+0x44/0xa0
 tcp_connect+0x437/0x4e0
 ? ktime_get_with_offset+0x60/0xf0
 tcp_v4_connect+0x436/0x530
 __inet_stream_connect+0xd4/0x3a0
 ? kprobe_perf_func+0x4f/0x2b0
 ? aa_sk_perm+0x43/0x1c0
 inet_stream_connect+0x3b/0x60
 __sys_connect_file+0x63/0x70
 __sys_connect+0xa6/0xd0
 ? setfl+0x108/0x170
 ? do_fcntl+0xe8/0x5a0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0x5c/0xc0
 ? __x64_sys_fcntl+0xa9/0xd0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? do_syscall_64+0x69/0xc0
 ? __sys_setsockopt+0xea/0x1e0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_setsockopt+0x1f/0x30
 ? do_syscall_64+0x69/0xc0
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f7b8101c6a7
Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
 </TASK>

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Di Zhang <emilydzhang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
Hyunwoo Kim 5a77fda546 net: openvswitch: Fix Use-After-Free in ovs_ct_exit
commit 5ea7b72d4fac2fdbc0425cd8f2ea33abe95235b2 upstream.

Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal
of ovs_ct_limit_exit, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.

To prevent this, it should be changed to hlist_for_each_entry_safe.

CVE-2024-27395

Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
Di Zhang 8a7ec1c178 add DEBUG_NET_WARN_ON_ONCE
DEBUG_NET_WARN_ON_ONCE comes from
d268c1f5cf, which depends on other
patches and adds many other things. Only DEBUG_NET_WARN_ON_ONCE
is added here to support subsequent patches.

Signed-off-by: Di Zhang <emilydzhang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
Di Zhang ded83cd8b9 net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
commit e4dd0d3a2f upstream

In the real workload, I encountered an issue which could cause the RTO
timer to retransmit the skb per 1ms with linear option enabled. The amount
of lost-retransmitted skbs can go up to 1000+ instantly.

The root cause is that if the icsk_rto happens to be zero in the 6th round
(which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
due to the changed calculation method in tcp_retransmit_timer() as follows:

icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);

Above line could be converted to
icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0

Therefore, the timer expires so quickly without any doubt.

I read through the RFC 6298 and found that the RTO value can be rounded
up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
regarded as the lower bound in this patch as suggested by Eric.

Fixes: 36e31b0af5 ("net: TCP thin linear timeouts")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Di Zhang <emilydzhang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
katrinzhou ff5cd3ac27 xfs: fix the lack of curly brackets
Signed-off-by: katrinzhou <katrinzhou@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
yilingjin efa8e46af8 sli: fix period over-limit bug
period in control should less than jiffies_to_usecs(MAX_JIFFY_OFFSET),
otherwise it will use MAX_JIFFY_OFFSET or be truncated.

[tapd]
https://tapd.woa.com/OS_kernel_dev/bugtrace/bugs/view?bug_id=1069992352124425367

Reviewed-by: Liu Chun <kaicliu@tencent.com>
Signed-off-by: yilingjin <yilingjin@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00
Darrick J. Wong 4fc898c9c0 xfs: fix chown leaking delalloc quota blocks when fssetxattr fails
commit 1aecf3734a upstream.

Fix CVE:CVE-2022-27672

While refactoring the quota code to create a function to allocate inode
change transactions, I noticed that xfs_qm_vop_chown_reserve does more
than just make reservations: it also *modifies* the incore counts
directly to handle the owner id change for the delalloc blocks.

I then observed that the fssetxattr code continues validating input
arguments after making the quota reservation but before dirtying the
transaction.  If the routine decides to error out, it fails to undo the
accounting switch!  This leads to incorrect quota reservation and
failure down the line.

We can fix this by making the reservation function do only that -- for
the new dquot, it reserves ondisk and delalloc blocks to the
transaction, and the old dquot hangs on to its incore reservation for
now.  Once we actually switch the dquots, we can then update the incore
reservations because we've dirtied the transaction and it's too late to
turn back now.

No fixes tag because this has been broken since the start of git.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Aurelianliu <aurelianliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-28 21:43:54 +08:00